
The regf hive file format, in practice


Most analysts treat hives like opaque blobs that some tool turns into a tree. That works until it doesn't. The day you have a partially corrupted hive, or a hive that opens fine in one tool and silently drops half the keys in another, you need to know what is actually in the file. The regf format is not complicated. It is just unforgiving.

The 4096-byte base block

A regf hive starts with a single 4096-byte block. The first four bytes are the magic regf. After that you get a pair of sequence numbers (Primary and Secondary), a FILETIME for the last write, version fields (major and minor; Windows 10+ is 1.5 in most cases), file type, file format, and the offset of the root key cell. The block also carries a 64-byte UTF-16LE name field (32 characters, holding the tail of the hive's path), and at offset 508 an XOR checksum across the first 508 bytes.
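To make the layout concrete, here is a minimal sketch of a base-block reader. The field offsets follow the commonly circulated reverse-engineered layout rather than anything stated above, so treat them as an assumption to check against your own reference; the names BaseBlock and parse_base_block are invented for this example.

```python
import struct
from typing import NamedTuple

# Assumed layout: magic, two sequence numbers, FILETIME, major, minor,
# file type, file format, root cell offset, hive bins data size,
# clustering factor, then a 64-byte UTF-16LE name field.
_HEADER = struct.Struct("<4sIIQIIIIIII64s")


class BaseBlock(NamedTuple):
    primary_seq: int
    secondary_seq: int
    last_written: int       # raw FILETIME (100 ns ticks since 1601-01-01 UTC)
    major: int
    minor: int
    file_type: int
    file_format: int
    root_cell_offset: int   # relative to the start of the hive bins data
    hbins_size: int
    file_name: str
    checksum: int           # stored value at offset 508, not verified here


def parse_base_block(buf: bytes) -> BaseBlock:
    if len(buf) < 512:
        raise ValueError("need at least the first 512 bytes of the hive")
    (magic, pseq, sseq, ts, major, minor, ftype, fformat,
     root, hbins_size, _cluster, raw_name) = _HEADER.unpack_from(buf, 0)
    if magic != b"regf":
        raise ValueError("not a regf hive: bad magic %r" % magic)
    # The name field is NUL-padded; keep only the part before the first NUL.
    name = raw_name.decode("utf-16-le", errors="replace").split("\x00", 1)[0]
    (checksum,) = struct.unpack_from("<I", buf, 508)
    return BaseBlock(pseq, sseq, ts, major, minor, ftype, fformat,
                     root, hbins_size, name, checksum)
```

The reader deliberately does no validation beyond the magic. Checksum and sequence-number checks are separate decisions, and keeping them out of the parser lets you still inspect a damaged header.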

Two things matter operationally:

The sequence number pair is how the kernel tracks whether the hive was cleanly shut down. If Primary does not match Secondary, the hive was not flushed cleanly, and the transaction logs need to be replayed before the file is consistent. Tools that skip log replay will give you a partial view of the data and will not warn you about it.

The checksum is the first thing to verify on any hive you did not personally acquire. A hive with a busted base-block checksum has been tampered with, truncated, or corrupted in transit. None of those are good news.
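Both checks fit in a few lines. The sketch below assumes the checksum is a 32-bit little-endian XOR over the first 508 bytes, stored at offset 508, with the sequence numbers at offsets 4 and 8. The two special cases (0xFFFFFFFF stored as 0xFFFFFFFE, 0 stored as 1) come from community documentation of the format, not from this article, and the function names are made up for illustration.

```python
import struct


def base_block_checksum(base_block: bytes) -> int:
    """XOR the first 508 bytes as 127 little-endian 32-bit words."""
    if len(base_block) < 512:
        raise ValueError("need at least the first 512 bytes of the base block")
    checksum = 0
    for (word,) in struct.iter_unpack("<I", base_block[:508]):
        checksum ^= word
    # Assumed special cases, per community notes on the format.
    if checksum == 0xFFFFFFFF:
        return 0xFFFFFFFE
    if checksum == 0:
        return 1
    return checksum


def triage_base_block(base_block: bytes) -> dict:
    """Report whether the checksum matches and whether log replay is needed."""
    computed = base_block_checksum(base_block)
    primary, secondary = struct.unpack_from("<II", base_block, 4)
    (stored,) = struct.unpack_from("<I", base_block, 508)
    return {
        "checksum_ok": stored == computed,
        "needs_log_replay": primary != secondary,
    }
```

Running this before any tree parsing is cheap, and it tells you up front whether the rest of your output deserves an asterisk.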

HBIN blocks: the allocation containers

After the base block, the file is a sequence of hbin blocks, each starting with the four-byte magic hbin followed by its offset (relative to the start of the hive data, not the file) and its size. HBINs are 4096 bytes or a multiple thereof. They are the registry's equivalent of memory pages.
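A rough sketch of walking the hbin sequence, assuming the header starts with the magic, a 32-bit offset, and a 32-bit size in that order. iter_hbins is a name chosen for this example. A stricter reader would also stop at the hive bins data size recorded in the base block instead of running to the end of the file.

```python
import struct
from typing import Iterator, Tuple

BASE_BLOCK_SIZE = 4096
HBIN_HEADER_SIZE = 32  # assumed header size; only the first 12 bytes are read


def iter_hbins(hive: bytes) -> Iterator[Tuple[int, int]]:
    """Yield (file_offset, size) for each hbin found after the base block."""
    pos = BASE_BLOCK_SIZE
    while pos + HBIN_HEADER_SIZE <= len(hive):
        magic, rel_offset, size = struct.unpack_from("<4sII", hive, pos)
        if magic != b"hbin":
            break  # trailing bytes or padding: stop instead of guessing
        # The stored offset is relative to the hive data, not the file.
        if rel_offset != pos - BASE_BLOCK_SIZE:
            raise ValueError("hbin at file offset %d claims offset %d" % (pos, rel_offset))
        if size == 0 or size % 4096 or pos + size > len(hive):
            raise ValueError("hbin at file offset %d has bad size %d" % (pos, size))
        yield pos, size
        pos += size
```

Checking the stored offset against the position you actually reached is a cheap way to notice truncation or a spliced file early.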

Inside each HBIN, the actual data lives in cells. A cell starts with a signed 32-bit size. If the size is negative, the cell is allocated. If positive, it is free. The absolute value of the size includes the size field itself. This is the single most common place where homegrown parsers go wrong: walking cells forward by reading the size as unsigned, missing the allocation flag, and then either skipping live data or marching off the end of the HBIN into the next one's header.
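Here is a minimal cell walker that avoids the mistakes just described: it reads the size as signed, takes the absolute value to step forward, and refuses to step past the end of the HBIN. The 32-byte header size and the 8-byte size alignment are assumptions taken from community documentation, and iter_cells is an invented name.

```python
import struct
from typing import Iterator, Tuple

HBIN_HEADER_SIZE = 32  # assumed size of the hbin header


def iter_cells(hbin: bytes) -> Iterator[Tuple[int, int, bool]]:
    """Yield (offset_in_hbin, size, allocated) for each cell in one hbin."""
    pos = HBIN_HEADER_SIZE
    end = len(hbin)
    while pos + 4 <= end:
        (raw,) = struct.unpack_from("<i", hbin, pos)  # signed on purpose
        size = abs(raw)
        # A zero, tiny, or misaligned size would loop forever or desynchronize
        # the walk, so fail loudly instead of continuing.
        if size < 8 or size % 8 or pos + size > end:
            raise ValueError("bad cell size %d at hbin offset %d" % (raw, pos))
        yield pos, size, raw < 0
        pos += size
```

The same loop is the starting point for carving: the cells reported with allocated set to False are where old records may still be sitting.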

Free cells are not zeroed. This is the basis for deleted-key recovery, which I'll cover in another post.

The cell types you actually care about

Inside the cells you get a small zoo of typed records. The signatures are two-byte ASCII tags you will recognize immediately once you have looked at a few hives in a hex editor:

  • nk: a key node. The name, parent pointer, subkey list pointer, value list pointer, security descriptor pointer, classname pointer, LastWrite FILETIME, and various counts. This is the spine of the tree.
  • vk: a value. Name, data type (REG_SZ, REG_DWORD, REG_BINARY, and friends), data length, and either inline data (for values 4 bytes or smaller) or an offset to a data cell.
  • sk: a security descriptor. Stored once per unique descriptor and shared across keys: each sk carries a reference count, and the sk cells are chained together in a linked list. The hive deduplicates these.
  • lh, lf, li, ri: subkey list types. lh is the modern default and is a list of (nk_offset, name_hash) pairs kept sorted by key name. lf is the older variant, which stores the first four characters of the name where lh stores a hash. li is a flat list of offsets. ri is an index-of-indexes used when there are too many subkeys to fit in a single list. Real hives mix these depending on how the subkey tree grew.
  • db: big data. When a value's payload exceeds 16344 bytes, it gets split into a chain of segments and stored under a db record. Easy to miss if you assume all value data is in one cell.

A parser that handles nk, vk, sk, and lh covers maybe 95% of real-world hives. A parser that also handles lf, li, ri, and db is the one you actually want.
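The inline-data rule for vk records is easy to get wrong, so here is a small sketch of it. It assumes the usual layout (name length at offset 2, data size at 4, data offset at 8, data type at 12) and the convention that the high bit of the data size marks inline data. Both are assumptions drawn from community documentation, and the names ValueData and locate_vk_data are hypothetical.

```python
import struct
from typing import NamedTuple, Optional

INLINE_FLAG = 0x80000000  # assumed: high bit of the data size means "inline"


class ValueData(NamedTuple):
    data_type: int
    length: int
    inline: Optional[bytes]     # set when the data lives in the vk itself
    cell_offset: Optional[int]  # set when the data lives in another cell


def locate_vk_data(vk: bytes) -> ValueData:
    """vk is the record payload, starting at the 'vk' signature."""
    sig, _name_len, raw_size, raw_offset, data_type = struct.unpack_from("<2sHI4sI", vk, 0)
    if sig != b"vk":
        raise ValueError("not a vk record")
    if raw_size & INLINE_FLAG:
        # The four bytes of the offset field hold the data itself.
        length = min(raw_size & 0x7FFFFFFF, 4)
        return ValueData(data_type, length, raw_offset[:length], None)
    return ValueData(data_type, raw_size, None, int.from_bytes(raw_offset, "little"))
```

A caller still has to check whether the referenced cell is a db record before reading a large payload; that step is left out here.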

The key node tree

The root key is at the offset listed in the base block. From there it is a straightforward tree walk: read the root nk, follow its subkey list pointer, dereference each entry to get the child nks, recurse. Values hang off each nk via the value list pointer, which points at a cell containing an array of vk offsets.
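The walk can be sketched in well under a page. The nk field offsets used here (flags at 2, subkey count at 20, subkey list pointer at 28, name length at 72, name at 76) and the list element sizes are assumptions from community documentation, and every function name is invented for the example. It yields key paths only, ignores values, and has no protection against cycles in a corrupted hive.

```python
import struct
from typing import Iterator

HBINS_START = 4096        # cell offsets are relative to the end of the base block
NO_OFFSET = 0xFFFFFFFF    # assumed "no such cell" marker
KEY_COMP_NAME = 0x0020    # assumed flag: name stored as single-byte characters


def cell_payload(hive: bytes, offset: int) -> bytes:
    """Return the bytes of a cell after its 4-byte size field."""
    pos = HBINS_START + offset
    (raw,) = struct.unpack_from("<i", hive, pos)
    return hive[pos + 4:pos + abs(raw)]


def subkey_offsets(hive: bytes, list_offset: int) -> Iterator[int]:
    data = cell_payload(hive, list_offset)
    sig, count = struct.unpack_from("<2sH", data, 0)
    if sig in (b"lf", b"lh"):      # 8-byte elements: offset, then hint or hash
        for i in range(count):
            yield struct.unpack_from("<I", data, 4 + 8 * i)[0]
    elif sig == b"li":             # 4-byte elements: offset only
        for i in range(count):
            yield struct.unpack_from("<I", data, 4 + 4 * i)[0]
    elif sig == b"ri":             # offsets of further lists
        for i in range(count):
            yield from subkey_offsets(hive, struct.unpack_from("<I", data, 4 + 4 * i)[0])
    else:
        raise ValueError("unknown subkey list type %r" % sig)


def walk(hive: bytes, nk_offset: int, parent: str = "") -> Iterator[str]:
    nk = cell_payload(hive, nk_offset)
    if nk[:2] != b"nk":
        raise ValueError("expected nk at offset %d" % nk_offset)
    (flags,) = struct.unpack_from("<H", nk, 2)
    (n_subkeys,) = struct.unpack_from("<I", nk, 20)
    (list_offset,) = struct.unpack_from("<I", nk, 28)
    (name_len,) = struct.unpack_from("<H", nk, 72)
    raw = nk[76:76 + name_len]
    name = raw.decode("latin-1") if flags & KEY_COMP_NAME else raw.decode("utf-16-le", "replace")
    path = f"{parent}\\{name}" if parent else name
    yield path
    if n_subkeys and list_offset != NO_OFFSET:
        for child in subkey_offsets(hive, list_offset):
            yield from walk(hive, child, path)
```

Recursion keeps the sketch short. For untrusted hives an explicit stack plus a set of visited offsets is the safer shape.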

Two pitfalls:

Name encoding on nk and vk records can be either ASCII or UTF-16. There is a flag in the record header that says which. Tools that assume one or the other will mojibake the values they parse from hives with non-ASCII names. This is common in localized Windows installs.
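A sketch of the decode step, with the flag values as assumptions: 0x0020 on nk records and 0x0001 on vk records are the bits usually documented as marking a single-byte name. Decoding the single-byte form as Latin-1 is also a choice made for this example, since those names are not guaranteed to be pure ASCII.

```python
KEY_COMP_NAME = 0x0020    # assumed nk flag: name is single-byte
VALUE_COMP_NAME = 0x0001  # assumed vk flag: name is single-byte


def decode_name(raw: bytes, flags: int, comp_flag: int) -> str:
    """Decode a key or value name according to its record's flags.

    Pass KEY_COMP_NAME for nk records and VALUE_COMP_NAME for vk records.
    """
    if flags & comp_flag:
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")
    return raw.decode("utf-16-le", errors="replace")
```

For example, decode_name(name_bytes, nk_flags, KEY_COMP_NAME) handles a key name, and the same call with VALUE_COMP_NAME handles a value name.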

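The LastWrite timestamp on an nk is updated when the key itself is modified: a value is added, removed, or has its data changed, or a direct subkey is created or deleted. It is not updated when something changes deeper in the tree (a value written under a subkey touches only that subkey), and there is one timestamp per key, not per value, so it cannot tell you which value changed. This nuance bites people who try to use LastWrite as a fine-grained activity timeline.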
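Whatever you read into it, the raw LastWrite is a FILETIME: a 64-bit count of 100-nanosecond intervals since 1601-01-01 UTC. A minimal conversion, with the function name chosen for this example:

```python
from datetime import datetime, timedelta, timezone

_FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a FILETIME tick count to an aware UTC datetime.

    datetime only has microsecond resolution, so the last decimal digit
    of the 100 ns tick count is dropped.
    """
    return _FILETIME_EPOCH + timedelta(microseconds=filetime // 10)
```

Keeping the result timezone-aware avoids the classic mistake of comparing registry times in UTC against log times in local time.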

The security descriptor table

Every nk carries a pointer to an sk cell. The sk cells form a doubly-linked list (flink/blink) that the kernel walks to deduplicate descriptors. If you are looking at registry ACLs offline (and you should be, for any host where the attacker may have hidden keys via permissions), this is where the data lives. The descriptors themselves are standard SECURITY_DESCRIPTOR structures.

Practical use: dump the descriptor on HKLM\SYSTEM\CurrentControlSet\Services\<name> for any service that looks suspicious. If it restricts read access to SYSTEM only, treat that as a strong sign of intentional hiding. Stock Microsoft services rarely do that.
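Pulling the descriptor bytes out of an sk cell is the easy half of that job. The sketch below assumes the usual sk layout (flink at offset 4, blink at 8, reference count at 12, descriptor size at 16, descriptor from 20). SkRecord and parse_sk are names invented here, and interpreting the returned bytes as a SECURITY_DESCRIPTOR is left to whatever ACL parser you already trust.

```python
import struct
from typing import NamedTuple


class SkRecord(NamedTuple):
    flink: int         # offset of the next sk cell in the list
    blink: int         # offset of the previous sk cell in the list
    ref_count: int     # number of keys sharing this descriptor
    descriptor: bytes  # raw self-relative security descriptor


def parse_sk(payload: bytes) -> SkRecord:
    """payload is the cell content, starting at the 'sk' signature."""
    sig, _reserved, flink, blink, refs, size = struct.unpack_from("<2sHIIII", payload, 0)
    if sig != b"sk":
        raise ValueError("not an sk record")
    descriptor = payload[20:20 + size]
    if len(descriptor) != size:
        raise ValueError("sk descriptor is truncated")
    return SkRecord(flink, blink, refs, descriptor)
```

A reference count of 1 on a descriptor attached to a service key is worth a second look, since most keys share a handful of common descriptors.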

Transaction logs and the dual-LOG scheme

Each hive has two transaction log files, .LOG1 and .LOG2. Before Windows Vista there was only one, .LOG. Windows 8.1 later changed the log format to the incremental, entry-based one described below. The dual-log scheme exists so the kernel can write to one log while keeping the other intact, preventing a power loss during log write from killing both copies.

A log is a sequence of dirty-page entries. Each entry says "at offset N in the primary hive, the next M bytes should be this". Replay is conceptually trivial: apply the entries in order. In practice you also need to validate sequence numbers, handle the case where the primary hive has been written to since the log was generated, and detect torn writes.

The dirty-page entries can contain data that never appears in the main hive at all. If the system crashed between the log being written and the hive being flushed, replaying the log gives you records the on-disk hive has never contained.

If you ignore the logs, you are looking at a stale hive. Always replay.
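The core of replay can be sketched once the log has been parsed. This example takes entries that are already decoded into (sequence_number, pages) tuples, where each page is (offset_in_hive_data, bytes). Parsing and checksumming the log file itself, choosing the starting sequence number, and rewriting the base block afterwards are all left out. The function name and the entry shape are assumptions for illustration, not the on-disk format.

```python
from typing import Iterable, List, Tuple

BASE_BLOCK_SIZE = 4096

Page = Tuple[int, bytes]          # (offset relative to hive data, new bytes)
LogEntry = Tuple[int, List[Page]]  # (sequence number, dirty pages)


def replay(hive: bytes, entries: Iterable[LogEntry], first_seq: int) -> Tuple[bytes, int]:
    """Apply consecutive log entries to a copy of the hive.

    Returns the patched hive and the number of entries applied. Replay
    stops at the first entry whose sequence number is not the expected
    one, which is how a stale or torn tail is kept out of the result.
    """
    out = bytearray(hive)
    expected = first_seq
    applied = 0
    for seq, pages in entries:
        if seq != expected:
            break
        for offset, data in pages:
            pos = BASE_BLOCK_SIZE + offset
            if pos + len(data) > len(out):
                # The log can describe pages past the current end of file.
                out.extend(bytes(pos + len(data) - len(out)))
            out[pos:pos + len(data)] = data
        expected += 1
        applied += 1
    return bytes(out), applied
```

Working on a copy matters: you want the original hive untouched so you can diff before and after replay and see exactly what the log contributed.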

Why this is harder than people assume

The format itself is documented. Maxim Suhanov's reverse-engineered notes are excellent. The problem is the volume of edge cases:

  • Some values claim a data length that exceeds the cell that holds them, and the spec says the parser should truncate. Different tools handle this differently.
  • AmCache is a regf hive but Microsoft uses it slightly differently, with minor format quirks that some parsers do not handle.
  • Hives saved with reg save versus copied with esentutl /vss versus extracted from a memory dump are not byte-identical, and minor differences (extra trailing bytes, slightly different header values) can confuse stricter parsers.

The mature tools handle this. The freshly written Python script in someone's blog post often does not.

Tools worth knowing

  • yarp by Maxim Suhanov. The lowest-level library and the closest to a reference implementation. If yarp complains, listen.
  • libregf by Joachim Metz. The C library that backs most forensic tools. Mature, fast, conservative.
  • RegRipper by Harlan Carvey. The plugin layer on top of Parse::Win32Registry. How most analysts actually consume hives.

Cross-validating between two of these on any hive that matters is cheap insurance. If yarp and RegRipper agree on the key count and the LastWrite times, you can be reasonably confident in the data. If they disagree, you have a story to chase.
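The comparison itself needs nothing tool-specific. Assuming you have exported each tool's view as a mapping of key path to LastWrite (how you get that export depends on the tool and is not shown), a sketch of the diff looks like this. diff_key_listings is a hypothetical helper, not part of any of the tools above.

```python
from typing import Dict, List, Tuple


def diff_key_listings(a: Dict[str, int], b: Dict[str, int]) -> Tuple[List[str], List[str], List[str]]:
    """Compare two {key_path: last_write} mappings.

    Returns (paths only in a, paths only in b, paths whose timestamps differ).
    Paths are compared case-insensitively because registry key names are.
    """
    norm_a = {path.lower(): ts for path, ts in a.items()}
    norm_b = {path.lower(): ts for path, ts in b.items()}
    only_a = sorted(norm_a.keys() - norm_b.keys())
    only_b = sorted(norm_b.keys() - norm_a.keys())
    mismatched = sorted(p for p in norm_a.keys() & norm_b.keys() if norm_a[p] != norm_b[p])
    return only_a, only_b, mismatched
```

Three empty lists are the boring, desirable outcome. Anything else is the list of keys to go look at in a hex editor.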

Further reading

A hive file is not a black box. Spend an afternoon with a hex editor and the spec above, and the next time a tool gives you results that look wrong, you will be able to tell whether the tool is wrong or the hive is.