The regf base block: inside the Windows registry hive header

Every regf hive opens with a structure called the base block — the hive header. It is the first thing the Windows kernel reads when it mounts a hive, and it should be the first thing your parser reads too. The base block tells you what the file is, whether it was shut down cleanly, where the tree starts, and whether the header has been corrupted. Get it wrong and everything downstream is built on sand. This post takes the regf base block apart field by field and explains what each one means when you are recovering or validating a registry hive. It goes deeper than the regf hive format overview; read that first if you want the wider tour of HBINs and cells, then come back here for the header itself.

A 4 KB block where only the first 512 bytes matter

The base block occupies a full 4,096-byte block at the very start of the file. That size is fixed and it is not a free choice: the registry is paged, every allocation unit is a multiple of 4 KB, and the header reserves one whole page for itself. But almost all of those 4,096 bytes are zero. The fields that carry meaning sit in roughly the first 240 bytes, and everything the kernel actually validates lives in the first 512.

The split worth internalising is 508 + 4. The checksum at the end of the meaningful header covers the 508 bytes before it; the checksum field itself is the four bytes that bring the total to 512. The remaining ~3,584 bytes of the page are padding — reserved space, mostly zeros, occasionally carrying a few extra fields on newer formats but never anything the parser must have to walk the tree. So when people say "the 512-byte header" and "the 4 KB header" they are both right: 4 KB on disk, 512 bytes of substance, 508 bytes under the checksum.
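To make the 508 + 4 split concrete, here is a minimal Python sketch that slices a base block into the three regions described above. The constant names and the split_base_block helper are hypothetical, chosen for illustration only.

```python
BASE_BLOCK_SIZE = 4096   # one full page on disk
VALIDATED_SIZE = 512     # the part that carries substance
CHECKSUM_OFFSET = 508    # bytes 0..507 sit under the checksum


def split_base_block(block):
    """Return (covered, checksum_field, padding) slices of a 4 KB base block.

    Hypothetical helper: the names are illustrative, not from any library.
    """
    if len(block) < BASE_BLOCK_SIZE:
        raise ValueError("need at least 4096 bytes for a base block")
    covered = block[:CHECKSUM_OFFSET]                        # 508 bytes
    checksum_field = block[CHECKSUM_OFFSET:VALIDATED_SIZE]   # 4 bytes
    padding = block[VALIDATED_SIZE:BASE_BLOCK_SIZE]          # 3,584 bytes
    return covered, checksum_field, padding
```

The three slices always have lengths 508, 4, and 3,584, whatever the block contains.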

The signature at offset 0

The first four bytes are the ASCII string regf — 72 65 67 66. That is the file magic, and it is where the format gets its name. If those four bytes are anything else, you are not looking at a hive, or you are looking at a hive whose header has been clobbered. There is no recovery path that begins with a wrong signature; you stop and report it.

This is also the cheapest sanity check in the entire format, which is exactly why it should be the first one you run. When you are carving hives out of unallocated space, memory dumps, or VSS deltas, the regf magic at a 4 KB-aligned offset is the anchor you scan for.
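A carving pass built on that anchor might look like the sketch below. It assumes the data is already in memory as a bytes object and that candidates sit on 4 KB boundaries; find_regf_candidates is a hypothetical name, and a hit is only a candidate until the rest of the header checks out.

```python
REGF_MAGIC = b"regf"   # 72 65 67 66
PAGE = 4096


def find_regf_candidates(buf, alignment=PAGE):
    """Yield offsets in buf where the regf magic sits on an aligned boundary.

    Hypothetical helper for illustration; a real carver would stream the
    image rather than hold it all in memory.
    """
    last_start = len(buf) - len(REGF_MAGIC)
    for off in range(0, last_start + 1, alignment):
        if buf[off:off + 4] == REGF_MAGIC:
            yield off
```

Magic bytes that appear at an unaligned offset are skipped on purpose, which cuts down on false positives from registry data that happens to contain the string.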

Two sequence numbers and the dirty-hive question

Immediately after the signature come two 32-bit values: the primary sequence number at offset 4 and the secondary sequence number at offset 8. They exist for one purpose — to answer the question "was this hive flushed cleanly?"

The kernel increments the primary number before it begins writing dirty data to the hive, and increments the secondary number after the write completes. In a hive that was shut down cleanly, the two numbers are equal. If they differ, a write was in flight when the hive was last touched: the primary moved but the secondary never caught up. That is a dirty hive, and the pending changes live in the transaction logs (.LOG1 / .LOG2), not in the primary file.

Operationally this is the single most consequential field in the header. A mismatch is the kernel's own signal that the hive on disk is incomplete and the logs must be replayed before the data is consistent. A tool that ignores the sequence numbers and parses a dirty hive anyway will hand you a stale tree and will not warn you. If primary and secondary disagree, do not trust a single value until you have replayed the logs — see recovering deleted registry keys for what else lives in that recovery surface.
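The dirty-hive test itself is a few lines. This sketch reads the two little-endian DWORDs at offsets 4 and 8 and compares them; the function names are illustrative, not taken from any existing tool.

```python
import struct


def read_sequence_numbers(block):
    """Return (primary, secondary) from offsets 4 and 8 of the base block."""
    primary, secondary = struct.unpack_from("<II", block, 4)
    return primary, secondary


def is_dirty(block):
    """True when the two sequence numbers disagree, so the logs need replaying."""
    primary, secondary = read_sequence_numbers(block)
    return primary != secondary
```

A parser would call is_dirty before touching the tree and refuse, or at least warn loudly, when it returns True.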

The timestamp

At offset 12 sits an 8-byte Windows FILETIME (100-nanosecond ticks since 1601-01-01 UTC) recording the last time the hive was written. This is the hive-level last-write time, not to be confused with the per-key nk LastWrite timestamps inside the tree.

Treat it as a coarse acquisition marker. It tells you roughly when the hive last changed on disk, which is useful for sanity-checking that an image and the hive inside it line up in time, and for spotting a hive that was written long after you thought the system was off. It is not a fine-grained activity record — for that you go into the key nodes.
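Converting the FILETIME into something readable is a matter of counting 100-nanosecond ticks from 1601. A minimal sketch, with read_hive_timestamp as a hypothetical helper name; it truncates to microseconds because that is the resolution Python's datetime carries.

```python
import struct
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def read_hive_timestamp(block):
    """Decode the 8-byte FILETIME at offset 12 into an aware UTC datetime.

    Ticks are 100 ns each; integer division by 10 gives microseconds.
    A garbage value far in the future raises OverflowError, which is
    itself a useful signal on a damaged header.
    """
    (ticks,) = struct.unpack_from("<Q", block, 12)
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)
```

For example, a tick count of 116444736000000000 decodes to 1970-01-01 00:00:00 UTC, the Unix epoch.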

Major and minor version, type, and format

Three small fields describe what kind of hive this is and which format revision it uses.

The major version (offset 20) is 1 for every regf hive in existence. The minor version (offset 24) is where the real variation lives. Common values in the wild:

1.3 — volatile hives, BCD, and per-user classes hives.
1.5 — the standard for system hives (SYSTEM, SOFTWARE, SECURITY, SAM, DRIVERS) and NTUSER.DAT on modern Windows.
1.6 — differencing hives used by the containerized/virtualized registry.

The minor version matters because it gates which features the hive may use — the layered-key machinery in differencing hives, for example, does not appear in a 1.5 file. A parser that hard-codes assumptions for 1.5 and then meets a 1.6 hive will misread it.

The type field (offset 28) distinguishes a primary hive (the value here is 0) from the transaction-log and other variants. The format field (offset 32) records the storage format and is 1 (the standard "direct memory" layout) for the hives you will encounter. Neither field carries much investigative weight on its own, but a type or format value outside the known set is another tamper signal — real hives produced by Windows do not stray here.
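These four fields are contiguous DWORDs, so one unpack call reads them all. The sketch below flags anything outside the values named in this section; the set of minor versions is just the three listed above, not an exhaustive list, and the function and constant names are hypothetical.

```python
import struct

# Values taken from the text above; widen the sets if you handle more.
EXPECTED_MAJOR = 1
COMMON_MINORS = {3, 5, 6}
PRIMARY_TYPE = 0
DIRECT_MEMORY_FORMAT = 1


def read_format_fields(block):
    """Unpack the little-endian DWORDs at offsets 20, 24, 28 and 32."""
    return struct.unpack_from("<4I", block, 20)


def describe_oddities(block):
    """Return human-readable notes for any value outside the expected set."""
    major, minor, file_type, file_format = read_format_fields(block)
    notes = []
    if major != EXPECTED_MAJOR:
        notes.append(f"unexpected major version {major}")
    if minor not in COMMON_MINORS:
        notes.append(f"uncommon minor version {minor}")
    if file_type != PRIMARY_TYPE:
        notes.append(f"type {file_type} is not a primary hive")
    if file_format != DIRECT_MEMORY_FORMAT:
        notes.append(f"unexpected format {file_format}")
    return notes
```

Returning notes rather than raising keeps the decision with the caller: an uncommon minor version may simply mean a format your parser does not yet handle.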

The root cell offset

At offset 36 is a 32-bit value: the offset of the root key's nk cell. Like every offset in the format, it is measured from the start of the hive bins data — that is, from the byte immediately after the 4 KB base block — not from the start of the file. To turn it into a file offset you add 4,096.

This is the entry point to the entire tree. The kernel reads it, jumps to that nk, and the whole recursive walk of keys, subkey lists, and values unfolds from there. If the root cell offset is wrong, you have no tree, regardless of how intact the rest of the file is. On a damaged hive, checking that this offset falls inside the hive bins data, is 8-byte aligned like every cell, and lands on a cell holding a valid nk signature is part of deciding whether the header can be trusted.
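The offset arithmetic and the nk check can be sketched as follows. Two details here go beyond the text above and should be read as assumptions of this sketch: cells are taken to be 8-byte aligned, and each cell is taken to start with a 4-byte size field, so the two-byte nk signature sits 4 bytes past the cell start. The function names are hypothetical.

```python
import struct

BASE_BLOCK_SIZE = 4096


def root_cell_file_offset(block):
    """Turn the bins-relative root cell offset (at 36) into a file offset."""
    (root_offset,) = struct.unpack_from("<I", block, 36)
    return BASE_BLOCK_SIZE + root_offset


def root_cell_looks_valid(hive):
    """Check alignment, bounds, and the nk signature of the root cell.

    Assumes a 4-byte cell size field precedes the nk signature.
    """
    off = root_cell_file_offset(hive)
    if off % 8 != 0 or off + 6 > len(hive):
        return False
    return hive[off + 4:off + 6] == b"nk"
```

A root offset of 32, for instance, maps to file offset 4,128.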

The hive bins data size

At offset 40 is the length of the hive bins data — the total size of all the HBINs that follow the header. Crucially, this is not the file size. It is the file size minus the 4,096-byte base block. A 1 MB hive file carries a length of 1,044,480 (1,048,576 − 4,096) here.

This field is a structural check you should actually run. The number must be a multiple of 4 KB, and base block plus this length should equal the on-disk file size (allowing for the fact that some acquisition methods append slack). A length that disagrees with the file size means the file was truncated, padded, or assembled from pieces — exactly the kind of thing that happens when a hive is pulled badly from an image or recovered from fragments.
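As a sketch of that check, the function below reads the length at offset 40 and compares it with a file size you supply. The name check_bins_length and the wording of the notes are illustrative; trailing bytes are reported separately from truncation because appended slack is the more benign case.

```python
import struct

BASE_BLOCK_SIZE = 4096


def check_bins_length(block, file_size):
    """Return (bins_length, notes) for the hive bins data size at offset 40."""
    (bins_length,) = struct.unpack_from("<I", block, 40)
    notes = []
    if bins_length % 4096 != 0:
        notes.append("bins length is not a multiple of 4 KB")
    expected = BASE_BLOCK_SIZE + bins_length
    if file_size < expected:
        notes.append(f"file is {expected - file_size} bytes short: truncated")
    elif file_size > expected:
        notes.append(f"{file_size - expected} trailing bytes: slack or padding")
    return bins_length, notes
```

For the 1 MB example above, a header length of 1,044,480 against a file size of 1,048,576 produces no notes.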

The checksum: XOR of the first 508 bytes

The last meaningful field, at offset 0x1FC (508), is a 32-bit checksum. It is computed as the XOR of the 127 DWORDs that make up the preceding 508 bytes — the entire header from the signature up to but not including the checksum field. Simple, fast, and not cryptographic: it catches accidental corruption of the header, not deliberate forgery, because anyone editing the header can trivially recompute it.

Two edge cases the kernel enforces are worth knowing if you ever recompute a checksum yourself: a computed result of 0xFFFFFFFF is stored as 0xFFFFFFFE, and a computed result of 0x00000000 is stored as 0x00000001. The all-ones and all-zeros values are reserved as never-valid, so the writer nudges away from them.
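Put together, the checksum and its two edge cases fit in a short function. This is a minimal sketch following the description above, not code lifted from the kernel or from any parser; the function names are hypothetical.

```python
import struct

CHECKSUM_OFFSET = 508  # 0x1FC


def compute_checksum(block):
    """XOR the 127 little-endian DWORDs that precede the checksum field."""
    if len(block) < CHECKSUM_OFFSET:
        raise ValueError("need at least 508 bytes of header")
    checksum = 0
    for (dword,) in struct.iter_unpack("<I", block[:CHECKSUM_OFFSET]):
        checksum ^= dword
    # The two reserved results are nudged to their neighbours.
    if checksum == 0xFFFFFFFF:
        return 0xFFFFFFFE
    if checksum == 0x00000000:
        return 0x00000001
    return checksum


def checksum_matches(block):
    """Compare the computed value with the DWORD stored at offset 508."""
    (stored,) = struct.unpack_from("<I", block, CHECKSUM_OFFSET)
    return compute_checksum(block) == stored
```

An all-zero header therefore yields 1 rather than 0, which is why a zeroed page never passes as a valid base block.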

What a bad checksum means, in practice: the header has been altered or corrupted since it was last written cleanly. The Windows kernel treats this as fatal — a header that fails the checksum is rejected as a corrupt hive (STATUS_REGISTRY_CORRUPT) and none of the deeper parsing runs at all. For an analyst, a bad base-block checksum on a hive you did not acquire yourself is a red flag: the file was truncated, damaged in transit, or tampered with. None of those are good news, and all of them mean you should not present the parsed contents as authoritative without explaining the discrepancy.

Why you validate the base block first

Pulling all of this together, the base block is the gatekeeper, and a parser earns the right to walk the tree only after it has passed these checks in order:

Signature. Four bytes are regf, or you stop.
Sequence numbers. Primary equals secondary, or the hive is dirty and the logs must be replayed before you trust anything.
Checksum. The XOR over the first 508 bytes matches the stored value, or the header is corrupt and the file should not be trusted.
Version, type, format. Known values, or you are looking at something unusual — possibly tampered, possibly a format you do not yet handle.
Root cell offset and bins length. The offset points somewhere sane and the length is consistent with the file size, or the structure is damaged.

Skip these and you can still produce a tree from many hives, because real-world files are often forgiving enough to parse despite a problem. That is the trap. A tool that parses a dirty or corrupt hive without telling you is worse than one that refuses, because it gives you confident-looking output that is quietly wrong. The base block exists precisely so the kernel — and you — can refuse early. Honour it.
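The ordered checks listed above can be sketched as a single gatekeeper function. Treat it as an outline under stated assumptions rather than a finished validator: the accepted minor versions are only the three named earlier, the root offset test assumes 8-byte cell alignment, and validate_base_block and xor32 are hypothetical names.

```python
import struct

BASE_BLOCK_SIZE = 4096
CHECKSUM_OFFSET = 508


def xor32(header):
    """XOR of the first 127 DWORDs, with the two reserved values nudged."""
    value = 0
    for (dword,) in struct.iter_unpack("<I", header[:CHECKSUM_OFFSET]):
        value ^= dword
    if value == 0xFFFFFFFF:
        return 0xFFFFFFFE
    return value or 1


def validate_base_block(hive):
    """Return a list of findings; an empty list means every check passed."""
    if len(hive) < BASE_BLOCK_SIZE:
        return ["shorter than one 4 KB base block"]
    if hive[:4] != b"regf":
        return ["signature is not regf"]  # nothing else is worth checking
    findings = []
    primary, secondary = struct.unpack_from("<II", hive, 4)
    if primary != secondary:
        findings.append("dirty: sequence numbers differ, replay the logs")
    (stored,) = struct.unpack_from("<I", hive, CHECKSUM_OFFSET)
    if xor32(hive) != stored:
        findings.append("checksum mismatch: header is corrupt")
    # Offsets 20..43 hold six contiguous DWORDs.
    major, minor, ftype, fformat, root, bins_len = struct.unpack_from("<6I", hive, 20)
    if (major, ftype, fformat) != (1, 0, 1) or minor not in (3, 5, 6):
        findings.append("unusual version, type, or format")
    if root >= bins_len or root % 8:
        findings.append("root cell offset is out of range or misaligned")
    if bins_len % 4096 or BASE_BLOCK_SIZE + bins_len > len(hive):
        findings.append("bins length is inconsistent with the file size")
    return findings
```

Only a wrong signature stops the run early. Every other problem is collected, so the report shows everything that is wrong with the header at once instead of one failure per run.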

If you want to see these fields on a real file without a hex editor, you can parse a hive in your browser and the header is decoded for you, nothing uploaded. For the implementor's view of the same structure, libregf by Joachim Metz documents the base block from a parser author's perspective, and Red Hat's hivex carries an independent reading worth cross-checking against. For the wider context of how the base block fits the rest of registry internals, start from Windows registry internals.