Hive bins and cells: how a registry hive is allocated on disk
9 min read
A registry hive is not a tree on disk. It is a flat sequence of fixed-size blocks called hive bins, and inside those bins are variable-length chunks called cells. The tree you see in a registry editor is an abstraction the parser builds by chasing offsets from one cell to another. If you want to understand why deleted keys are recoverable, why a corrupted hive can still yield most of its data, or why homegrown parsers march off the end of a bin and into garbage, you have to understand the hbin-and-cell allocation layer underneath. The regf overview sketched this; this post goes down to the bytes.
This is the storage tier. Everything above it — the nk, vk, sk, and subkey-list records that make up the logical tree — lives inside cells, and the cells live inside hive bins. Get the allocation layer right and the rest is pointer-following. Get it wrong and nothing above it parses correctly.
The base block, then nothing but hbins
A hive file opens with a single 4096-byte base block — the regf header carrying sequence numbers, the last-write timestamp, the version, and the offset of the root key. After those 4096 bytes, the rest of the file is hive bins, one after another, with no padding or index between them. There is no allocation table, no file-allocation map, no directory of where things are. To find anything, you start at the first bin and walk.
This matters for one specific reason that trips up every first-time parser author: all offsets stored inside the hive are relative to the start of the hive bins data, not the start of the file. The hive bins data begins at file offset 4096, immediately after the base block. So when the base block says the root key is at offset 0x20, the root key is at file offset 0x1020 — that is 0x1000 (the base block) plus 0x20. Forget to add the 4096-byte base block and every pointer you dereference lands 4 KB too early. The symptom is a parser that reads plausible-looking but wrong data from the tail of the base block instead of real cells.
The hbin header
Each hive bin begins with a 32-byte header. The fields that matter:
- Offset
0x00(4 bytes): the signature, the ASCII byteshbin. Every bin starts with it. If you walk to where the next bin should be and do not findhbin, you have either miscounted a size or hit the end of the allocated bins. - Offset
0x04(4 bytes): the offset of this bin, relative to the start of the hive bins data. The first bin reports0x00, the next reports its own position, and so on. In practice this field is validated against the bin's actual position early in load and is effectively a constant — useful as a sanity check, not as a thing you compute from. - Offset
0x08(4 bytes): the size of the bin in bytes. This is always a multiple of 4096.
The remaining bytes through 0x1F are mostly zero in healthy hives — a reserved pair, a timestamp field, and a spare value that often mirrors the size. They frequently hold remnant data, which is worth a glance in an investigation but is not part of the live structure.
The smallest bin is 4096 bytes. Larger bins are multiples of that — 0x2000, 0x3000, and up — created when the registry needed a single allocation larger than one page could hold. Treat a hive bin as the registry's equivalent of a memory page: the unit in which the hive grows.
Cells: the allocation unit inside a bin
The 32-byte header is followed by cells, packed end to end with no gaps. A cell is the actual allocation unit — every key node, every value, every subkey list, every blob of value data lives inside one cell. The structure of a cell could not be simpler:
- Offset
0x00(4 bytes): the cell size, a signed 32-bit integer. - Offset
0x04onward: the cell payload.
The size includes its own 4-byte field, and it is always a multiple of 8 — cells are 8-byte aligned, so the three low bits of the size are always zero. The sum of all cell sizes in a bin equals the bin's size; there are no gaps and no slack between cells. That invariant is what lets you walk a bin blind: start right after the 32-byte header, read a size, advance by that size, read the next, and you will land exactly on the next cell boundary every time — until the running total equals the bin size, at which point you are at the next bin's hbin header.
The sign is the whole story
Here is the single insight that the entire recovery story hangs on. The cell size is a signed 32-bit integer, and its sign encodes allocation state:
- Negative size: the cell is allocated (in use, live data).
- Positive size: the cell is free (unallocated, available for reuse).
So an allocated cell of 128 bytes stores its size as 0xFFFFFF80 — that is -128 as a two's-complement signed integer. A free cell of 32 bytes stores 0x00000020 — positive +32. To get the cell's actual length in either case, take the absolute value.
This is the place where parsers written from memory go wrong most often. The author reads the size as an unsigned 32-bit integer, gets a number near 4 billion for every allocated cell (because the high bit is set), and either treats it as an absurd length or strips the sign without recording what the sign meant. Reading it unsigned and using it as a length will march the parser straight off the end of the bin into the next bin's header, where hbin looks like nonsense and the parse collapses. The fix is one line: read it signed, branch on the sign, advance by the absolute value. libregf and yarp both do exactly this, and cross-checking against either will catch the mistake immediately.
A walk over one bin, in full:
pos = bin_start + 32 # skip the 32-byte hbin header
while pos < bin_start + bin_size:
raw = read_int32_signed(pos) # the cell size field
size = abs(raw)
allocated = raw < 0 # negative = allocated, positive = free
payload = read(pos + 4, size - 4)
pos += size # always advance by the absolute size
Walk every bin this way and you have enumerated every cell in the hive — allocated and free alike. The logical tree is then built by following offsets between the allocated cells; the free cells are what most parsers throw away. They should not.
Why free cells are recoverable deleted data
When a key or value is deleted, the registry does not zero its cell. It flips the sign of the size field from negative to positive — allocated to free — and, where it can, merges the freed cell with adjacent free cells by extending the first one's size. The bytes of the deleted nk or vk record, its name, its timestamp, its value data, all remain sitting in the cell exactly as they were. The cell is simply marked available for the next allocation. Until something writes over it, the deleted record is fully intact in a positive-size cell.
That is the entire basis for recovering deleted registry keys and values. Walk the bins, collect the free cells (the positive-size ones), and parse their contents looking for the same nk and vk signatures you find in live cells. A free cell whose payload begins with nk is a deleted key node. Its parent pointer, its name, its LastWrite timestamp — all still there, all still parseable. You reconstruct deleted subtrees by matching recovered child nodes against recovered (or live) parents by offset.
The catches are the ones you would expect from any unallocated-space recovery:
- Overwrite. A free cell is fair game for the next allocation. The longer the hive stays in use after a deletion, the more likely the cell has been partially or fully reused, leaving a truncated or corrupted record.
- Merging. When the loader frees a cell, it may coalesce it with neighbors, and on load it will forcefully repair a bin whose cells do not sum correctly by creating one big free cell from the failing point to the end of the bin. That single free cell can contain the remnants of several deleted records jammed together, which you carve rather than parse cleanly.
- Dangling pointers. A recovered key may point at a parent or subkey list that has itself been overwritten. You get the node but not always its place in the tree.
None of this makes recovery unreliable — it makes it probabilistic, which is the normal state of unallocated-space forensics. A freshly deleted key in a hive pulled minutes later is almost always perfectly recoverable. See recovering deleted registry keys for the full carving workflow and the tools that do it.
What this buys you in practice
Understanding the allocation layer changes how you read tool output. When a parser reports a "free cell count" or surfaces deleted entries, you now know exactly what it did: it walked the bins, branched on the sign of each cell size, and parsed the positive-size ones. When a hive is partially corrupted, you know that a single bad bin does not necessarily doom the file — the bins are independent enough that a walk can often resynchronize at the next valid hbin signature and recover everything after the damage. And when two tools disagree on key count, the difference is frequently in how they treat free cells: one is showing you live data only, the other is including recovered records.
The allocation model is deliberately simple — fixed bins, signed cell sizes, no index — and that simplicity is why it has survived essentially unchanged for decades and why offline recovery works at all. A format with a real free list and zeroed-on-delete cells would be tidier and forensically useless.
Further reading
- Google Project Zero, The Windows Registry Adventure #5: The regf file format: the deep reverse-engineering walkthrough that inspired this series. Strong on the loader's validation behavior and the bin-repair logic.
- Joachim Metz, libregf: the REGF format documentation is written from the parser implementor's seat — exact field offsets for the hbin header and cell layout.
- Maxim Suhanov's yarp and his format specification: the closest thing to a reference implementation, with first-class recovery support.
- The broader Windows registry internals overview ties the bins-and-cells layer to the artifacts above it.
You can parse a hive in your browser and watch this in action — the tree you navigate is the allocated cells, and the recovery view is the positive-size ones nothing has overwritten yet.