Big-data records: how the registry stores values larger than 16 KB

Most registry values are tiny: a DWORD, a short string, a path. The format is built around that assumption, and parsers walk single-cell value data without thinking. Then you hit a REG_BINARY blob that is 200 KB, or a multi-string list that ran long, and the single-cell assumption breaks. The registry's answer to large values is the big-data record (the db cell), and it is exactly the kind of edge case a hand-rolled parser quietly gets wrong. A parser that does not implement the db record does not error out. It hands you the first 16 KB and moves on, and you never know the rest existed.

This post covers that mechanism: why the cap exists, what a big-data record looks like on disk, how the segments reassemble, and what changes when you recover a deleted value. It assumes you know the shape of a hive; if not, start with the regf format overview and the vk value record writeup, then come back.

Why a single cell is not enough

A hive is a sequence of bins (the hbin blocks), and inside each bin the data lives in cells. A cell is the unit of allocation: a signed 32-bit size field followed by the payload. Bins are 4096 bytes or a multiple thereof, and a single cell has to fit inside a bin — already a hard ceiling on how much one cell can carry. But the registry imposes a tighter, more specific limit on value data before it gets near the bin boundary.

The practical threshold is 16,344 bytes. A value at or below that size is stored the ordinary way: the vk record points at a single data cell (or, for four bytes or fewer, stores the data inline in the vk itself). A value whose data exceeds 16,344 bytes cannot use that path. Instead, the vk record points at a big-data record, and the data is split into segments.
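The three storage paths can be sketched as a size check. This is a minimal illustration, not a parser: the function name and the string labels are hypothetical, and it assumes the length has already been read from the vk record. A real parser should confirm the choice against the on-disk structures (for example, the db signature) rather than trusting length alone.

```python
SEGMENT_MAX = 16344  # largest value stored in a single data cell, per the text


def storage_kind(data_length: int) -> str:
    """Classify how a value of this length is stored (illustrative labels)."""
    if data_length < 0:
        raise ValueError("data length cannot be negative")
    if data_length <= 4:
        return "inline"        # data lives in the vk record itself
    if data_length <= SEGMENT_MAX:
        return "single-cell"   # vk points straight at one data cell
    return "big-data"          # vk points at a db record


for length in (4, 260, 16344, 16345, 200 * 1024):
    print(length, storage_kind(length))
```

The boundary cases are the ones worth testing: 16,344 bytes still takes the ordinary path, and 16,345 does not.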

The 16,344 figure is the number a correct parser keys off: the maximum amount of value data the registry stores in one segment cell. libregf states it plainly — segment data is stored directly in a bin cell, and the Windows implementation ignores anything beyond 16,344 bytes in a single segment. If you see a slightly different number elsewhere (the interplay of cell size, cell header, and bin alignment invites off-by-a-few confusion), treat 16,344 as the operative value and the surrounding arithmetic as its explanation, not a competing figure.

The db record on disk

When a value goes big, the vk record's data-offset field no longer points at the value's bytes. It points at a small structure — the big-data record — that acts as an index. Per the libregf notes, that structure is 12 bytes:

  • Offset 0x00 (2 bytes): the signature, the two ASCII bytes db.
  • Offset 0x02 (2 bytes): the number of segments.
  • Offset 0x04 (4 bytes): the offset to the segment list, relative to the start of the bin data area (the same relative-offset convention every other pointer in the hive uses).
  • Offset 0x08 (4 bytes): padding/alignment, which — like a lot of registry padding — is not reliably zeroed and can carry remnant bytes.

The segment list it points at is an array of 4-byte offsets, one per segment, again relative to the bin data area. Each points at a cell whose payload is up to 16,344 bytes of the actual value data. So the indirection is: vk record → db record → segment-offset list → N segment cells. Three hops instead of one, and a parser that knows only the one-hop case reads the db record's first bytes as value data and returns garbage or a truncated read.
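Following the first two hops might look like the sketch below. It rests on some assumptions: hive_data is a buffer that begins at the start of the bin data area, so offsets index into it directly; every offset points at a cell's 4-byte size field with the payload immediately after; and all fields are little-endian. The helper names cell_payload and parse_db_record are hypothetical, not taken from any library.

```python
import struct


def cell_payload(hive_data: bytes, offset: int) -> bytes:
    """Return the payload of the cell at `offset` (size field stripped)."""
    (raw_size,) = struct.unpack_from("<i", hive_data, offset)
    size = abs(raw_size)  # sign marks allocated vs. free; magnitude is the size
    return hive_data[offset + 4 : offset + size]


def parse_db_record(hive_data: bytes, offset: int) -> list[int]:
    """Read a db record and return the segment offsets it indexes."""
    payload = cell_payload(hive_data, offset)
    # 2-byte signature, 2-byte segment count, 4-byte list offset.
    signature, count, list_offset = struct.unpack_from("<2sHI", payload, 0)
    if signature != b"db":
        raise ValueError(f"not a db record: signature {signature!r}")
    list_payload = cell_payload(hive_data, list_offset)
    # The list is `count` 4-byte offsets, one per segment cell.
    return list(struct.unpack_from(f"<{count}I", list_payload, 0))
```

The padding field at 0x08 is deliberately never read, since it may hold remnant bytes. A production parser would also bounds-check every offset before dereferencing it.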

A worked layout, for a 40 KB value:

vk record
  data length: 40960
  data offset: ----> db record
                       signature:  "db"
                       segments:   3
                       list offset:----> segment list
                                            [0] ----> cell: 16344 bytes
                                            [1] ----> cell: 16344 bytes
                                            [2] ----> cell:  8272 bytes

Three segments: two full 16,344-byte cells and a final partial cell holding the remainder. 16344 + 16344 + 8272 = 40960, which matches the vk record's declared data length. That arithmetic check is the parser's friend.
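The same arithmetic as a few lines of Python, useful as a sanity check on the segment count a db record claims. The function name is hypothetical; the only input is the declared data length.

```python
SEGMENT_MAX = 16344  # per-segment data limit from the text


def expected_segment_sizes(data_length: int) -> list[int]:
    """Sizes of the segments a value of this length should split into."""
    full, rest = divmod(data_length, SEGMENT_MAX)
    sizes = [SEGMENT_MAX] * full
    if rest:
        sizes.append(rest)  # final partial segment
    return sizes


sizes = expected_segment_sizes(40960)
print(sizes)       # [16344, 16344, 8272]
print(sum(sizes))  # 40960
```

If the segment count stored in the db record differs from len(sizes) for the declared length, something is off: either the record is damaged or the parser read the wrong structure.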

Reassembly

Reassembly is concatenation, in list order, and nothing more. Walk the segment-offset list from index 0 to N-1, follow each offset to its cell, take the cell's data, append. The segments are ordered; you do not sort or deduplicate them, you read them in the sequence the list gives.

There is one subtlety on the last segment. Every segment except the last carries a full 16,344 bytes; the final segment carries whatever is left, that is, the declared length minus 16,344 × (N-1). That equals the declared length modulo 16,344 except when the length is an exact multiple, in which case the last segment is full as well. A robust parser does not trust the final cell's allocation size to say how many bytes are real. It uses the vk record's data length as the authority and stops appending once it has accumulated that many bytes. The cell backing the last segment is a normal allocation and may be larger than the bytes that belong to the value, the slack being whatever the allocator left behind. Read to the declared length, not to the end of the cell.
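A minimal sketch of that rule, assuming the segment cells have already been dereferenced into a list of payloads in list order. The name reassemble is hypothetical. Each payload is trimmed to 16,344 bytes and the total to the declared length, so allocator slack never leaks into the value.

```python
SEGMENT_MAX = 16344  # per-segment data limit from the text


def reassemble(segment_payloads: list[bytes], declared_length: int) -> bytes:
    """Concatenate segments in list order, stopping at the declared length."""
    out = bytearray()
    for payload in segment_payloads:
        remaining = declared_length - len(out)
        if remaining <= 0:
            break
        # Take at most one segment's worth, and never more than is still owed.
        out += payload[: min(remaining, SEGMENT_MAX)]
    if len(out) != declared_length:
        raise ValueError(
            f"reassembled {len(out)} bytes, vk record declares {declared_length}"
        )
    return bytes(out)


# Last cell carries 8 bytes of slack that must not end up in the value.
segments = [b"A" * 16344, b"B" * 16344, b"C" * 8272 + b"\xff" * 8]
value = reassemble(segments, 40960)
print(len(value))  # 40960
```

Raising on a short result is a design choice for this sketch; a forensic tool might prefer to return the partial bytes along with a flag.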

With segments of up to 16,344 bytes and a 16-bit segment count, the format can in principle express values of up to 65,535 × 16,344 = 1,071,104,040 bytes, just under 1 GiB. You will not see values that large in a healthy hive, but the headroom is real, and it is why the mechanism exists rather than the format simply capping value size at one bin.

Which Windows introduced it

Big-data support arrived with a bump in the hive's minor format version, in the Windows XP development era — the Whistler betas, in a revision that predates the modern 1.5 you see on Windows 10 and 11. Hives written by NT 4.0 and Windows 2000 used earlier minor versions without the db mechanism: on those formats a single value was bounded by what a cell could hold, so you simply did not get multi-hundred-KB values. From XP onward — which is every hive that matters in practice — big-data records are in play.

Be cautious about pinning the exact minor version and build: the public format docs and reverse-engineered notes agree on the XP/Whistler timeframe and the version-bump rationale, but phrase the precise revision number slightly differently. The operative fact is simpler: on any modern hive, assume db records can appear, and handle them.

Forensic and parsing implications

Silent truncation is the headline failure. A parser that treats the db record's offset as a normal data pointer reads the 12-byte db structure (or the first segment) and returns that as the value. No exception, no warning, just a value that is short. For a REG_SZ this looks like a mangled string; for a REG_BINARY it looks like a perfectly plausible smaller blob, and you cannot tell it is incomplete unless you cross-check the vk record's declared data length against the bytes you reconstructed. Make that check explicit. If the declared length exceeds what a single cell can hold and your reconstruction came back short, you missed a db record.
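One way to make that check explicit is sketched below. It assumes declared_length is the vk record's data length with any flag bits already masked off, and the function name and result labels are hypothetical.

```python
SEGMENT_MAX = 16344  # largest value a single data cell holds, per the text


def audit_value(declared_length: int, data: bytes) -> str:
    """Compare reconstructed bytes with the vk record's declared length."""
    if len(data) == declared_length:
        return "ok"
    if declared_length > SEGMENT_MAX and len(data) < declared_length:
        # Too big for one cell, and we came back short: the db path was missed.
        return "missed-db-record"
    return "length-mismatch"


print(audit_value(40960, b"\x00" * 40960))  # ok
print(audit_value(40960, b"\x00" * 16344))  # missed-db-record
print(audit_value(100, b"\x00" * 96))       # length-mismatch
```

Run it on every value, not just the ones that look large; the cost is one comparison, and it turns a silent failure into a logged one.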

Large values are exactly the interesting ones. Stored credentials and cached secrets, serialized policy blobs, certificate and key material, malware configuration packed into a REG_BINARY, large Shellbags-style structures, and the occasional whole-file-in-the-registry trick attackers use to live off the land — these are the payloads big enough to trip the threshold. A truncating parser fails precisely on the data you most wanted intact.

Deleted big values multiply the recovery problem. Recovering a deleted ordinary value means recovering one cell from unallocated space. Recovering a deleted big value means recovering the db record and every segment cell it referenced and the segment-offset list — and they are not necessarily contiguous, since the allocator placed them wherever it had room. If any one segment cell has been overwritten by a later allocation, the reconstruction has a hole at a known position: you know which segment index is missing and therefore which 16,344-byte window of the value is gone. That is more recoverable than a blind gap, but only if your recovery logic understands the db indirection well enough to look for N fragments rather than one.
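The "hole at a known position" idea can be sketched as follows, assuming recovery has produced one entry per segment index, with None where the cell was overwritten. The function name is hypothetical.

```python
from typing import Optional

SEGMENT_MAX = 16344  # per-segment data limit from the text


def missing_windows(
    declared_length: int, recovered: list[Optional[bytes]]
) -> list[tuple[int, int, int]]:
    """Return (segment index, start, end) for each unrecovered segment."""
    gaps = []
    for index, payload in enumerate(recovered):
        start = index * SEGMENT_MAX
        end = min(start + SEGMENT_MAX, declared_length)
        if payload is None and start < end:
            gaps.append((index, start, end))
    return gaps


# Segment 1 of a 40,960-byte value was overwritten after deletion.
print(missing_windows(40960, [b"...", None, b"..."]))  # [(1, 16344, 32688)]
```

Reporting the byte ranges alongside the partial value lets an analyst see exactly which part of the blob is trustworthy and which is absent.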

Cross-validate. Same advice as for the rest of the format: if two mature parsers agree on a large value's reconstructed length and bytes, trust it. libregf and yarp both implement the db path correctly; hivex handles it; RegRipper consumes it through its parsing layer. Be suspicious of a quick script that walked vk records and dereferenced data offsets with no branch for the db signature.

To check whether a value is backed by a db record without writing code, parse a hive in your browser and open a large REG_BINARY: a correct reconstruction reports the full declared length and reassembles every segment, with nothing leaving the page.

Closing

Big-data records are a small mechanism with an outsized failure mode. The structure is trivial (a db signature, a segment count, an offset to a list of segment cells), and reassembly is ordered concatenation up to the value's declared length. But ignoring it does not crash; it silently truncates large values, and deleted-value recovery hands back the first fragment as if it were the whole thing. Implement the db path, check declared length against reconstructed length on every value, and treat the 16,344-byte threshold as the line where ordinary parsing stops being enough. For the rest of the format, see Windows registry internals.

Further reading

  • Google Project Zero, The Windows Registry Adventure #5: The regf file format: the deepest public writeup of the on-disk format, big-data records included.
  • Joachim Metz, libregf: the format documentation written from a parser implementor's point of view, with the db cell layout and the 16,344-byte segment limit spelled out.
  • The hivex library: a second independent implementation worth diffing against when a reconstruction looks wrong.