Registry transaction logs: recovering a dirty hive
A registry hive on disk is rarely the whole story. The kernel does not write every change straight into the primary file; it writes to a journal first and flushes the primary lazily. So at any given instant the on-disk hive can be behind the truth, and the difference lives in the registry transaction logs sitting right next to it. If you acquire a hive while Windows is running, or pull one from an unclean shutdown, you have almost certainly grabbed a dirty hive: a primary file that needs its .LOG1/.LOG2 replayed before it represents the current state. Skip that step and you are analysing stale data without knowing it.
This post is about how Windows marks a hive dirty, how the logs let it recover, and why your tooling has to do the replay or quietly lie to you. It assumes you already know the regf on-disk format; if you want the recovery workflow — log replay plus unallocated-cell carving plus VSS diffing as one ensemble — that lives in the deleted-key recovery post. Here we go down a level, into the mechanism itself.
Two sequence numbers and the dirty bit
Open any primary hive in a hex editor. The first 4096 bytes are the base block, magic regf. Two little fields four bytes apart decide whether the file you are holding is trustworthy:
- The primary sequence number at offset 4.
- The secondary sequence number at offset 8.
The contract is simple and elegant. When the Configuration Manager begins a write operation against the primary file, it increments the primary sequence number first. When the write finishes and the data is durable, it increments the secondary to match. So in a cleanly shut-down hive the two numbers are equal. If they differ, a write started and did not complete: the machine lost power mid-flush, the hive was copied live while a write was in flight, or the file is otherwise torn. That is the canonical definition of a dirty hive. A wrong base-block checksum is the other trigger; either condition means the file requires recovery before use.
This is worth internalising because it is the cheapest integrity check you have. Read offsets 4 and 8, compare. Equal and the checksum is good: the hive is consistent on its own, though the logs may still hold newer data. Unequal: do not trust a single value in that file until you have replayed the logs. Plenty of "the data looked wrong" incidents trace back to an analyst reading a hive whose two sequence numbers never matched.
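A minimal sketch of that check in Python follows. The sequence-number offsets (4 and 8) are the ones given above. The checksum details are assumptions taken from public reverse-engineered documentation rather than from this post: an XOR of the first 508 bytes as 32-bit little-endian words, stored at offset 508, with two special-cased results. The function names are illustrative.

```python
import struct


def xor32_checksum(base_block: bytes) -> int:
    """XOR the first 508 bytes as 127 little-endian dwords (assumed algorithm)."""
    checksum = 0
    for (dword,) in struct.iter_unpack("<I", base_block[:508]):
        checksum ^= dword
    # Assumed special cases: these two results are never stored as-is.
    if checksum == 0xFFFFFFFF:
        return 0xFFFFFFFE
    if checksum == 0:
        return 1
    return checksum


def check_base_block(base_block: bytes) -> dict:
    """Report the two sequence numbers, checksum validity, and dirty state."""
    if len(base_block) < 512 or base_block[:4] != b"regf":
        raise ValueError("not a regf base block")
    # Primary sequence number at offset 4, secondary at offset 8.
    primary_seq, secondary_seq = struct.unpack_from("<II", base_block, 4)
    # Assumed location of the stored checksum: offset 508.
    (stored,) = struct.unpack_from("<I", base_block, 508)
    checksum_ok = stored == xor32_checksum(base_block)
    return {
        "primary_seq": primary_seq,
        "secondary_seq": secondary_seq,
        "checksum_ok": checksum_ok,
        # Dirty if the sequence numbers differ or the checksum is wrong.
        "dirty": primary_seq != secondary_seq or not checksum_ok,
    }
```

To use it, read the first 4096 bytes of the primary file in binary mode and pass them in. A result with `dirty` set to False says only that the primary is self-consistent; the logs may still hold newer data.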
The files alongside the hive
Each primary hive ships with companion log files that share its base name. For SYSTEM you get SYSTEM.LOG1 and SYSTEM.LOG2; for an NTUSER.DAT you get NTUSER.DAT.LOG1 and NTUSER.DAT.LOG2. Older systems, and hives loaded with the single-log flag, carry a lone .LOG. These logs are the journal: the kernel writes changes into them first, and only later does a flush push the accumulated changes back into the primary file. The primary lagging the log is the normal steady state, not an error condition.
This is exactly why a live acquisition produces a dirty hive so often. The system is actively writing; the journal is ahead of the primary by design. The primary and its logs get copied at slightly different moments, so the primary you walk away with is inconsistent on its own; it becomes usable only if you also took the logs and replay them. Which leads to the single most important operational point in this entire post, stated up front so nobody misses it:
Always acquire the .LOG1 and .LOG2 files together with the hive. A primary file without its logs is, for a dirty hive, an incomplete artifact. You cannot reconstruct the journal after the fact, and no amount of clever parsing recovers data that only ever existed in a log you left behind.
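One way to make that rule hard to forget is a pre-flight check in the collection script. The sketch below is a hypothetical helper, not part of any named tool: it looks next to a hive for the three companion names described above and reports which are present. The match is case-insensitive because evidence copied onto a case-sensitive filesystem does not always keep its original casing.

```python
from pathlib import Path

# Companion suffixes described above: the modern pair plus the older lone .LOG.
LOG_SUFFIXES = (".LOG1", ".LOG2", ".LOG")


def find_companion_logs(hive_path):
    """Map each expected log suffix to its Path, or None if it is missing."""
    hive = Path(hive_path)
    # Index sibling files by upper-cased name for a case-insensitive lookup.
    siblings = {p.name.upper(): p for p in hive.parent.iterdir() if p.is_file()}
    return {
        suffix: siblings.get((hive.name + suffix).upper())
        for suffix in LOG_SUFFIXES
    }


def missing_logs(hive_path):
    """Return the suffixes with no matching file next to the hive."""
    return [s for s, p in find_companion_logs(hive_path).items() if p is None]
```

A missing lone .LOG is expected on a hive that uses the dual-log pair, so the result needs a human reading; missing .LOG1 and .LOG2 beside a dirty primary is the case to stop and go back for.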
Old single-log vs. the modern alternating scheme
The dual .LOG1/.LOG2 filenames have been present since Windows Vista, but the behaviour changed meaningfully later. It helps to separate two things: how many log files exist, and how the kernel uses them.
In the old logging scheme (and you will still meet it on older or specially-loaded hives), the transaction log carries a dirty vector: a bitmap, introduced by a DIRT signature, where each bit flags one 512-byte page of the hive as dirty, followed by the dirty page bodies themselves. The log base block's file-type field is 1. Recovery means reading the bitmap, then writing each flagged page back into the primary at its corresponding offset. Conceptually it is a page-level patch set: "these pages changed, here is their new content."
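As a rough illustration of that page-level patch, here is a sketch of old-format replay over in-memory buffers. The DIRT signature and the 512-byte page size come from the description above. Everything else is an assumption to verify against a format reference before relying on it: the vector starting at offset 512 of the log, bits read least-significant first within each byte, page bodies starting at the next 512-byte boundary after the bitmap, and page offsets counted from the start of the hive bins data at file offset 4096. The function name and the `bins_size` parameter are illustrative.

```python
PAGE_SIZE = 512          # one bit of the dirty vector covers one 512-byte page
HIVE_DATA_START = 4096   # hive bins data follows the 4096-byte base block
VECTOR_OFFSET = 512      # assumed position of the dirty vector in the log file


def apply_dirty_vector(primary: bytearray, log: bytes, bins_size: int) -> int:
    """Patch flagged pages from an old-format log into `primary`; return the count."""
    if log[VECTOR_OFFSET:VECTOR_OFFSET + 4] != b"DIRT":
        raise ValueError("dirty vector signature not found")
    bitmap_start = VECTOR_OFFSET + 4
    bitmap_len = bins_size // PAGE_SIZE // 8  # one bit per page
    bitmap = log[bitmap_start:bitmap_start + bitmap_len]
    if len(bitmap) < bitmap_len:
        raise ValueError("log truncated inside the dirty vector")
    # Assumed: page bodies begin at the next 512-byte boundary after the bitmap.
    pages_at = -(-(bitmap_start + bitmap_len) // PAGE_SIZE) * PAGE_SIZE
    # Grow the primary if the log describes a larger hive than we hold.
    needed = HIVE_DATA_START + bins_size
    if len(primary) < needed:
        primary.extend(bytes(needed - len(primary)))
    applied = 0
    for page_no in range(bitmap_len * 8):
        # Assumed bit order: least-significant bit first within each byte.
        if (bitmap[page_no // 8] >> (page_no % 8)) & 1:
            body = log[pages_at:pages_at + PAGE_SIZE]
            if len(body) < PAGE_SIZE:
                raise ValueError("log truncated inside the page bodies")
            dest = HIVE_DATA_START + page_no * PAGE_SIZE
            primary[dest:dest + PAGE_SIZE] = body
            pages_at += PAGE_SIZE
            applied += 1
    return applied
```

The sketch deliberately works on a copy held in a `bytearray`, so the acquired primary file is never modified in place.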
The new logging scheme, which arrived with Windows 8.1, restructures the log into a sequence of discrete log entries. The log base block's file-type field is 6, and each entry begins with the signature HvLE (hive log entry) and carries its own size, flags, a sequence number, the hive bins data size, a count of dirty pages, two hashes for integrity, an array of dirty-page references (each giving an offset and size into the hive bins data), and finally the dirty page bodies. Instead of one monolithic bitmap, the journal becomes an append-only chain of self-describing, individually validated transactions.
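To make that entry layout concrete, here is a sketch of a parser for a single entry. The field order follows the list above. The exact widths are assumptions drawn from public reverse-engineered documentation, not from this post: 32-bit little-endian integers, two 8-byte hashes, and 8-byte page references. The hashes are kept but not verified here, since the hash algorithm is outside the scope of this post. `LogEntry` and `parse_log_entry` are illustrative names.

```python
import struct
from dataclasses import dataclass

# Assumed layout: signature, size, flags, sequence number, hive bins data size,
# dirty page count, then two 8-byte hashes (40 bytes in total).
ENTRY_HEADER = struct.Struct("<4sIIIII8s8s")
# Assumed layout of one dirty-page reference: offset, then size.
PAGE_REF = struct.Struct("<II")


@dataclass
class LogEntry:
    size: int
    flags: int
    sequence: int
    bins_size: int
    hash1: bytes
    hash2: bytes
    pages: list  # (offset into hive bins data, page body) tuples


def parse_log_entry(log: bytes, pos: int):
    """Parse one entry at `pos`; return None if it is absent or malformed."""
    if pos + ENTRY_HEADER.size > len(log):
        return None
    sig, size, flags, seq, bins_size, count, h1, h2 = ENTRY_HEADER.unpack_from(log, pos)
    if sig != b"HvLE" or size < ENTRY_HEADER.size or pos + size > len(log):
        return None
    refs_at = pos + ENTRY_HEADER.size
    body_at = refs_at + count * PAGE_REF.size
    if body_at > pos + size:
        return None  # reference array would run past the entry
    pages = []
    for i in range(count):
        offset, length = PAGE_REF.unpack_from(log, refs_at + i * PAGE_REF.size)
        if body_at + length > pos + size:
            return None  # page body would run past the entry
        pages.append((offset, log[body_at:body_at + length]))
        body_at += length
    return LogEntry(size, flags, seq, bins_size, h1, h2, pages)
```

Walking a whole log is then a loop that starts after the log's base block, calls `parse_log_entry`, advances by the returned `size`, and stops at the first None. Returning None rather than raising keeps a torn tail entry from aborting the walk.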
The other half of the modern scheme is the alternation between the two logs. Under the new format the kernel does not nail itself to .LOG1; it regularly swaps the log file in use, .LOG1 to .LOG2 and back. The point is durability: while one log is being written, the other holds a known-good prior state, so a crash or torn write during a log update cannot destroy both copies at once. (The older fallback semantics, where the kernel uses .LOG1 and switches to .LOG2 only on a write error, are a different, more conservative behaviour; the modern scheme treats the swap as routine, not exceptional.) For a forensic tool the consequence is the same either way: both logs are potentially live, both must be parsed, and you cannot assume the freshest data is in the one with the higher number in its name.
Replaying the log to get a current view
Recovery, in either scheme, is replay: take the dirty pages the logs describe and write them into the primary hive at the offsets they name, producing a consistent, current image. The integrity machinery is what keeps replay honest. Sequence numbers order the entries and define where replay must stop — if a log entry with sequence number N is not followed by one numbered N+1, recovery applies everything through N and halts at that gap. A torn or partially-written tail entry therefore truncates cleanly instead of corrupting the result. The per-entry hashes in the new format let a parser reject an entry whose body does not match its header before it ever touches the primary.
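The stop-at-the-gap rule is easy to get subtly wrong, so here is a minimal sketch of it. It assumes entries arrive already parsed and hash-checked, as plain `(sequence, pages)` tuples in log order, where each page is an `(offset, body)` pair relative to the hive bins data at file offset 4096. The function name and tuple shape are illustrative. A real implementation does more that is left out here, such as choosing which entry to start from and updating the base block afterwards.

```python
def replay_entries(primary: bytearray, entries, data_start: int = 4096):
    """Apply entries in order; stop at the first sequence-number gap.

    Returns (entries_applied, last_sequence_applied).
    """
    last_seq = None
    applied = 0
    for seq, pages in entries:
        # An entry N must be followed by N+1; anything else ends replay.
        if last_seq is not None and seq != last_seq + 1:
            break
        for offset, body in pages:
            dest = data_start + offset
            end = dest + len(body)
            # Grow the image if a page lands beyond the current end.
            if end > len(primary):
                primary.extend(bytes(end - len(primary)))
            primary[dest:end] = body
        last_seq = seq
        applied += 1
    return applied, last_seq
```

Returning the count and the last sequence number gives the caller what it needs to print the "applied N log entries" style of message, and to notice when entries after a gap were left unapplied.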
What replay actually surfaces is the whole reason to bother. Some pages exist only in the logs — changes the kernel journalled but had not yet flushed to the primary when the hive was captured or the machine died. Replay those and registry keys and values materialise that the unreplayed primary file simply does not contain, and that the live regedit view on a now-rebooted machine may never have shown. On a deleted-key recovery job this is frequently where the decisive evidence is: a persistence value written, used, and removed inside a single sync window lives in the log replay and nowhere else.
Why so many tools quietly get this wrong
Here is the uncomfortable part. A large number of registry "parsers" — including ones people trust — open the primary file, walk the nk/vk cell tree, and render it, full stop. They never look at offsets 4 and 8. They never read the .LOG files, often because the logs were not even in the same directory when the hive was handed to them. The output looks complete. It has a tidy tree, plausible timestamps, real values. And it is stale, because the most recent changes were still sitting in a log that the tool ignored. Worse, most such tools give you no warning at all that the hive was dirty — no "primary != secondary, results may be incomplete" banner, nothing.
So the discipline is non-negotiable:
- Check the sequence numbers. If primary and secondary disagree, the hive is dirty by definition and any view without replay is suspect.
- Replay both logs, then verify the tool says it did. "Recovery: applied N log entries" is the message you want. Silence is not the same as a clean hive. yarp (--recover) and regipy (--apply_transaction_logs) do this explicitly and report it; cross-check a contested hive in a second implementation.
- Treat a missing log as a finding. If you only have the primary and the sequence numbers don't match, document that you could not recover it. Then go back to the source and get the logs.
A tool that fails to flag a dirty hive is not neutral — it is actively misleading, because it presents stale data with the same confidence as clean data. When you parse a hive in your browser the dirty-hive state and log replay should be visible, not hidden; that surfacing is the whole difference between a current view and a plausible-looking old one.
Putting it together
The registry's journalling model is genuinely good engineering: two sequence numbers give you an instant integrity verdict, and an alternating pair of logs gives durability against the exact failure modes — power loss, torn writes — that a single log cannot survive. For the analyst it collapses to a short checklist. A dirty hive is normal, not alarming; it just means the on-disk primary is behind its journal. The .LOG1 and .LOG2 files are first-class evidence, not scratch files. Replay is mandatory, and a tool that skips it silently is worse than one that refuses to open the file.
Grab the logs. Always. The five seconds it takes to copy two extra files at acquisition time is the difference between a recoverable case and a permanently stale hive.
For the broader picture of how these pieces fit, see Windows registry internals.
Further reading
- Google Project Zero, The Windows Registry Adventure #4: Hives and rules: on the Configuration Manager, hive layout, and the write-to-log-then-flush model that makes recovery necessary.
- Maxim Suhanov, Windows registry file format specification: the canonical reverse-engineered reference for the base-block sequence numbers, the DIRT dirty vector, and the HvLE log-entry format. His yarp library is the closest thing to a reference replay implementation.
- Joachim Metz, libregf: format and transaction-log documentation written from a parser implementor's perspective.