Why parsing a registry hive safely is hard: the attack surface

There is a quiet assumption baked into most registry tooling: that a hive is data, and data is safe. It is not. A hive file is a graph of structs that point at each other by offset, and the moment you load one you did not create, you have accepted attacker-controlled input into a parser that has to dereference those offsets. The registry attack surface is large precisely because the on-disk format was designed when the file was trusted, and that trust was never fully unwound. This is a defensive walk through why hive parsing security is hard, the bug classes the format invites, and the robustness rules that follow for anyone who writes a parser — forensic tools included. The authoritative reference, and the source for the vulnerability classes below, is Google Project Zero's "The Windows Registry Adventure #7: Attack surface analysis".

If you have not read the regf format overview or the hive bins and cells breakdown, skim them first. This assumes you know what a cell index is and that an nk points at a subkey list which points at more nks.

Two surfaces, not one

The registry exposes two distinct attack surfaces, and they fail in different ways.

The first is the userland API surface: the registry system calls NtCreateKey, NtSetValueKey, NtDeleteKey, and the rest, reachable by any process with no special privileges. This is the classic privilege-escalation surface. The attacker is not feeding you a malformed file here; they are driving a live registry through a sequence of legal-looking operations, hunting for a state the implementation did not anticipate. Bugs here tend to be logic and state-management bugs: an operation that half-completes and leaves the hive inconsistent, a refcount that drifts, a transaction that does not roll back cleanly.

The second is the hive-loading surface: the kernel's Configuration Manager parsing a hive file supplied via NtLoadKey / RegLoadKey. Project Zero notes this code lives in ntoskrnl.exe and so is not behind mitigations like the win32k lockdown. An unprivileged caller who can point the loader at a file they control (application hives loaded through RegLoadAppKey need no special privilege) hands raw, attacker-authored struct data straight into kernel parsing code. That is the surface that matters most to anyone writing a parser, because a forensic tool does the exact same thing the kernel does, reading a file someone else produced, just in userland.

Why loading an untrusted hive is dangerous

The core problem is historical. For most of the registry's life, hives were system files: written by the kernel, read by the kernel, never crossing a trust boundary. So the format carries no defensive armor — counts, sizes, types, and offsets are simply believed. The validation that did exist was oriented toward recovering a hive after a crash, not defending against a malicious one.

Project Zero points at the consequence: once a hive passes load-time validation, later code paths tend to assume the data is safe and stop checking. That split is the vulnerability factory. Validate-once-at-load only holds if the load-time check was complete, and it rarely is, because, as Project Zero puts it, "there are many logical-sounding requirements that are not enforced in practice." The set of invariants a programmer assumes is larger than the set the loader actually checks.

The bug classes a self-referential struct format invites

A hive is a self-referential graph: cells reference other cells by cell index, the registry's equivalent of a pointer. The instant your data model is "structs that point at each other by offset, with sizes and types stored inline," you have signed up for a specific family of failure modes. Naming them as classes — not recipes — is the useful exercise.

Out-of-bounds offsets. A cell index is just a number on disk. Nothing forces it to point inside the hive, at the start of a real cell, or at a cell of the expected type. Project Zero observes that with cell-index references "it's natural to expect common issues like buffer overflows or use-after-frees." A parser that dereferences an offset without checking that it lands within bounds, on a cell boundary, and inside an allocated cell is one crafted offset away from touching memory it should not.
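
As a minimal sketch of those checks, assume the commonly documented regf layout: a cell index is a byte offset from the start of the hive bins data (the file contents after the 4096-byte base block), each bin opens with a 32-byte "hbin" header whose size field sits at offset 8, and each cell opens with a signed 32-bit size that is negative when the cell is allocated. Under those assumptions, a parser can walk the bins once, record every real cell boundary, and refuse any index that is not in that table. The names HiveError, index_cells, and resolve_cell are hypothetical, not taken from any particular library.

```python
import struct

HBIN_HEADER_SIZE = 32     # assumption: fixed-size "hbin" header
BIN_GRANULARITY = 0x1000  # assumption: bin sizes are multiples of 4 KiB
CELL_ALIGNMENT = 8        # assumption: cell sizes are multiples of 8


class HiveError(ValueError):
    """Raised on any structural violation; the caller halts and reports."""


def index_cells(bins: bytes) -> dict:
    """Walk every bin once and map cell offset -> (size, allocated)."""
    cells = {}
    bin_start = 0
    while bin_start < len(bins):
        if bin_start + HBIN_HEADER_SIZE > len(bins):
            raise HiveError(f"truncated bin header at {bin_start:#x}")
        if bins[bin_start:bin_start + 4] != b"hbin":
            raise HiveError(f"bad bin signature at {bin_start:#x}")
        (bin_size,) = struct.unpack_from("<I", bins, bin_start + 8)
        bin_end = bin_start + bin_size
        if bin_size == 0 or bin_size % BIN_GRANULARITY or bin_end > len(bins):
            raise HiveError(f"bad bin size {bin_size:#x} at {bin_start:#x}")
        pos = bin_start + HBIN_HEADER_SIZE
        while pos < bin_end:
            (raw,) = struct.unpack_from("<i", bins, pos)
            size = abs(raw)  # negative on disk means "allocated"
            if size < CELL_ALIGNMENT or size % CELL_ALIGNMENT or pos + size > bin_end:
                raise HiveError(f"bad cell size {raw} at {pos:#x}")
            cells[pos] = (size, raw < 0)
            pos += size
        bin_start = bin_end
    return cells


def resolve_cell(bins: bytes, cells: dict, cell_index: int) -> bytes:
    """Return the payload of an allocated cell, or raise HiveError."""
    entry = cells.get(cell_index)
    if entry is None:
        # Covers out-of-range indexes and offsets that land mid-cell.
        raise HiveError(f"{cell_index:#x} is not the start of a cell")
    size, allocated = entry
    if not allocated:
        raise HiveError(f"{cell_index:#x} points at a free cell")
    return bins[cell_index + 4:cell_index + size]
```

The table costs one pass over the file, but it turns "is this offset a real cell?" into a dictionary lookup. Checking that the cell has the expected type is a separate step on the returned payload.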

Integer and size mismatches. Sizes are stored too, and stored sizes lie. The class is any place where a length on disk is trusted against a buffer it does not fit, or where size arithmetic overflows. Project Zero documents an instance (CVE-2022-37988) where an alignment condition that holds for cells written by Windows was not required to hold for cells loaded from disk: the writer's invariant, quietly assumed by the reader. The refcount cases (e.g. CVE-2023-28248) are the same shape: a counter that normally lives only in memory is, for hives, initialized from disk, so an attacker sets its starting value and overflow follows.
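
One concrete instance of the size-mismatch class is a declared count that does not fit the cell meant to hold it. The sketch below assumes a list cell whose payload is a packed array of 32-bit little-endian cell indexes (the usual description of a key's value list) and a count read from elsewhere in the hive. The function name read_offset_list is hypothetical. It bounds the read by the real container and reports the disagreement instead of trusting the count.

```python
import struct


def read_offset_list(payload: bytes, declared_count: int):
    """Read up to declared_count 32-bit offsets, never past the payload.

    Returns (offsets, truncated), where truncated is True when the
    declared count claimed more entries than the cell can hold.
    """
    if declared_count < 0:
        raise ValueError("negative count")
    capacity = len(payload) // 4            # what the container really holds
    count = min(declared_count, capacity)   # clamp to the container
    offsets = list(struct.unpack_from(f"<{count}I", payload, 0))
    return offsets, count != declared_count
```

Whether to clamp and flag, as here, or to reject the hive outright is a policy choice. The part that is not optional is that the container's size, not the stored count, decides how many bytes are read.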

Type confusion between cell types. Subkey lists come in several encodings (lh, lf, li, ri), and the cell's two-byte signature says which. Trusting the hive header version to imply the list type, rather than reading the actual signature, produced CVE-2022-38037: a hive violating that assumption got a list of one type interpreted as another. Any format with a tagged union invites this: read the tag, then trust it for the wrong field.
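
A hedged sketch of the safer habit: decide the element layout from the signature found in the cell itself, and reject anything unrecognized. The layout assumed here follows public descriptions of the format rather than anything stated in this article: a 2-byte ASCII signature, a 2-byte count, then elements of 4 bytes (li, ri) or 8 bytes (lf, lh, an offset followed by a 4-byte hint or hash). The name parse_subkey_list is hypothetical.

```python
import struct

# Assumed element size in bytes for each list signature.
ELEMENT_SIZE = {b"li": 4, b"ri": 4, b"lf": 8, b"lh": 8}


def parse_subkey_list(payload: bytes):
    """Return (kind, offsets) using the signature read from this cell."""
    if len(payload) < 4:
        raise ValueError("cell too small for a subkey list header")
    signature = payload[:2]
    (count,) = struct.unpack_from("<H", payload, 2)
    element_size = ELEMENT_SIZE.get(signature)
    if element_size is None:
        # Not a list type we know: refuse instead of guessing a layout.
        raise ValueError(f"unexpected list signature {signature!r}")
    if 4 + count * element_size > len(payload):
        raise ValueError("declared count does not fit in the cell")
    offsets = [
        struct.unpack_from("<I", payload, 4 + i * element_size)[0]
        for i in range(count)
    ]
    # For "ri" the offsets refer to further lists, not to keys, so the
    # caller must run the same signature check again on each of them.
    return signature.decode("ascii"), offsets
```

Nothing in this function consults the base block. The cell describes itself, and the description is checked against the cell's real size before any element is read.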

Reference loops. The graph is supposed to be a tree (for keys) and a set of linked lists (for security descriptors), but nothing on disk guarantees acyclicity. A subkey list that points back at an ancestor, or an sk flink/blink chain that loops, sends a naive recursive or list-walking parser into unbounded recursion or an infinite loop — at best a denial of service, and combined with other state, worse.
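
A walk that is guaranteed to terminate can be sketched with an explicit stack, a visited set, and a depth bound. The children_of callback is a hypothetical stand-in for whatever resolves a key's subkey offsets in your parser. The bound of 512 is an assumption based on the key depth limit Microsoft documents for the registry; any finite bound serves the purpose.

```python
from typing import Callable, Iterable, Iterator, Tuple

MAX_DEPTH = 512  # assumption: generous bound on key nesting


def walk_keys(
    root: int,
    children_of: Callable[[int], Iterable[int]],
    max_depth: int = MAX_DEPTH,
) -> Iterator[Tuple[int, int]]:
    """Yield (offset, depth) for every key, raising on a loop."""
    visited = set()
    stack = [(root, 0)]  # explicit stack: no Python recursion to exhaust
    while stack:
        offset, depth = stack.pop()
        if offset in visited:
            # In a real tree no key is reachable twice, so this is either
            # a cycle or a shared subtree; both are structural violations.
            raise ValueError(f"key at {offset:#x} reached more than once")
        if depth > max_depth:
            raise ValueError(f"key tree deeper than {max_depth} levels")
        visited.add(offset)
        yield offset, depth
        for child in children_of(offset):
            stack.append((child, depth + 1))
```

The same pattern applies to the sk list: remember every offset already visited and stop the moment the chain returns to one in a way the format does not allow.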

Inconsistent state from "impossible" inputs. Real malformed hives carry things the writer would never emit: duplicate value names under one key, the same security descriptor stored multiple times, key names stored uncompressed where Windows would compress. Project Zero's framing is that "any behavior that deviates from expected logic, whether documented or assumed, could lead to vulnerabilities." Each assumed-unique, assumed-canonical property the loader does not enforce is a place where downstream code reasoning about uniqueness or canonical form can be steered wrong.
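
Taking duplicate value names as the example, a parser can surface the violation rather than silently letting the first or last entry win. This is a small sketch with a hypothetical helper name. It assumes names compare case-insensitively and approximates that with Python's str.upper(), which is not guaranteed to match the kernel's own case folding.

```python
def find_duplicate_names(names):
    """Return (first_seen, duplicate) pairs for names that collide."""
    seen = {}
    duplicates = []
    for name in names:
        key = name.upper()  # approximation of case-insensitive comparison
        if key in seen:
            duplicates.append((seen[key], name))
        else:
            seen[key] = name
    return duplicates
```

For a forensic tool the duplicates are themselves evidence: a hive Windows wrote would not contain them, so they are worth reporting to the analyst.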

What this means for any registry parser

The lesson is not specific to kernel code. If you parse hives — forensic tool, IR script, library — you are the reader and the writer is hostile. Treat the file accordingly:

  • Validate every offset before you dereference it. A cell index must land within the hive, on a cell boundary, inside an allocated cell, at a cell whose type you expect. "It came from the file" is not validation.
  • Bound every read against its real container, not against the size the structure claims for itself. When a value's stated data length exceeds the cell holding it, handle that deliberately — clamp, flag, move on — never read past the cell.
  • Never trust a stored type, or the header to imply a type. Read and validate the cell signature on every dereference. Do not let a version field in the base block decide how you interpret a cell three hops away.
  • Detect cycles. Walking the key tree or the sk list, carry a visited-set or a depth bound. A self-referential format means a self-referential input is always possible; your walk must terminate rather than recurse forever.
  • Keep recovery separate from parsing. Free cells are not zeroed and deleted records can be recovered, but recovery is an opt-in pass over data you have already proven well-shaped. Do not let the main parse "repair" around a structural violation — halt and report, the way Project Zero notes the kernel itself prefers.
  • Fuzz the error paths, not the happy path. The interesting bugs live in rarely-executed handlers: truncated cells, out-of-range counts, exhausted allocations. Those paths almost never run on a clean hive, which is why they are under-tested and over-represented in the CVE list.
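
The last rule is the easiest to postpone, so here is a minimal sketch of what it can look like: a seeded mutation loop that corrupts and truncates a known-good input and treats any exception other than the parser's deliberate rejection as a finding. The function name and the convention that ValueError means a deliberate rejection are assumptions for illustration. A coverage-guided fuzzer will go much further than this.

```python
import random


def fuzz_error_paths(parse, seed_input: bytes, iterations: int = 500,
                     seed: int = 0, expected=ValueError):
    """Mutate seed_input and collect inputs that crash parse().

    An exception of type `expected` counts as a deliberate rejection.
    Anything else (IndexError, struct.error, RecursionError, ...) means
    an error path was not handled and is recorded as a finding.
    """
    if not seed_input:
        raise ValueError("seed_input must not be empty")
    rng = random.Random(seed)  # seeded so findings are reproducible
    findings = []
    for _ in range(iterations):
        mutated = bytearray(seed_input)
        # Overwrite a few random bytes with random values.
        for _ in range(rng.randint(1, 4)):
            mutated[rng.randrange(len(mutated))] = rng.randrange(256)
        # Sometimes cut the input short to exercise truncation handling.
        if rng.random() < 0.25:
            del mutated[rng.randrange(len(mutated)):]
        data = bytes(mutated)
        try:
            parse(data)
        except expected:
            pass
        except Exception as exc:
            findings.append((data, repr(exc)))
    return findings
```

Using a dedicated exception class for `expected` is tighter than plain ValueError, since some built-in errors such as UnicodeDecodeError are ValueError subclasses and would otherwise be counted as deliberate rejections.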

None of this is exotic; it is the discipline any parser of an adversarial binary format needs. The registry just makes the cost of skipping it unusually high, because its on-disk data was trusted for decades and the habits formed around that trust persist in the code.

Where the client-side WASM sandbox helps

A forensic parser cannot make a malicious hive benign, but it can shrink the blast radius if one trips a parsing bug. Registry Parser runs its parser as WebAssembly inside the browser tab. Two properties follow. First, the hive never leaves the machine: parsing is client-side, so there is no server-side parser to attack and no upload of evidence. Second, the parser runs in the browser's WASM sandbox: confined to its own linear memory and to the capabilities the page grants it, with no filesystem and no host memory access. A crafted offset that would be a memory-corruption primitive against an unsandboxed native parser is, here, contained to the module's own linear memory in one tab. That is not a substitute for the validation rules above; it is the defense-in-depth layer underneath them. You still validate every offset, bound every read, and detect every cycle, and the sandbox contains the bug you missed.

Closing

The registry attack surface is wide because it is really two surfaces — a live userland API reachable by anyone, and a kernel hive-parser fed attacker-controlled files — and because the on-disk format was born trusted. The registry vulnerabilities that follow are not exotic zero-days so much as the predictable output of a struct-based, self-referential format whose stored offsets, sizes, types, and counts were historically believed rather than checked. Hive parsing security comes down to a short, unglamorous list: validate every offset, bound every read against its real container, never trust a stored type, detect cycles, keep recovery separate from parsing. For the full taxonomy and the CVEs behind these classes, read Project Zero's "The Windows Registry Adventure #7: Attack surface analysis". For broader context, see our Windows registry internals overview — and if you want to parse a hive safely in your browser, that is exactly what this tool's sandbox is for.