building a nullifier database from first principles
introduction
as of block 25,446,304, the tornado cash pools on ethereum mainnet hold 278,827 deposits across every denomination of eth, dai, cdai, usdc, usdt, and wbtc. rotortree ingests all of them, durably, with a write-ahead log and an fsync, in about 10 milliseconds on a laptop. that number fell out of designing a database around one narrow workload instead of adapting a general purpose one, and this post walks through how.
most privacy protocols store their nullifier and commitment sets in merkle trees backed by off-the-shelf databases like rocksdb or leveldb. i’ve built merkleization layers on top of these before, and the mismatch always bothered me: an append-only merkle tree needs almost nothing that a general purpose kv store spends its complexity budget on. about a year ago i picked up database internals and started building the thing i actually wanted, slowly. rotortree is the result: an append-only, n-ary variant of the LeanIMT with a persistence layer shaped around exactly this kind of data. it is a research experiment, it makes many trade-offs, and it is not production ready. the readme says this louder than i will here.
the workload
a nullifier database does four things. append a 32-byte value that is already a hash (nullifiers and commitments come out of poseidon or similar, pre-constrained to the field). never update it. never delete it. prove things about it: membership of a specific leaf, and consistency between an old root and a new one so a light client can follow along without the leaves.
that is the whole api. now look at what a general purpose kv store charges for. compaction rewrites data that will never change. tombstones and version chains track deletes that cannot happen. variable-length key encoding wraps values that are all exactly 32 bytes. bloom filters accelerate random point lookups the workload never issues, because reads here are proof walks: one node per level, at an index you can compute with arithmetic. every one of those facilities costs write amplification, memory, or cpu whether you use it or not. qmdb exists because layerzero hit a similar wall from a different direction; their design still targets general verifiable state with in-place updates, so it solves a harder problem than the one i have.
take updates and deletes away and the storage question collapses. every node is 32 bytes, forever. a level of the tree is a flat array of hashes, an offset into it is a multiplication, and the on-disk format can be the in-memory format. the entire design of rotortree follows from this observation.
one honest note before going further: you still want a small kv set beside the tree to reject duplicate nullifiers before insertion. the tree happily appends the same leaf twice.
an n-ary leanimt
the tree itself is a generalization of the LeanIMT from PSE, picked because functional equivalents exist in zk dsls and solidity. the leanimt tricks: depth is dynamic, growing only as leaves arrive, and there is no zero-hash padding. a group with a single member is lifted to the next level unchanged, and a partial rightmost group of k < N members is hashed over exactly those k children. rotortree keeps all of that and makes the branching factor a const generic, so LeanIMT<H, N, MAX_DEPTH> is monomorphized per arity with fixed-size stack buffers everywhere.
a single append walks one spine from leaf to root:
let mut node = leaf;
let mut idx = index;
for level in 0..depth {
inner.levels[level].set(idx, node)?;
let child_pos = idx % N;
if child_pos != 0 {
let group_start = idx - child_pos;
let count = child_pos + 1;
let mut children = [[0u8; 32]; N];
inner.levels[level].get_group(group_start, child_pos, &mut children);
children[child_pos] = node;
node = hasher.hash_children(&children[..count]);
}
idx /= N;
}
one group hash per level, siblings gathered into a stack array, zero heap allocation. depth is ceil_log_n(size, N), so the whole tree shape is a function of the leaf count alone. that property cuts both ways: it makes the structure lean, and it moves real security work into the verifier, which must recompute the expected depth and each node’s expected position from tree_size and reject anything that disagrees. there are dedicated tests for padded proofs, truncated proofs, and spoofed leaf indices.
i got this wrong once. the first consistency proof verifier checked shared + new_only <= max_sum, an upper bound on the prover-supplied counts. an attacker could under-report the new-sibling count and forge a transition to a chosen root. the fix (b2d6c7b) recomputes the exact count the tree would have produced and demands equality:
let expected_new_count = new_level_size.min(group_start + N) - boundary - 1;
if shared != expected_shared || new_only != expected_new_count {
return Err(TreeError::MathError);
}
range checks on prover-supplied values leave room to forge; the verifier has to rederive what the only valid value is. cheap lesson here, it usually costs more.
so why n-ary at all? intuitively, wider trees are shallower: N=16 holds a billion leaves in 8 levels where binary needs 30. shallower means smaller proofs and fewer sequential hash dependencies. the catch shows up inside a zk circuit, where each level now hashes N children. the repo ships two empirical models for this (fitted, commit b899db0). for jolt execution cycles:
and for noir circuit size under ultrahonk:
where D is the compile-time MAX_DEPTH and d is the effective depth for the current leaf count. the per level punishes wide trees hard: going from N=2 to N=16 at a billion leaves cuts d from 30 to 8 while gates per level grow ~36x. the two models disagree on the winner, jolt bottoms out near N=4 while noir keeps rewarding N=2, but both agree that wide arities lose. N=4 is where the examples land once proof size and storage density join the argument. there is a model calculator for picking N and MAX_DEPTH against your own leaf count, with proof sizes, memory, and disk laid out per configuration.
leaves are stored verbatim, let mut node = leaf;, a leaf write is a memcpy. getting there took a reversal. at 00:36 one night, commit be72587 landed a nice optimization fusing per-leaf hashing into the bulk extend path, parallelized with rayon. at 02:11 the same night, a3d3267 deleted the premise: nullifiers arrive prehashed, so hashing them again was pure overhead. the optimization survived 95 minutes. the surviving design is a protocol-level bet, and it has a cost worth stating plainly: with no leaf domain tag, second-preimage hygiene is the caller’s job. feed the tree fixed-width prehashed field elements or don’t use it.
consistency proofs carry the light-client story. update_inclusion_proof takes a valid membership proof against an old root and rolls it forward to the new root using only the consistency data, no leaves transferred. the light_client example bootstraps once, drops the tree, and tracks a single leaf across batches with fixed-size proof structs; bandwidth per update is constant while raw leaves would scale linearly with batch size.
the persistence layer
the storage feature wraps the tree in a RotorTree with a wal, checkpoints, and mmap tiering. the design goal is blunt: no i/o on the insert path, ever.
pub fn insert(&self, leaf: Hash) -> Result<(Hash, DurabilityToken), RotorTreeError> {
self.shared.check_closed()?;
let (root, token) = {
let mut state = self.shared.state.lock();
let root = LeanIMT::<H, N, MAX_DEPTH>::_insert(
&mut state.inner,
&self.shared.hasher,
leaf,
)?;
let seq = state.next_seq;
wal::serialize_entry(&mut state.buffer, seq, wal::WalPayload::Single(leaf));
state.next_seq = state.next_seq.checked_add(1).expect("seq overflow; qed");
// ... counters for auto-checkpoint bookkeeping ...
let snap = state.inner.snapshot();
self.shared.snapshot.store(Arc::new(snap));
let token = self.shared.durability.token(seq);
(root, token)
};
self.maybe_auto_checkpoint();
Ok((root, token))
}
that “wal write” is an append into an in-memory Vec<u8>. no syscall happens under the lock, or anywhere else on this path. a background thread wakes every 10ms, takes the whole buffer, and lands it with one write_all and one sync_data. classic group commit: every insert from the last interval shares a single fdatasync. visibility and durability are decoupled, the new root is readable immediately via the arc-swap snapshot while the DurabilityToken lets a caller block until its specific sequence number is on disk. a crash can eat up to 10ms of recent inserts unless you waited on the token; that trade is deliberate and priced in.
wal records are framed as a length prefix, a wincode payload, and a crc32c computed over both. the crash model lives in one branch of the frame reader: a crc mismatch at the tail of the file is a torn final write from losing power mid-append, and recovery silently drops it. a crc mismatch anywhere else is bit rot and fails loudly. wincode handles all persistence framing because its fixed, schema-derived layout gives a computable serialized size up front; serde exists on the proof types only, because the jolt prover needs it.
checkpoints bound wal growth. a checkpoint materializes each level’s new chunks into shard files, 65,536 chunks of 4 KiB each, so a shard caps at 256 MiB, then writes a meta file through the classic atomic dance: write tmp, fsync tmp, rename, fsync the parent directory. shard data is appended in place and the meta lands last, so a crash mid-checkpoint just falls back to the previous meta and a longer wal replay. after the meta is durable, the wal is truncated back to a bare header. an early version had a race here, truncating the wal while concurrent inserts kept buffering; the fix holds the wal file lock across the entire checkpoint and flushes the buffer a second time after truncation (d917264).
after a checkpoint, cold levels move out of ram. TieringConfig::pin_above_level remaps chunks of every level below the threshold onto the mmap’d shard files (the name reads backwards, it tripped me too). tiering evicts the leaves first, which sounds wrong until you look at where the bytes are: level 0 holds roughly of the whole tree while a proof touches one node per level. dropping the bottom levels to memmap2 reclaims nearly all the memory and costs a page-cache hit per proof level. mapped chunks are read through a copy-on-write mapping, so a stray write can never corrupt the checkpoint files.
recovery inverts the checkpoint: read the meta, mmap the shards, wrap each 4 KiB chunk as a pointer into the mapping, take the root hash from the meta, replay the small wal delta. nothing is hashed. the full bottom-up root recomputation only runs when verify_checkpoint is set, and with verification on, recovering a 100M-entry tree takes about a second on my machine.
mechanical sympathy
everything in this section exists to hand the hardware work in the shape it already likes. some optimizations carry an isolated before/after measurement and some are design reasoning whose benefit only shows up inside end-to-end numbers; each item below is tagged measured or structural accordingly.
levels are arrays of pages (structural). forget node objects and pointers; a level is a flat run of page-sized chunks:
pub(crate) const CHUNK_SIZE: usize = 128;
pub(crate) const CHUNKS_PER_SEGMENT: usize = 256;
pub(crate) struct ChunkedLevel {
/// Immutable segments of committed chunks, shared with snapshots.
segments: Vec<Arc<[Chunk; CHUNKS_PER_SEGMENT]>>,
/// Mutable buffer of committed chunks not yet frozen into a segment.
pending: Vec<Chunk>,
/// Fixed-size tail buffer (partially filled).
tail: [Hash; CHUNK_SIZE],
tail_len: usize,
len: usize,
}
128 hashes of 32 bytes is 4096 bytes, one os page (usually; can be greater). a sibling group is a contiguous span inside a chunk, so visiting it is a linear scan the prefetcher already understands, and when a level lives on mmap, one chunk faults exactly one page. the comment in the source motivates the chunk with structural sharing rather than any cache claim, the page-size coincidence is what makes the tiering clean.
full groups are borrowed (structural). CHUNK_SIZE % N == 0 for every simd-eligible arity, so a full group of N siblings never straddles a chunk boundary and group_slice returns &[Hash] directly into the chunk. no copy between the tree and the hasher.
parent hashing is batched into the simd kernel (measured). the batch path gathers up to 16 borrowed groups per call:
/// Parents gathered per `hash_many_into` call. The Blake3 override
/// re-splits this into `simd_degree`-sized SIMD calls internally.
const WINDOW: usize = 16;
let take = (full_parents - parent_idx).min(WINDOW);
let mut filled = 0;
for slot in refs.iter_mut().take(take) {
let start = (parent_idx + filled) * N;
match child.group_slice(start, N) {
Some(g) => *slot = g,
None => break, // scalar tail handles boundary stragglers
}
filled += 1;
}
hasher.hash_many_into(&refs[..filled], &mut out[..filled]);
and the blake3 adapter feeds them to the multi-input hash_many kernel, which compresses simd_degree independent inputs per pass: 4 lanes on neon, 8 on avx2, 16 with avx-512. every parent at a level is independent of its neighbours, a single hash call just can’t saturate the vector units.
match arity {
2 => dispatch!(64),
4 => dispatch!(128),
8 => dispatch!(256),
16 => dispatch!(512),
_ => false,
}
arity is capped at {2, 4, 8, 16} because byte-identity with blake3::hash holds only for inputs that are whole 64-byte blocks within a single 1024-byte chunk; past that the batched output silently diverges. that kernel is a #[doc(hidden)] api driven with hand-copied iv and flag constants, so the dependency is pinned to exactly =1.8.5 and a unit test asserts byte-identity per arity. the cargo.toml comment puts it well: the test guards behavior, the pin guards the api. of everything in this section, the simd path alone carries its own number: a criterion baseline diff showed insert_many going 2.87-2.94x faster across all arity/count combinations on neon, 87.3ms down to 30.5ms for 1M leaves.
the hot path never allocates (structural). sibling gathering uses a [[0u8; 32]; N] stack array thanks to const-generic N. batched inserts presize every parent level in one pass before hashing begins, so no chunk allocations land mid-loop, and writes go through set_preallocated, which skips the length-extension branch. the parallel path reuses one scratch buffer across levels through MaybeUninit writes followed by set_len, avoiding a zero-fill of memory that is about to be overwritten.
depth is two bit-count instructions (structural). the tree recomputes its depth on every insert and every proof verification:
#[inline(always)]
pub(crate) fn ceil_log_n(size: u64, n: usize) -> usize {
if size <= 1 {
return 0;
}
if n.is_power_of_two() {
let k = n.trailing_zeros();
let bits = u64::BITS - (size - 1).leading_zeros();
return bits.div_ceil(k) as usize;
}
(size - 1).ilog(n as u64) as usize + 1
}
power-of-two arities compile down to tzcnt/lzcnt and a division-free ceiling. the general-base fallback pays for an iterative ilog, one reason the readme’s future-work list opens with precomputing this as a table.
parallelism respects its own overhead (structural). rayon only engages when a level has at least 1024 parents to compute (tunable via ROTORTREE_PARALLEL_THRESHOLD), in tasks of 64 parents each, with split_at_mut producing provably disjoint child/parent borrows. below the threshold, fork-join coordination costs more than the hashing; above it, disjoint ranges mean workers never contend on the same cache lines.
files only see sequential appends (structural). the wal flush is one write_all at the append position. checkpoint chunks are appended to their shards at computed offsets. locating chunk i on disk is arithmetic:
pub(crate) const CHUNK_BYTE_SIZE: usize = CHUNK_SIZE * 32;
pub(crate) const CHUNKS_PER_SHARD: usize = 65_536;
#[inline]
pub(crate) fn shard_address(chunk_idx: usize) -> (usize, usize) {
let shard_idx = chunk_idx / CHUNKS_PER_SHARD;
let offset_in_shard = (chunk_idx % CHUNKS_PER_SHARD) * CHUNK_BYTE_SIZE;
(shard_idx, offset_in_shard)
}
fixed 32-byte records buy the absence of any index structure, no b-tree descent, no key comparison, a divide and a modulo. here the workload analysis from earlier pays out: when values never change size and never disappear, addressing degenerates into multiplication.
snapshots copy one page (structural). a snapshot for proof serving clones the segment Arcs (a pointer-equality test pins this invariant) and copies the mutable tail by value, at most 4 KiB. readers grab it through arc-swap without ever taking the write lock.
the sympathy here lives in data layout, access order, and keeping syscalls off the hot path, and apart from the simd path it is validated in aggregate by the ~380-benchmark suite.
numbers
everything below is from an m4 pro, 14 cores, 48 gb, and the exact configurations are in the repo. the benchmark suite is ~380 criterion benchmarks generated by crabtime macros, monomorphized per arity, published here.
peak in-memory throughput on the parallel path reaches up to ~190M leaves/sec depending on run conditions; the committed criterion report peaks at 164.63M elem/s on its fastest sample (161.86M mean) for insert_many, N=16, 1M leaves. a bulk_load, 10M leaves at N=4 with the wal, 10ms interval flush, 256 MiB checkpoint threshold, and full mmap tiering, gave a median of roughly 50M inserts/sec for the in-memory portion of each 1M block and ~27M inserts/sec durable, i.e including the wait for the fsync token. proof generation off a snapshot held ~167ns median at depth 12, verification ~1.2µs. the harshest committed benchmark, sustained_checkpoint, which checkpoints every 5 to 500 entries and generates and verifies proofs inside the timed loop, floors at 14-16M leaves/sec.
those three figures spread almost 4x, and the spread is the honest story: ~50M/s is what the tree computes, ~27M/s is what survives an fsync, and 14-16M/s is what you get with checkpoints and proof traffic in the loop. quoting only the first would be flattering and useless. note also that reads never touch disk on the query path, always memory or page-cache-backed mmap, so comparisons against merkle stores that hit disk per read are apples to oranges by construction.
recovery of a 100M-entry tree with full root verification takes about 1s. capacity is pure arithmetic: total nodes are at 32 bytes each, so N=4 with MAX_DEPTH=16 holds ~4.3B nullifiers in ~170 GiB on disk, of which only ~43 GiB (the non-leaf levels) needs to be resident once level 0 tiers out to mmap. N=8 with MAX_DEPTH=10 holds ~1B in ~37 GiB.
back to the opening claim. tornado cash accumulated 278,827 deposits across all mainnet pools over seven years, 262,941 of them in the four eth pools; each pool is a fixed-depth-20 binary tree, so the busiest one (1 eth, 85,936 deposits) sits at 8% of its 2^20 capacity. as a rotortree at N=4 that entire history is a depth-10 tree ingested by one insert_many in ~5.6ms buffered, ~10.3ms durable, one flush interval, with membership proofs at ~167ns after. to be clear about what that comparison is: tornado’s trees hash with mimc inside a snark, so this measures the database layer at that scale, with blake3 as the hasher. the Hasher trait is generic precisely so a circuit-friendly hash can be swapped in, at a large throughput cost.
limitations and future work
repeating the disclaimer: rotortree is a research experiment, unsuitable for production; the readme enumerates the trade-offs in caps.
no leaf domain separation exists, by design; the prehashed-leaves bet pushes second-preimage responsibility onto the caller. the simd path rides a private blake3 api held together by an exact version pin and a byte-identity test, which is a maintenance landmine i accepted with eyes open. profiling on macos has been a wall: instruments and samply give decent cpu profiles, but disk i/o and hardware counters stay opaque, so a beefy linux box with perf is high on the wishlist. io_uring for wal and checkpoint writes would be nice eventually, though with zero i/o on the hot path it can wait. recovery has room, and trees are single-generation, so long-lived deployments should rotate trees per epoch and encode the generation into the nullifier.
turns out a database gets easy to make fast once you delete every feature the workload doesn’t need.
Thanks for reading!