a sub-microsecond orderbook in rust

introduction

most orderbook implementations reach for a BTreeMap or some variant of a balanced tree to organize price levels. it’s the textbook answer: O(log n) inserts and lookups with sorted traversal for free. but “log n fast” and “cache-friendly fast” are different things when the hot path is a tight matching loop hammering the same data structures millions of times per second.

some time ago i came across writeups about how production matching engines at places like LMAX and Databento achieve single-digit microsecond latencies. the common thread: flat data structures, pre-allocated memory, respect for the cache hierarchy, hardware empathy.

the implementation is a price-time priority matching engine in Rust. it supports limit and market orders, cancellations, and market depth queries. on an M4 Pro, it sustains over 1.5 million operations per second with sub-microsecond latency per operation. the core idea: replace tree-based price level storage with flat Vec arrays indexed by direct arithmetic, and pre-allocate all order storage in a pool backed by MaybeUninit.

order modification, iceberg orders, multiple tick sizes, and regulatory requirements are all out of scope here. the goal was to understand how far a single-threaded matching engine can go when you think about cache lines instead of algorithmic complexity classes.

why trees hurt

a BTreeMap stores keys in sorted order across heap-allocated nodes. each node packs up to 11 key-value pairs (Rust’s B parameter is 6), which is better than a binary tree, but looking up a price level still means chasing pointers from root to leaf through multiple nodes. every pointer chase is a potential L1 cache miss, because the system allocator doesn’t guarantee that parent and child nodes sit near each other in physical memory.

for an orderbook, this matters more than it might seem. matching is the hot path: an incoming order walks the opposite side of the book from best price outward, consuming liquidity at each level until it’s filled. with a tree, each price level access means pointer chasing. with a flat array, it’s base_ptr + index * stride, which the hardware prefetcher handles almost for free.

price range is constrained to 1024 levels on each side of a base price (i.e. 1024 buy levels and 1024 sell levels). you lose flexibility: prices outside this window require re-centering the book, which hasn’t been implemented. for instruments with bounded tick ranges (e.g. equities, or futures with known contract specs), 1024 levels is plenty. for something like a crypto perpetual with wild price swings, re-centering would need to be handled gracefully.

price-to-index conversion is a subtraction and a division. with tick_size = 1 (the default), and provided the compiler can see that constant, the division optimizes away and the whole function compiles to a subtract and a bounds check:

#[inline]
fn buy_price_to_idx(&self, price: u64) -> Option<usize> {
    if price >= self.base_price {
        return None;
    }
    let idx = ((self.base_price - price) / self.tick_size) as usize;
    if idx < PRICE_LEVELS { Some(idx) } else { None }
}

for buys, lower indices map to higher prices, so the lowest populated index is the best bid. for sells, index 0 maps to base_price (i.e. the best ask). both sides increase outward from the spread, so walking from best toward worse prices is a sequential scan through contiguous memory.
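the sell side mirrors this. a sketch of the counterpart conversion, including the inverse mapping the matching loop uses (the Book struct here is a minimal stand-in for the real orderbook type):

```rust
const PRICE_LEVELS: usize = 1024;

struct Book {
    base_price: u64,
    tick_size: u64,
}

impl Book {
    // index 0 maps to base_price (the best ask); higher indices
    // map outward to worse (higher) ask prices
    #[inline]
    fn sell_price_to_idx(&self, price: u64) -> Option<usize> {
        if price < self.base_price {
            return None;
        }
        let idx = ((price - self.base_price) / self.tick_size) as usize;
        if idx < PRICE_LEVELS { Some(idx) } else { None }
    }

    // inverse mapping, used while walking levels during matching
    #[inline]
    fn sell_idx_to_price(&self, idx: usize) -> u64 {
        self.base_price + idx as u64 * self.tick_size
    }
}
```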

data layout

most orderbook implementations online stuff extra fields into their order struct (trader ID, exchange metadata, string tags) that bloat each record well past a cache line. here, an order carries only what the matching loop needs to touch. a typical L1 cache line is 64 bytes on x86 and 128 bytes on Apple Silicon. compact orders mean fewer cache misses per matching sweep, and the difference is measurable once you’re processing millions of fills per second.

pub struct Order {
    pub order_id: u64,  // 8 bytes
    pub price: u64,     // 8 bytes
    pub quantity: u64,  // 8 bytes
    pub timestamp: u64, // 8 bytes
    flags: u8,          // 1 byte (padded to alignment)
}

side and order type are packed into a single u8 using bit flags. bit 0 encodes side (0 = buy, 1 = sell), bit 1 encodes order type (0 = limit, 1 = market). with padding, the struct lands at 40 bytes. on x86, one order occupies most of a 64-byte cache line with 24 bytes spare. on Apple Silicon with 128-byte lines, three orders fit per line. extraction compiles to a conditional move at -O3:

#[inline]
pub fn side(&self) -> Side {
    if self.flags & 1 == 0 { Side::Buy } else { Side::Sell }
}
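for completeness, a sketch of the full bit layout with packing and both extractions (the enum derives and the pack_flags constructor are assumptions, not the original API):

```rust
#[derive(Debug, PartialEq)]
pub enum Side { Buy, Sell }

#[derive(Debug, PartialEq)]
pub enum OrderType { Limit, Market }

pub struct Order {
    flags: u8,
}

impl Order {
    // bit 0: side (0 = buy, 1 = sell); bit 1: type (0 = limit, 1 = market)
    pub fn pack_flags(side: Side, order_type: OrderType) -> u8 {
        (matches!(side, Side::Sell) as u8)
            | ((matches!(order_type, OrderType::Market) as u8) << 1)
    }

    #[inline]
    pub fn side(&self) -> Side {
        if self.flags & 1 == 0 { Side::Buy } else { Side::Sell }
    }

    #[inline]
    pub fn order_type(&self) -> OrderType {
        if self.flags & 2 == 0 { OrderType::Limit } else { OrderType::Market }
    }
}
```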

all fields are u64. no floating point anywhere in the critical path. prices and quantities represent the smallest tick as integers, so there’s no precision issue. timestamps are generated via Instant::now(), though the current implementation has a subtle bug: it calls .elapsed() immediately after creating the instant, so all timestamps are approximately zero. in practice, FIFO ordering within a price level is maintained by insertion order into the vec (i.e. append-only) rather than timestamp comparison.
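a minimal fix for the timestamp bug is to capture one Instant when the engine starts and measure elapsed nanoseconds from that epoch (a sketch; the Clock type and field names are hypothetical):

```rust
use std::time::Instant;

struct Clock {
    epoch: Instant,
}

impl Clock {
    fn new() -> Self {
        Clock { epoch: Instant::now() }
    }

    // monotonic nanoseconds since engine start: distinct values for
    // orders created at different times, unlike calling
    // Instant::now().elapsed() immediately, which is always ~0
    fn now_ns(&self) -> u64 {
        self.epoch.elapsed().as_nanos() as u64
    }
}
```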

each price level holds an aggregated quantity and a vec of order indices:

pub struct PriceLevel {
    pub price: u64,
    pub total_quantity: u64,
    pub order_indices: Vec<usize>,
}

remove_order uses swap_remove for O(1) deletion from the level. one caveat: swap_remove moves the last element into the removed slot, so cancellations can break arrival order within a price level. the matching loop uses .retain() instead, which preserves FIFO order but is O(n). a production implementation would want to decouple these two removal strategies more carefully.
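sketched out, the two removal paths look like this (retain_unfilled is a hypothetical name for the match-side cleanup; execution-report bookkeeping is elided):

```rust
pub struct PriceLevel {
    pub price: u64,
    pub total_quantity: u64,
    pub order_indices: Vec<usize>,
}

impl PriceLevel {
    // O(1) cancel path: the last index moves into the vacated slot,
    // so FIFO arrival order within the level is not preserved
    pub fn remove_order(&mut self, pool_idx: usize, qty: u64) {
        if let Some(pos) = self.order_indices.iter().position(|&i| i == pool_idx) {
            self.order_indices.swap_remove(pos);
            self.total_quantity -= qty;
        }
    }

    // O(n) match-side cleanup: drops filled orders, keeps FIFO order
    pub fn retain_unfilled(&mut self, filled: &[usize]) {
        self.order_indices.retain(|i| !filled.contains(i));
    }

    pub fn is_empty(&self) -> bool {
        self.order_indices.is_empty()
    }
}
```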

pre-allocated memory pool

heap allocation in the matching hot path is the enemy. a single call to Vec::push or Box::new can trigger the global allocator, which might take hundreds of nanoseconds, or worse, block on a page fault if the kernel needs to map new memory.

the OrderPool pre-allocates a contiguous block of MaybeUninit<Order> slots and maintains a free-list as a stack of indices:

pub struct OrderPool {
    pool: Vec<MaybeUninit<Order>>,
    free_indices: Vec<usize>,
}

impl OrderPool {
    #[inline]
    pub fn allocate(&mut self, order: Order) -> Option<usize> {
        if let Some(index) = self.free_indices.pop() {
            self.pool[index] = MaybeUninit::new(order);
            Some(index)
        } else {
            None
        }
    }

    #[inline]
    pub fn deallocate(&mut self, index: usize) {
        self.free_indices.push(index);
    }
}

allocation is a pop followed by a write. deallocation is a push. both O(1) with no system calls and no fragmentation. orders sit in a contiguous Vec, so iterating during matching benefits from spatial locality.
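a usage sketch of the pool (with_capacity and get are assumptions about the surrounding API; the payload type is made generic so the example is self-contained):

```rust
use std::mem::MaybeUninit;

pub struct OrderPool<T> {
    pool: Vec<MaybeUninit<T>>,
    free_indices: Vec<usize>,
}

impl<T> OrderPool<T> {
    // pre-allocate every slot up front; no heap activity after this
    pub fn with_capacity(capacity: usize) -> Self {
        let mut pool = Vec::with_capacity(capacity);
        pool.resize_with(capacity, MaybeUninit::uninit);
        // push in reverse so low indices are handed out first
        let free_indices = (0..capacity).rev().collect();
        OrderPool { pool, free_indices }
    }

    pub fn allocate(&mut self, value: T) -> Option<usize> {
        let index = self.free_indices.pop()?;
        self.pool[index] = MaybeUninit::new(value);
        Some(index)
    }

    pub fn deallocate(&mut self, index: usize) {
        self.free_indices.push(index);
    }

    // safety: index must have come from allocate and not yet been deallocated
    pub unsafe fn get(&self, index: usize) -> &T {
        unsafe { self.pool[index].assume_init_ref() }
    }
}
```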

capacity is fixed at creation time. if the pool runs out, orders are rejected. for benchmarks, 1 million slots are pre-allocated (i.e. roughly 40MB at 40 bytes per order). wrapping this behind the Allocator trait (still nightly-only, but the codebase already requires nightly for portable_simd) is the obvious next step so the pool can back standard collections directly, giving Vec<Order, OrderPoolAlloc> semantics without the unsafe access pattern.

accessing a pooled order currently requires unsafe because of MaybeUninit. the safety invariant is simple: never read from an index that hasn’t been allocated or has been deallocated. the free-list discipline enforces this, but the compiler can’t verify it. a generational arena like thunderdome or slotmap would catch use-after-free at one extra comparison per access, which is probably the right trade-off for anything going to production.
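hand-rolled for illustration, generation-checked access looks roughly like this (thunderdome and slotmap implement the same idea with more care; all names here are hypothetical):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Handle {
    index: usize,
    generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

pub struct Arena<T> {
    slots: Vec<Slot<T>>,
    free: Vec<usize>,
}

impl<T> Arena<T> {
    pub fn new() -> Self {
        Arena { slots: Vec::new(), free: Vec::new() }
    }

    pub fn insert(&mut self, value: T) -> Handle {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index];
            slot.value = Some(value);
            Handle { index, generation: slot.generation }
        } else {
            self.slots.push(Slot { generation: 0, value: Some(value) });
            Handle { index: self.slots.len() - 1, generation: 0 }
        }
    }

    pub fn remove(&mut self, h: Handle) -> Option<T> {
        let slot = self.slots.get_mut(h.index)?;
        if slot.generation != h.generation {
            return None;
        }
        let v = slot.value.take()?;
        slot.generation += 1; // invalidate any stale handles to this slot
        self.free.push(h.index);
        Some(v)
    }

    // the "one extra comparison per access": the generation check
    pub fn get(&self, h: Handle) -> Option<&T> {
        let slot = self.slots.get(h.index)?;
        if slot.generation == h.generation { slot.value.as_ref() } else { None }
    }
}
```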

matching engine

matching follows strict price-time priority. when a buy limit order arrives, the engine checks whether its price meets or exceeds the best ask. if so, it walks sell levels starting from best_ask_idx upward through worse prices until filled or no more levels cross.

within each price level, orders are stored in arrival order, matched front-to-back for FIFO priority. here’s the core of the buy-side matching loop, simplified:

// matching a buy order against sell side
let mut current_idx = self.best_ask_idx;
while let Some(idx) = current_idx {
    if order.quantity == 0 { break; }
    let price = self.sell_idx_to_price(idx);
    if price > order.price { break; }

    if let Some(ref mut level) = self.sell_levels[idx] {
        for resting_idx in level.order_indices.clone() {
            if order.quantity == 0 { break; }
            let resting = unsafe { self.order_pool.get_mut(resting_idx) };
            let fill_qty = std::cmp::min(resting.quantity, order.quantity);
            resting.quantity -= fill_qty;
            order.quantity -= fill_qty;
            level.total_quantity -= fill_qty;
            // ... emit execution, remove if filled
        }
        if level.is_empty() {
            self.sell_levels[idx] = None;
            // advance to next populated level
            current_idx = ((idx + 1)..PRICE_LEVELS)
                .find(|&i| self.sell_levels[i].is_some());
            self.best_ask_idx = current_idx;
            continue;
        }
    }
    // advance to next level
    current_idx = ((idx + 1)..PRICE_LEVELS)
        .find(|&i| self.sell_levels[i].is_some());
}

level.order_indices.clone() is the wart here. cloning the index vec before iterating is necessary because the loop body mutates the level (removing filled orders), and you can’t iterate and mutate the same vec. it’s a heap allocation per match cycle per level. a production implementation would use a cursor or drain pattern. that said, allocations here are small (a few hundred usize values at most) and don’t dominate benchmark numbers.

best bid and best ask indices are cached and updated lazily. only when a level empties does the engine scan forward to find the next non-empty level. for dense books near the spread, this scan is short. for 1024 levels, worst case is still just 1024 comparisons against Option::is_some(). i initially re-computed best prices on every operation and found that the scan was eating 15-20% of matching time, so lazy updates were a clear win.

benchmarks and build tuning

on the build side, the release profile is aggressive:

[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1
panic = "abort"
overflow-checks = false

codegen-units = 1 forces single compilation unit for better cross-function optimization. lto = "fat" extends this across dependency boundaries, so the linker can inline across crate borders. panic = "abort" avoids unwinding overhead. overflow-checks = false removes integer overflow guards, which matters when millions of subtractions per second are happening in the matching loop.

in .cargo/config.toml, target-cpu=native lets LLVM emit machine-specific instructions (e.g. AVX2/AVX-512 on Intel, NEON on Apple Silicon). the inline threshold is raised to 1000, roughly 3x the default, telling LLVM to inline larger functions. no-vectorize-loops is counterintuitive but intentional: matching loops have data-dependent control flow (early exits on fill, price checks) that auto-vectorization handles poorly. disabling it avoids bad vectorization decisions while still allowing explicit SIMD.
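as a sketch, that config might look like this (flag spellings follow rustc's -C codegen options; only the values described above come from the original):

```toml
# .cargo/config.toml
[build]
rustflags = [
    "-C", "target-cpu=native",                 # host-specific instructions
    "-C", "llvm-args=-inline-threshold=1000",  # ~3x LLVM's default
    "-C", "no-vectorize-loops",                # keep auto-vectorization off
]
```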

the codebase also includes a PriceLookupTable using portable_simd to compare four prices in parallel via simd_eq. it ended up not being wired into the main orderbook, since flat-array indexing already gives O(1) access and there’s nothing to search. it would make more sense in a variant where prices aren’t constrained to a fixed range and level lookup requires scanning.

with all that in place, five benchmark scenarios on an M4 Pro (48GB):

insertion of 100,000 limit orders across 100 price levels lands at roughly 700ns per order. matching 1,000 aggressive sell orders against 10,000 resting buys: around 300ns per match, including execution report generation. cancellation of 100,000 orders: approximately 200ns per cancel. market depth retrieval for 10 levels: about 50ns, which is essentially just a short array scan.

a mixed workload (70% inserts, 20% cancels, 10% market orders) sustained over 60 seconds: 1.5 million operations per second, average latency under 700ns. inserts are most expensive because they may trigger matching against the opposite side. market orders always trigger matching. cancels and depth queries are cheapest, which makes sense since they don’t touch the matching logic at all.

for a single-threaded implementation on commodity hardware, these are reasonable. to be fair, production systems at major exchanges do much better. they achieve sub-100ns latencies through kernel bypass (DPDK/io_uring), dedicated NIC hardware, and careful NUMA-aware memory placement. the gap between “fast on a laptop” and “fast in a colocation rack” is enormous.

limitations and what comes next

at 1024 price levels per side, the range constraint is the biggest limitation. real instruments can have price ranges spanning thousands of ticks, and re-centering on a price move is non-trivial. a hybrid approach, flat array for the hot zone around the spread plus a tree for outlier levels, would handle this.

there’s no support for order modification (cancel-replace). in real markets, market makers send modifications at far higher rates than new orders. on a typical equity exchange, cancel-replace volume can be 10x the new order volume. so that path needs to be just as fast as insertion, if not faster.

currently, unsafe in the order pool is minimal and well-scoped, but wrapping it behind the Allocator trait or switching to a generational arena would be a strict improvement. the free-list approach works, but it’s the kind of thing that bites you six months later when someone adds a code path that doesn’t respect the allocation discipline.

multi-threaded designs haven’t been explored. real exchange matching engines are single-threaded by necessity (price-time priority requires total event ordering), but gateway and risk layers around them are not. interfaces between those layers need careful thought about lock-free queues and memory ordering. long story short, the single-threaded matching loop is the easy part; everything around it is where the real engineering lives.

Thanks for reading!