Aaryamann Challani

Engineer and amateur cook writing about privacy, cryptography, and distributed systems


memcmp, but better*

Introduction

The title of this post is intentionally provocative. Please note that the roughly 40% performance improvement over the Rust standard library's memcmp described here is specific to systems that support the NEON instruction set, which includes all Apple Silicon Macs. The asterisk is there because "better" is subjective: a performance gain is not always objectively better once you account for how the resulting code looks.

Some time last year, I was investigating a bottleneck in one of the instructions on the FuelVM: the MEQ operation, which under the hood uses memcmp through a series of abstractions. The abstractions themselves were not the problem, thanks to Rust's zero-cost* approach to dealing with them; the bottleneck was the memcmp operation itself. Please note that this bottleneck was only observed in zkVM environments, such as SP1 and Risc0. Given how those zkVMs were constructed at the time, memory operations were slow, which made me dive deeper into why that was the case.

The Problem

On average, the MEQ operation consumed 20x more execution cycles in zkVMs than any other opcode in the FuelVM instruction set. Digging deeper, this was the function that behaved slowly when comparing large slices of memory:

// b and c point at the two memory regions being compared; d is the length in bytes.
pub(crate) fn memeq(
    memory: &mut MemoryInstance,
    result: &mut Word,
    pc: RegMut<PC>,
    b: Word,
    c: Word,
    d: Word,
) -> SimpleResult<()> {
    *result = (memory.read(b, d)? == memory.read(c, d)?) as Word;
    Ok(inc_pc(pc)?)
}

The == here goes through the PartialEq impl for references, which forwards to the slice comparison; for &[u8] that ends up as memcmp under the hood:

impl<A: ?Sized, B: ?Sized> PartialEq<&B> for &A
where
    A: PartialEq<B>,
{
    #[inline]
    fn eq(&self, other: &&B) -> bool {
        PartialEq::eq(*self, *other)
    }
    #[inline]
    fn ne(&self, other: &&B) -> bool {
        PartialEq::ne(*self, *other)
    }
}

The offending function had been found, so I started researching ways to improve the performance of this operation.

SIMD

SIMD (Single Instruction, Multiple Data) instructions apply a single operation to multiple pieces of data in parallel. The approach was pioneered by machines like the ILLIAC IV in 1972, and in the five decades since, consumer CPUs have gained extensive SIMD support, with different operating modes, different kinds of instructions, and industry standards built around them.

At first sight, SIMD appeared to be a viable solution to the problem. However, as with the rest of software engineering, everything is a trade-off. At the time of writing, portable SIMD is not supported by Rust's stable toolchain. There were two options:

  1. switch the whole FuelVM to nightly, potentially making a breaking change to the Fuel network in doing so.
  2. use CPU intrinsics for each target architecture, with some unsafe code.

Both options had significant implications for the stability of our network, but option 1 is more testable and would not require running Miri over the unsafe code that option 2 demands.

Please note that all benchmarks and tests were run in release mode, with the following RUSTFLAGS:

-C target-cpu=native -C target-feature=+neon

Option 1: Portable SIMD is not all sunshine and daisies

While I have a lot of respect for the maintainers of portable_simd (std::simd), I was unable to get significant performance gains for large slices (4 KiB and up) in memory. In some cases there was even a performance regression compared to the standard library's implementation.
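For context, a portable-SIMD comparison looks roughly like the sketch below. This is a minimal illustration of the approach, not the exact code I tried, and it assumes the nightly toolchain with the portable_simd feature enabled:

#![feature(portable_simd)]
use std::simd::Simd;

// Hypothetical portable-SIMD equality check over 16-byte lanes.
fn slices_equal_portable(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }

    const LANES: usize = 16;
    let mut chunks_a = a.chunks_exact(LANES);
    let mut chunks_b = b.chunks_exact(LANES);

    for (ca, cb) in (&mut chunks_a).zip(&mut chunks_b) {
        // Load 16 bytes from each slice and compare them in one shot.
        if Simd::<u8, LANES>::from_slice(ca) != Simd::<u8, LANES>::from_slice(cb) {
            return false;
        }
    }

    // Scalar comparison for whatever doesn't fill a full vector.
    chunks_a.remainder() == chunks_b.remainder()
}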

After inspecting the disassembly on Godbolt with the same flags, I noticed that the code LLVM generated was quite poor and didn't use as many lanes as it could have while comparing chunks of the memory slices.

At this point I went with option 2, to see whether real performance gains were on the table.

Option 2: CPU intrinsics

Since portable_simd abstracts over the intrinsics and we wanted to bypass that abstraction, there was a lot of documentation to read: instructions specific to NEON, instructions specific to Intel (AVX2, AVX-512), RISC-V's packed SIMD extension, and so on.

I eventually got an iteration working; however, it performed the same as the standard library and the portable_simd version. That's when I realized I hadn't been aggressive enough with loop unrolling and lane usage, and after increasing the number of lanes used per iteration, I saw a staggering 40% performance improvement for large slices. At first I brushed it off, assuming something was wrong with the implementation, but fuzzing the implementation removed that doubt.
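For the curious, the kind of differential check that builds this confidence looks roughly like the following. This is a hypothetical harness using the proptest crate, not the actual fuzzing setup used in the project (slices_equal_neon is the NEON implementation shown later in this post):

use proptest::prelude::*;

proptest! {
    #[test]
    fn simd_comparison_matches_std(
        a in proptest::collection::vec(any::<u8>(), 0..8192),
        mutate in any::<bool>(),
    ) {
        // Sometimes compare identical slices, sometimes a copy with one byte changed.
        let mut b = a.clone();
        if mutate && !b.is_empty() {
            let idx = b.len() / 2;
            b[idx] = b[idx].wrapping_add(1);
        }
        // The SIMD path must agree with the standard library's comparison.
        prop_assert_eq!(slices_equal_neon(&a, &b), a == b);
    }
}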

Please note that if you're too aggressive with how many lanes you use, you can slow down other parts of your software, which is why any system should be profiled as a whole before and after.

The benchmarks were conducted with divan (a sketch of the harness follows the tables below). Here is the performance for different slice sizes:

meq_performance_divan_plain  fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ meq_performance                        │               │               │               │         │
   ├─ 100                   40.31 ns      │ 61.66 µs      │ 41.31 ns      │ 928.8 ns      │ 100     │ 100
   ├─ 2000                  50.74 ns      │ 51.72 ns      │ 51.39 ns      │ 51.26 ns      │ 100     │ 12800
   ├─ 4000                  80.03 ns      │ 112.5 ns      │ 81.34 ns      │ 82.78 ns      │ 100     │ 6400
   ├─ 8000                  126.9 ns      │ 133.4 ns      │ 129.5 ns      │ 129.5 ns      │ 100     │ 3200
   ├─ 16000                 220.6 ns      │ 1.218 µs      │ 228.4 ns      │ 237.8 ns      │ 100     │ 1600
   ╰─ 32000                 413.3 ns      │ 582.6 ns      │ 423.7 ns      │ 424.8 ns      │ 100     │ 1600



meq_performance_divan_optimized  fastest    │ slowest       │ median        │ mean         │ samples │ iters
╰─ meq_performance                         │               │               │              │         │
   ├─ 100                    40.17 ns      │ 57.95 µs      │ 41.17 ns      │ 872 ns       │ 100     │ 100
   ├─ 2000                   36.28 ns      │ 40.19 ns      │ 36.93 ns      │ 37.37 ns     │ 100     │ 12800
   ├─ 4000                   49.3 ns       │ 53.86 ns      │ 49.62 ns      │ 49.68 ns     │ 100     │ 12800
   ├─ 8000                   75.34 ns      │ 87.06 ns      │ 77.94 ns      │ 77.41 ns     │ 100     │ 6400
   ├─ 16000                  129.3 ns      │ 947.1 ns      │ 131.9 ns      │ 140.1 ns     │ 100     │ 1600
   ╰─ 32000                  230.9 ns      │ 384.5 ns      │ 236.1 ns      │ 236.7 ns     │ 100     │ 1600
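
A harness producing numbers like these would look roughly as follows. This is a hedged sketch of a divan benchmark, not the exact code from the repository; the baseline run would simply call the standard == instead of slices_equal_neon:

fn main() {
    divan::main();
}

#[divan::bench(args = [100, 2000, 4000, 8000, 16000, 32000])]
fn meq_performance(bencher: divan::Bencher, len: usize) {
    // Two identical buffers force the comparison to scan the full length.
    let a = vec![0xABu8; len];
    let b = a.clone();

    bencher.bench(|| slices_equal_neon(divan::black_box(&a), divan::black_box(&b)));
}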

The comparison was also re-implemented for targets that support AVX2 and AVX-512, with comparable performance improvements.

Why the asterisk?

Long story short, the code is ugly. Here's a look at it:

fn slices_equal_neon(a: &[u8], b: &[u8]) -> bool {
    use std::arch::aarch64::*;

    if a.len() != b.len() {
        return false;
    }

    let len = a.len();
    let mut i = 0;
    const CHUNK: usize = 96;

    // if the slices are small, we don't need to
    // use SIMD instructions due to overhead
    if a.len() < CHUNK {
        return slices_equal_fallback(a, b);
    }

    unsafe {
        while i + CHUNK <= len {
            // Compare six 16-byte vectors (96 bytes) per iteration and AND the
            // equality masks together so one check covers the whole chunk.
            let mut cmp =
                vceqq_u8(vld1q_u8(a.as_ptr().add(i)), vld1q_u8(b.as_ptr().add(i)));

            cmp = vandq_u8(
                cmp,
                vceqq_u8(
                    vld1q_u8(a.as_ptr().add(i + 16)),
                    vld1q_u8(b.as_ptr().add(i + 16)),
                ),
            );
            cmp = vandq_u8(
                cmp,
                vceqq_u8(
                    vld1q_u8(a.as_ptr().add(i + 32)),
                    vld1q_u8(b.as_ptr().add(i + 32)),
                ),
            );
            cmp = vandq_u8(
                cmp,
                vceqq_u8(
                    vld1q_u8(a.as_ptr().add(i + 48)),
                    vld1q_u8(b.as_ptr().add(i + 48)),
                ),
            );
            cmp = vandq_u8(
                cmp,
                vceqq_u8(
                    vld1q_u8(a.as_ptr().add(i + 64)),
                    vld1q_u8(b.as_ptr().add(i + 64)),
                ),
            );
            cmp = vandq_u8(
                cmp,
                vceqq_u8(
                    vld1q_u8(a.as_ptr().add(i + 80)),
                    vld1q_u8(b.as_ptr().add(i + 80)),
                ),
            );

            // Every lane must still be 0xFF for the chunk to match, so check
            // the minimum value across the vector.
            if vminvq_u8(cmp) != 0xFF {
                return false;
            }

            i += CHUNK;
        }

        // Scalar comparison for the remainder
        a[i..] == b[i..]
    }
}
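
The snippet refers to slices_equal_fallback, which isn't shown above; for the purposes of this post you can assume it simply defers to the standard comparison:

// Assumed shape of the scalar fallback: let the standard library handle it.
fn slices_equal_fallback(a: &[u8], b: &[u8]) -> bool {
    a == b
}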

Imagine that, but for as many target architectures as you want to support with SIMD acceleration. It is a pain to maintain, but for some applications it is a significant throughput unlock. See the PR for the whole picture.
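To give a sense of what that maintenance burden looks like, here is a hedged sketch of an x86_64 equivalent using AVX2 intrinsics. It is illustrative only, not the exact code from the PR, and it reuses the assumed slices_equal_fallback from above:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn slices_equal_avx2(a: &[u8], b: &[u8]) -> bool {
    use std::arch::x86_64::*;

    if a.len() != b.len() {
        return false;
    }

    let len = a.len();
    let mut i = 0;
    const CHUNK: usize = 32; // one 256-bit register per iteration

    if len < CHUNK {
        return slices_equal_fallback(a, b);
    }

    while i + CHUNK <= len {
        let va = _mm256_loadu_si256(a.as_ptr().add(i) as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr().add(i) as *const __m256i);
        // Each equal byte produces 0xFF; movemask packs one bit per byte,
        // so an all-equal chunk yields a mask with all 32 bits set (-1 as i32).
        if _mm256_movemask_epi8(_mm256_cmpeq_epi8(va, vb)) != -1 {
            return false;
        }
        i += CHUNK;
    }

    // Scalar comparison for the remainder
    a[i..] == b[i..]
}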

Conclusion

Better performance is not always worth it; it almost always comes at the cost of readable code. If your business absolutely depends on shaving a few milliseconds off a service, this might be the kind of optimization you reach for. In most cases, though, this is classic over-engineering and shouldn't be replicated at all. It ultimately did not solve the problem of zkVMs spending many more cycles on the MEQ instruction, because they don't support SIMD yet, and neither does the memory model used by the underlying cryptography. That said, I'm quite happy with the outcome of this experiment.

Thanks for reading!