memcmp, but better*
Introduction
The title of this post is intentionally provocative.
Please note that the method I describe to gain around a 40% performance improvement over the Rust standard library's memcmp is specific to systems that support the NEON instruction set, which includes all Apple Silicon MacBooks.
The asterisk is there because "better" is subjective, and the performance gains may not always be worth it depending on how the resulting code looks.
Some time last year, I was investigating a bottleneck in one of the instructions on the FuelVM: the MEQ operation, which under the hood makes use of memcmp through a series of abstractions. The abstractions themselves were not the bottleneck, thanks to Rust's zero cost* approach to dealing with them; the memcmp operation itself was. Please note that this bottleneck was only observed in zkVM environments such as SP1 and Risc0. Given how those zkVMs were constructed at the time, memory operations were slow, which made me dive deeper into why this was the case.
The Problem
On average, the MEQ operation was consuming 20x the number of execution cycles in zkVMs compared to all the other opcodes supported by the FuelVM instruction set. Digging deeper, this was the function behaving slowly when comparing large slices of memory:
pub(crate) fn memeq(
memory: &mut MemoryInstance,
result: &mut Word,
pc: RegMut<PC>,
b: Word,
c: Word,
d: Word,
) -> SimpleResult<()> {
*result = (memory.read(b, d)? == memory.read(c, d)?) as Word;
Ok(inc_pc(pc)?)
}
The == here refers to the PartialEq implementation for &[u8], which is memcmp under the hood:
impl<A: ?Sized, B: ?Sized> PartialEq<&B> for &A
where
A: PartialEq<B>,
{
#[inline]
fn eq(&self, other: &&B) -> bool {
PartialEq::eq(*self, *other)
}
#[inline]
fn ne(&self, other: &&B) -> bool {
PartialEq::ne(*self, *other)
}
}
The offending function had been found, so I started researching ways to improve the performance of this operation.
SIMD
SIMD (Single Instruction, Multiple Data) instructions apply a single operation to multiple pieces of data in parallel. They are a staple of modern CPUs, with early usage pioneered by the ILLIAC IV in 1972. It's been over five decades since then, and in that time consumer CPUs have improved massively, gaining different operating modes, different kinds of instructions, and industry standards built around SIMD.
At first sight, SIMD appeared to be a viable solution to the problem. However, as with the rest of software engineering, everything is a trade-off. At the time of writing this article, portable SIMD is not supported by Rust's stable toolchain. There were two options:
- switch the whole FuelVM to nightly, and potentially make a breaking change to the Fuel network in doing so.
- make use of CPU intrinsics for each target architecture with some unsafe code.
Both options had significant implications for the stability of our network, but option 1 was more testable and would not require us to run Miri around the unsafe code that option 2 needs.
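To make option 2 concrete, the rough shape is a single entry point that picks an architecture-specific implementation at compile time and falls back to the standard comparison everywhere else. This is a hedged sketch with illustrative names (the NEON and AVX2 routines it calls are along the lines of the ones shown later in this post), not the exact wiring used in the FuelVM:

pub fn slices_equal(a: &[u8], b: &[u8]) -> bool {
    // aarch64 always has NEON, so the intrinsics path is safe to select here.
    #[cfg(target_arch = "aarch64")]
    {
        return slices_equal_neon(a, b);
    }
    // On x86_64 this relies on AVX2 being enabled at compile time
    // (e.g. via -C target-cpu=native).
    #[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
    {
        return unsafe { slices_equal_avx2(a, b) };
    }
    // Everything else falls back to the standard library's comparison.
    #[allow(unreachable_code)]
    {
        a == b
    }
}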
Please note that all benchmarks and tests were run in release mode, with the following RUSTFLAGS:
-C target-cpu=native -C target-feature=+neon
Option 1: Portable SIMD is not all sunshine and daisies
While I have respect for the maintainers of the portable_simd module, I was unable to get significant performance gains for large slices (4 KB+) in memory. In some cases there was even a performance degradation compared to the standard library's implementation.
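For reference, the portable SIMD attempt looked roughly like the sketch below. This is my reconstruction rather than the exact code; it needs a nightly toolchain with the portable_simd feature, and the std::simd trait paths may still shift before stabilization.

#![feature(portable_simd)]
use std::simd::cmp::SimdPartialEq;
use std::simd::u8x16;

fn slices_equal_portable(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut chunks_a = a.chunks_exact(16);
    let mut chunks_b = b.chunks_exact(16);
    for (ca, cb) in (&mut chunks_a).zip(&mut chunks_b) {
        // Lane-wise comparison of 16 bytes at a time; `all` checks every lane matched.
        if !u8x16::from_slice(ca).simd_eq(u8x16::from_slice(cb)).all() {
            return false;
        }
    }
    // Scalar comparison for the tail that doesn't fill a full vector.
    chunks_a.remainder() == chunks_b.remainder()
}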
After inspecting the disassembly on Godbolt with the same flags, I noticed that the auto-vectorization by LLVM was quite poor and didn't make use of as many lanes as possible while comparing chunks of the memory slices.
At this point I went with Option 2, to see whether real performance gains were on the table.
Option 2: CPU intrinsics
Since portable_simd abstracts over the intrinsics, and we wanted to bypass that abstraction, there was a lot of documentation to read: instructions specific to NEON, instructions specific to Intel (AVX2, AVX-512), RISC-V packed SIMD, and so on.
I eventually got an iteration working; however, it showed the same performance as the standard library and the portable_simd version. That's when I realized I hadn't been aggressive enough with the loop unrolling and lane usage, and after increasing the number of lanes used, I achieved a staggering 40% performance improvement for large slices. I initially brushed it off, thinking something was wrong with the implementation, but fuzzing the implementation removed that doubt from my mind.
Please note that if you're too aggressive with how many lanes you use, you can slow down other parts of your software, which is why any system should be profiled as a whole before and after.
The benchmarks were conducted with divan, and here is the performance for different slice sizes:
meq_performance_divan_plain fastest │ slowest │ median │ mean │ samples │ iters
╰─ meq_performance │ │ │ │ │
├─ 100 40.31 ns │ 61.66 µs │ 41.31 ns │ 928.8 ns │ 100 │ 100
├─ 2000 50.74 ns │ 51.72 ns │ 51.39 ns │ 51.26 ns │ 100 │ 12800
├─ 4000 80.03 ns │ 112.5 ns │ 81.34 ns │ 82.78 ns │ 100 │ 6400
├─ 8000 126.9 ns │ 133.4 ns │ 129.5 ns │ 129.5 ns │ 100 │ 3200
├─ 16000 220.6 ns │ 1.218 µs │ 228.4 ns │ 237.8 ns │ 100 │ 1600
╰─ 32000 413.3 ns │ 582.6 ns │ 423.7 ns │ 424.8 ns │ 100 │ 1600
meq_performance_divan_optimized fastest │ slowest │ median │ mean │ samples │ iters
╰─ meq_performance │ │ │ │ │
├─ 100 40.17 ns │ 57.95 µs │ 41.17 ns │ 872 ns │ 100 │ 100
├─ 2000 36.28 ns │ 40.19 ns │ 36.93 ns │ 37.37 ns │ 100 │ 12800
├─ 4000 49.3 ns │ 53.86 ns │ 49.62 ns │ 49.68 ns │ 100 │ 12800
├─ 8000 75.34 ns │ 87.06 ns │ 77.94 ns │ 77.41 ns │ 100 │ 6400
├─ 16000 129.3 ns │ 947.1 ns │ 131.9 ns │ 140.1 ns │ 100 │ 1600
╰─ 32000 230.9 ns │ 384.5 ns │ 236.1 ns │ 236.7 ns │ 100 │ 1600
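For completeness, the shape of the divan benchmark is roughly the following (the slice sizes match the tables above; the names are illustrative, not the ones from the actual benchmark suite):

// Lives under benches/ with `harness = false` and divan as a dev-dependency.
fn main() {
    divan::main();
}

#[divan::bench(args = [100, 2000, 4000, 8000, 16000, 32000])]
fn meq_performance(bencher: divan::Bencher, len: usize) {
    // Two identical slices force the comparison to walk the full length.
    let a = vec![0xABu8; len];
    let b = a.clone();
    bencher.bench_local(|| {
        divan::black_box(slices_equal_neon(
            divan::black_box(&a),
            divan::black_box(&b),
        ))
    });
}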
The comparison was then re-implemented for targets that support AVX2 and AVX-512 instructions, with comparable performance improvements.
Why the asterisk?
Long story short: the code is ugly. Here's a look at it:
fn slices_equal_neon(a: &[u8], b: &[u8]) -> bool {
use std::arch::aarch64::*;
if a.len() != b.len() {
return false;
}
let len = a.len();
let mut i = 0;
const CHUNK: usize = 96;
// if the slices are small, we don't need to
// use SIMD instructions due to overhead
if a.len() < CHUNK {
return slices_equal_fallback(a, b);
}
unsafe {
        // Compare one 96-byte chunk per iteration: six 16-byte NEON loads,
        // with the per-byte equality masks ANDed together.
        while i + CHUNK <= len {
let mut cmp =
vceqq_u8(vld1q_u8(a.as_ptr().add(i)), vld1q_u8(b.as_ptr().add(i)));
cmp = vandq_u8(
cmp,
vceqq_u8(
vld1q_u8(a.as_ptr().add(i + 16)),
vld1q_u8(b.as_ptr().add(i + 16)),
),
);
cmp = vandq_u8(
cmp,
vceqq_u8(
vld1q_u8(a.as_ptr().add(i + 32)),
vld1q_u8(b.as_ptr().add(i + 32)),
),
);
cmp = vandq_u8(
cmp,
vceqq_u8(
vld1q_u8(a.as_ptr().add(i + 48)),
vld1q_u8(b.as_ptr().add(i + 48)),
),
);
cmp = vandq_u8(
cmp,
vceqq_u8(
vld1q_u8(a.as_ptr().add(i + 64)),
vld1q_u8(b.as_ptr().add(i + 64)),
),
);
cmp = vandq_u8(
cmp,
vceqq_u8(
vld1q_u8(a.as_ptr().add(i + 80)),
vld1q_u8(b.as_ptr().add(i + 80)),
),
);
            // vminvq_u8 takes the horizontal minimum across the lanes; it is
            // 0xFF only if every byte in this 96-byte chunk matched.
            if vminvq_u8(cmp) != 0xFF {
                return false;
            }
i += CHUNK;
}
// Scalar comparison for the remainder
a[i..] == b[i..]
}
}
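slices_equal_fallback isn't shown here; in this sketch it is just the plain standard-library comparison (which, as discussed above, is memcmp under the hood), used when the slices are too small for the SIMD setup overhead to pay off:

fn slices_equal_fallback(a: &[u8], b: &[u8]) -> bool {
    // Defer to the standard library's slice equality, i.e. memcmp.
    a == b
}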
Imagine that, but for as many target architectures as you want to support with SIMD acceleration. It is a pain to maintain, but for some applications it is a significant unlock for throughput. See the PR for the whole picture.
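For a sense of what that means, here's a rough AVX2 analogue of the same idea. This is my sketch, not the code from the PR, and the real version unrolls more aggressively:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn slices_equal_avx2(a: &[u8], b: &[u8]) -> bool {
    use std::arch::x86_64::*;
    if a.len() != b.len() {
        return false;
    }
    let len = a.len();
    let mut i = 0;
    // Compare two 32-byte vectors per iteration (64 bytes per chunk).
    while i + 64 <= len {
        let cmp0 = _mm256_cmpeq_epi8(
            _mm256_loadu_si256(a.as_ptr().add(i) as *const __m256i),
            _mm256_loadu_si256(b.as_ptr().add(i) as *const __m256i),
        );
        let cmp1 = _mm256_cmpeq_epi8(
            _mm256_loadu_si256(a.as_ptr().add(i + 32) as *const __m256i),
            _mm256_loadu_si256(b.as_ptr().add(i + 32) as *const __m256i),
        );
        // movemask packs each byte's comparison into one bit; -1 means all 64 bytes matched.
        if _mm256_movemask_epi8(_mm256_and_si256(cmp0, cmp1)) != -1 {
            return false;
        }
        i += 64;
    }
    // Scalar comparison for the remainder.
    a[i..] == b[i..]
}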
Conclusion
Better performance is not always better overall; it almost always comes at the cost of readable code. If your business absolutely depends on shaving a few milliseconds off a service, this might be the kind of optimization you do. In most cases, though, this is just classic over-engineering and shouldn't be replicated at all. Ultimately, this did not solve the problem of zkVMs consuming many more cycles to handle the MEQ instruction, because they don't support SIMD yet, and neither does the memory model used by the underlying cryptography. That being said, I'm quite happy with the outcome of this experiment.
Thanks for reading!