A friend shared Sylvain Kerkour's post SIMD programming in pure Rust which covers AVX-512 on AMD Zen 5. That got me curious about ARM's side of the story—specifically how std::simd compares to hand-written NEON intrinsics. My main development machine is a MacBook Pro with Apple M4, so I ran the benchmarks there.
I ran 9 benchmarks comparing three approaches: scalar, std::simd, and NEON. All numbers below are averaged from 3 runs.
Key finding: std::simd ranged from 9.2x faster to 7.7x slower than scalar code. NEON always delivered speedups (1.2x–9.0x). The difference comes down to data layout—interleaved data like RGB images and stereo audio exposed limitations in the portable SIMD abstraction.
Test Environment
- CPU: Apple M4 (MacBook Pro 2024)
- Rust: rustc 1.94.0-nightly (2026-01-14)
- OS: macOS
Command:

```
cargo run --release
```
Three implementations per scenario:
| Approach | Stability | Portability |
|---|---|---|
| Scalar | stable | everywhere |
| std::simd | nightly | cross-platform |
| std::arch (NEON) | stable | ARM64 only |
Results Summary
| Scenario | Scalar | std::simd | NEON | std::simd speedup | NEON speedup |
|---|---|---|---|---|---|
| RGB→Grayscale | 6.16ms | 18.51ms | 1.45ms | 0.33x | 4.26x |
| Volume Adjust | 1.39ms | 3.42ms | 0.82ms | 0.40x | 1.70x |
| Audio Mixing | 0.63ms | 4.84ms | 0.53ms | 0.13x | 1.20x |
| Count Newlines | 14.10ms | 3.22ms | 3.87ms | 4.38x | 3.65x |
| Find Byte | 24.07ms | 2.61ms | 2.68ms | 9.23x | 9.00x |
| Dot Product | 51.20ms | 10.99ms | 20.14ms | 4.66x | 2.54x |
| Matrix-Vec Mul | 4.48ms | 0.69ms | 1.35ms | 6.53x | 3.32x |
| Range Check | 24.93ms | 8.24ms | 9.33ms | 3.02x | 2.67x |
| Sorted Check | 25.14ms | 37.18ms | 6.87ms | 0.68x | 3.66x |
Three scenarios showed std::simd slower than scalar: RGB conversion (0.33x), audio mixing (0.13x), and sorted check (0.68x).
Scenario 1: RGB to Grayscale
Task: Convert 1920×1080 RGB image to grayscale using ITU-R BT.601 formula.
I used fixed-point arithmetic to avoid floating-point operations—much faster on integer pipelines. The weights sum to 256, so the result stays within u8 range: `Gray = (77*R + 150*G + 29*B) >> 8`
Scalar
The scalar version is straightforward with chunks_exact(3):
```rust
pub fn rgb_to_grayscale(rgb: &[u8], gray: &mut [u8]) {
    for (i, chunk) in rgb.chunks_exact(3).enumerate() {
        let r = chunk[0] as u32;
        let g = chunk[1] as u32;
        let b = chunk[2] as u32;
        gray[i] = ((77 * r + 150 * g + 29 * b) >> 8) as u8;
    }
}
```
std::simd
```rust
use std::simd::prelude::*; // nightly: requires #![feature(portable_simd)]

pub fn rgb_to_grayscale(rgb: &[u8], gray: &mut [u8]) {
    let chunks = rgb.chunks_exact(48); // 16 pixels × 3 bytes
    let weight_r = u16x16::splat(77);
    let weight_g = u16x16::splat(150);
    let weight_b = u16x16::splat(29);
    let mut out_idx = 0;
    for chunk in chunks {
        // No deinterleave instruction available.
        // Must use a scalar loop: 48 memory accesses per iteration.
        let mut r_vals = [0u8; 16];
        let mut g_vals = [0u8; 16];
        let mut b_vals = [0u8; 16];
        for i in 0..16 {
            r_vals[i] = chunk[i * 3];
            g_vals[i] = chunk[i * 3 + 1];
            b_vals[i] = chunk[i * 3 + 2];
        }
        let r = u16x16::from_array(r_vals.map(|x| x as u16));
        let g = u16x16::from_array(g_vals.map(|x| x as u16));
        let b = u16x16::from_array(b_vals.map(|x| x as u16));
        let gray_u16 = (r * weight_r + g * weight_g + b * weight_b) >> Simd::splat(8);
        let gray_u8: [u8; 16] = gray_u16.to_array().map(|x| x as u8);
        gray[out_idx..out_idx + 16].copy_from_slice(&gray_u8);
        out_idx += 16;
    }
    // Remainder (< 16 pixels) omitted: 1920×1080×3 divides evenly by 48.
}
```
NEON
```rust
use std::arch::aarch64::*; // stable on aarch64

pub fn rgb_to_grayscale(rgb: &[u8], gray: &mut [u8]) {
    let pixels = rgb.len() / 3;
    let simd_len = pixels - (pixels % 16);
    unsafe {
        let weight_r = vdupq_n_u8(77);
        let weight_g = vdupq_n_u8(150);
        let weight_b = vdupq_n_u8(29);
        for i in (0..simd_len).step_by(16) {
            // vld3q_u8: load + deinterleave in one instruction
            // [R0,G0,B0,R1,G1,B1,...] → R[], G[], B[]
            let rgb_data = vld3q_u8(rgb.as_ptr().add(i * 3));
            let r = rgb_data.0;
            let g = rgb_data.1;
            let b = rgb_data.2;
            // Widening multiply: u8 × u8 → u16
            let r_lo = vmull_u8(vget_low_u8(r), vget_low_u8(weight_r));
            let r_hi = vmull_high_u8(r, weight_r);
            let g_lo = vmull_u8(vget_low_u8(g), vget_low_u8(weight_g));
            let g_hi = vmull_high_u8(g, weight_g);
            let b_lo = vmull_u8(vget_low_u8(b), vget_low_u8(weight_b));
            let b_hi = vmull_high_u8(b, weight_b);
            let sum_lo = vaddq_u16(vaddq_u16(r_lo, g_lo), b_lo);
            let sum_hi = vaddq_u16(vaddq_u16(r_hi, g_hi), b_hi);
            // Shift right + narrow: u16 → u8
            let gray_lo = vshrn_n_u16(sum_lo, 8);
            let gray_hi = vshrn_n_u16(sum_hi, 8);
            let gray_vec = vcombine_u8(gray_lo, gray_hi);
            vst1q_u8(gray.as_mut_ptr().add(i), gray_vec);
        }
    }
    // Scalar tail for the remaining pixels
    for i in simd_len..pixels {
        let (r, g, b) = (rgb[i * 3] as u32, rgb[i * 3 + 1] as u32, rgb[i * 3 + 2] as u32);
        gray[i] = ((77 * r + 150 * g + 29 * b) >> 8) as u8;
    }
}
```
Results
| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 6.16ms | 1.0x |
| std::simd | 18.51ms | 0.33x |
| NEON | 1.45ms | 4.26x |
Analysis: vld3q_u8 performs load and 3-way deinterleave in one instruction. std::simd lacks this capability, requiring a 16-iteration scalar loop that dominates execution time.
Scenario 2: Audio Volume Adjustment
Task: Multiply 880K audio samples (i16) by gain factor 0.8.
I used fixed-point here too—the gain is scaled to an integer (0.8 × 256 = 204) and the product shifted right by 8 bits, keeping everything in the integer domain at a small precision cost (204/256 ≈ 0.797).
Scalar
```rust
pub fn adjust_volume(samples: &mut [i16], volume: f32) {
    for sample in samples.iter_mut() {
        let adjusted = (*sample as f32 * volume) as i32;
        *sample = adjusted.clamp(-32768, 32767) as i16;
    }
}
```
std::simd
```rust
pub fn adjust_volume(samples: &mut [i16], volume: f32) {
    let vol_fixed = (volume * 256.0) as i32;
    let vol_vec = i32x8::splat(vol_fixed);
    let simd_len = samples.len() - (samples.len() % 8);
    for i in (0..simd_len).step_by(8) {
        // No i16 → i32 widening operation: scalar loop
        let mut vals = [0i32; 8];
        for j in 0..8 {
            vals[j] = samples[i + j] as i32;
        }
        let v = i32x8::from_array(vals);
        let adjusted = (v * vol_vec) >> Simd::splat(8);
        let clamped = adjusted.simd_clamp(i32x8::splat(-32768), i32x8::splat(32767));
        // No i32 → i16 narrowing operation: another scalar loop
        for (j, &val) in clamped.to_array().iter().enumerate() {
            samples[i + j] = val as i16;
        }
    }
    // Scalar tail
    for s in &mut samples[simd_len..] {
        *s = ((*s as i32 * vol_fixed) >> 8).clamp(-32768, 32767) as i16;
    }
}
```
NEON
```rust
pub fn adjust_volume(samples: &mut [i16], volume: f32) {
    let vol_fixed = (volume * 256.0) as i32;
    let simd_len = samples.len() - (samples.len() % 8);
    unsafe {
        let vol_vec = vdupq_n_s32(vol_fixed);
        for i in (0..simd_len).step_by(8) {
            let v = vld1q_s16(samples.as_ptr().add(i));
            // vmovl: widen i16 → i32
            let v_lo = vmovl_s16(vget_low_s16(v));
            let v_hi = vmovl_high_s16(v);
            let mul_lo = vmulq_s32(v_lo, vol_vec);
            let mul_hi = vmulq_s32(v_hi, vol_vec);
            let shifted_lo = vshrq_n_s32(mul_lo, 8);
            let shifted_hi = vshrq_n_s32(mul_hi, 8);
            // vqmovn: saturating narrow i32 → i16
            let result_lo = vqmovn_s32(shifted_lo);
            let result_hi = vqmovn_s32(shifted_hi);
            let result = vcombine_s16(result_lo, result_hi);
            vst1q_s16(samples.as_mut_ptr().add(i), result);
        }
    }
    // Scalar tail
    for s in &mut samples[simd_len..] {
        *s = ((*s as i32 * vol_fixed) >> 8).clamp(-32768, 32767) as i16;
    }
}
```
Results
| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 1.39ms | 1.0x |
| std::simd | 3.42ms | 0.40x |
| NEON | 0.82ms | 1.70x |
Analysis: std::simd has no direct equivalent of these type-conversion instructions. vmovl (widen) and vqmovn (saturating narrow) are single instructions in NEON, but the portable version above resorts to scalar loops.
Scenario 3: Audio Mixing
Task: Mix two audio tracks using (a + b) / 2 to prevent overflow.
Dividing the sum by 2 is the classic way to mix two tracks without clipping. It's simple, but the addition has to be widened to i32 first to avoid overflow.
Scalar
```rust
pub fn mix_tracks(track_a: &[i16], track_b: &[i16], output: &mut [i16]) {
    for ((a, b), out) in track_a.iter().zip(track_b.iter()).zip(output.iter_mut()) {
        *out = ((*a as i32 + *b as i32) / 2) as i16;
    }
}
```
std::simd
```rust
pub fn mix_tracks(track_a: &[i16], track_b: &[i16], output: &mut [i16]) {
    let simd_len = track_a.len() - (track_a.len() % 8);
    for i in (0..simd_len).step_by(8) {
        // Widen i16 → i32 with a scalar loop
        let mut a_vals = [0i32; 8];
        let mut b_vals = [0i32; 8];
        for j in 0..8 {
            a_vals[j] = track_a[i + j] as i32;
            b_vals[j] = track_b[i + j] as i32;
        }
        let a = i32x8::from_array(a_vals);
        let b = i32x8::from_array(b_vals);
        let mixed = (a + b) >> Simd::splat(1);
        // Narrow i32 → i16 with another scalar loop
        for (j, &val) in mixed.to_array().iter().enumerate() {
            output[i + j] = val as i16;
        }
    }
    // Scalar tail
    for i in simd_len..track_a.len() {
        output[i] = ((track_a[i] as i32 + track_b[i] as i32) / 2) as i16;
    }
}
```
NEON
```rust
pub fn mix_tracks(track_a: &[i16], track_b: &[i16], output: &mut [i16]) {
    let simd_len = track_a.len() - (track_a.len() % 8);
    unsafe {
        for i in (0..simd_len).step_by(8) {
            let a = vld1q_s16(track_a.as_ptr().add(i));
            let b = vld1q_s16(track_b.as_ptr().add(i));
            // vhaddq: halving add, computes (a + b) / 2 without overflow
            let mixed = vhaddq_s16(a, b);
            vst1q_s16(output.as_mut_ptr().add(i), mixed);
        }
    }
    // Scalar tail
    for i in simd_len..track_a.len() {
        output[i] = ((track_a[i] as i32 + track_b[i] as i32) / 2) as i16;
    }
}
```
Results
| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 0.63ms | 1.0x |
| std::simd | 4.84ms | 0.13x |
| NEON | 0.53ms | 1.20x |
Analysis: vhaddq_s16 performs halving add in one instruction—designed specifically for audio mixing. std::simd requires widening to i32, adding, shifting, and narrowing back, plus three scalar loops for type conversions.
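Side note: there is a way to avoid the widen/narrow round-trip in std::simd that I didn't benchmark—the classic bit trick `(a & b) + ((a ^ b) >> 1)`, which computes a floored average entirely in i16. A sketch of my own (note it floors rather than truncating toward zero, so it differs from `/ 2` when a + b is negative):

```rust
use std::simd::prelude::*;

pub fn mix_tracks_avg(track_a: &[i16], track_b: &[i16], output: &mut [i16]) {
    let simd_len = track_a.len() - (track_a.len() % 8);
    for i in (0..simd_len).step_by(8) {
        let a = i16x8::from_slice(&track_a[i..i + 8]);
        let b = i16x8::from_slice(&track_b[i..i + 8]);
        // a + b == 2*(a & b) + (a ^ b), so this is floor((a + b) / 2)
        // with no risk of i16 overflow.
        let mixed = (a & b) + ((a ^ b) >> Simd::splat(1));
        output[i..i + 8].copy_from_slice(&mixed.to_array());
    }
    // Scalar tail
    for i in simd_len..track_a.len() {
        output[i] = ((track_a[i] as i32 + track_b[i] as i32) / 2) as i16;
    }
}
```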
Scenario 4: Counting Newlines
Task: Count occurrences of \n in 10MB text.
This is where SIMD really shines—contiguous data with a simple comparison. For the scalar baseline I used filter().count(), which Rust's iterators keep clean.
Scalar
```rust
pub fn count_byte(data: &[u8], target: u8) -> usize {
    data.iter().filter(|&&b| b == target).count()
}
```
std::simd
```rust
pub fn count_byte(data: &[u8], target: u8) -> usize {
    let target_vec = u8x32::splat(target);
    let mut chunks = data.chunks_exact(32);
    let mut count = 0usize;
    for chunk in &mut chunks {
        let v = u8x32::from_slice(chunk);
        let mask = v.simd_eq(target_vec);
        count += mask.to_bitmask().count_ones() as usize;
    }
    count += chunks.remainder().iter().filter(|&&b| b == target).count();
    count
}
```
NEON
```rust
pub fn count_byte(data: &[u8], target: u8) -> usize {
    let simd_len = data.len() - (data.len() % 16);
    let mut count = 0usize;
    unsafe {
        let target_vec = vdupq_n_u8(target);
        for i in (0..simd_len).step_by(16) {
            let v = vld1q_u8(data.as_ptr().add(i));
            let eq = vceqq_u8(v, target_vec);
            // 0xFF lanes → 0x01, so the horizontal add counts matches
            let ones = vshrq_n_u8(eq, 7);
            count += vaddvq_u8(ones) as usize;
        }
    }
    count + data[simd_len..].iter().filter(|&&b| b == target).count()
}
```
Results
| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 14.10ms | 1.0x |
| std::simd | 3.22ms | 4.38x |
| NEON | 3.87ms | 3.65x |
Analysis: Contiguous data, simple comparison, no type conversions. Both SIMD approaches perform well. The std::simd version uses 256-bit u8x32 vectors (which the compiler lowers to pairs of NEON registers), while the NEON loop handles 128 bits per iteration—likely the source of the small std::simd edge.
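For what it's worth, the NEON loop could also consume 32 bytes per iteration. A 2×-unrolled sketch of my own (not part of the benchmark):

```rust
use std::arch::aarch64::*;

pub fn count_byte_unrolled(data: &[u8], target: u8) -> usize {
    let simd_len = data.len() - (data.len() % 32);
    let mut count = 0usize;
    unsafe {
        let target_vec = vdupq_n_u8(target);
        for i in (0..simd_len).step_by(32) {
            let v0 = vld1q_u8(data.as_ptr().add(i));
            let v1 = vld1q_u8(data.as_ptr().add(i + 16));
            // 0xFF lanes → 0x01; each horizontal add sums at most 16
            let ones0 = vshrq_n_u8(vceqq_u8(v0, target_vec), 7);
            let ones1 = vshrq_n_u8(vceqq_u8(v1, target_vec), 7);
            count += (vaddvq_u8(ones0) + vaddvq_u8(ones1)) as usize;
        }
    }
    count + data[simd_len..].iter().filter(|&&b| b == target).count()
}
```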
Scenario 5: Finding a Byte
Task: Find position of @ in 10MB data (target near end).
I placed the target byte near the end to simulate a worst-case search. The position() adapter keeps the scalar version concise.
Scalar
```rust
pub fn find_byte(data: &[u8], target: u8) -> Option<usize> {
    data.iter().position(|&b| b == target)
}
```
std::simd
```rust
pub fn find_byte(data: &[u8], target: u8) -> Option<usize> {
    let simd_len = data.len() - (data.len() % 32);
    let target_vec = u8x32::splat(target);
    for (chunk_idx, chunk) in data.chunks_exact(32).enumerate() {
        let v = u8x32::from_slice(chunk);
        let mask = v.simd_eq(target_vec);
        let bitmask = mask.to_bitmask();
        if bitmask != 0 {
            return Some(chunk_idx * 32 + bitmask.trailing_zeros() as usize);
        }
    }
    // Search the tail that chunks_exact left over
    data[simd_len..]
        .iter()
        .position(|&b| b == target)
        .map(|p| simd_len + p)
}
```
NEON
```rust
pub fn find_byte(data: &[u8], target: u8) -> Option<usize> {
    let len = data.len();
    let simd_len = len - (len % 16);
    unsafe {
        let target_vec = vdupq_n_u8(target);
        for i in (0..simd_len).step_by(16) {
            let v = vld1q_u8(data.as_ptr().add(i));
            let eq = vceqq_u8(v, target_vec);
            // vmaxvq_u8: if any lane is 0xFF, the max is 0xFF
            if vmaxvq_u8(eq) != 0 {
                // Found a match; linear search for the exact position
                for j in 0..16 {
                    if data[i + j] == target {
                        return Some(i + j);
                    }
                }
            }
        }
    }
    // Scalar search over the remaining bytes
    data[simd_len..]
        .iter()
        .position(|&b| b == target)
        .map(|p| simd_len + p)
}
```
Results
| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 24.07ms | 1.0x |
| std::simd | 2.61ms | 9.23x |
| NEON | 2.68ms | 9.00x |
Analysis: Best-case SIMD scenario. Compare 32 bytes per iteration, early exit on match.
Scenario 6: Dot Product
Task: Dot product of two 10M-element f32 vectors.
Classic numerical computing workload. Rust's iterator chain makes the scalar version readable.
Scalar
```rust
pub fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
```
std::simd
```rust
pub fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    let chunks_a = a.chunks_exact(8);
    let chunks_b = b.chunks_exact(8);
    // Scalar tail for lengths that aren't a multiple of 8
    let tail: f32 = chunks_a
        .remainder()
        .iter()
        .zip(chunks_b.remainder())
        .map(|(x, y)| x * y)
        .sum();
    let mut acc = f32x8::splat(0.0);
    for (a_chunk, b_chunk) in chunks_a.zip(chunks_b) {
        let va = f32x8::from_slice(a_chunk);
        let vb = f32x8::from_slice(b_chunk);
        acc += va * vb;
    }
    acc.reduce_sum() + tail
}
```
NEON
```rust
pub fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    let simd_len = a.len() - (a.len() % 4);
    unsafe {
        let mut acc = vdupq_n_f32(0.0);
        for i in (0..simd_len).step_by(4) {
            let va = vld1q_f32(a.as_ptr().add(i));
            let vb = vld1q_f32(b.as_ptr().add(i));
            // Fused multiply-add: acc += va * vb
            acc = vfmaq_f32(acc, va, vb);
        }
        let mut sum = vaddvq_f32(acc);
        // Scalar tail
        for i in simd_len..a.len() {
            sum += a[i] * b[i];
        }
        sum
    }
}
```
Results
| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 51.20ms | 1.0x |
| std::simd | 10.99ms | 4.66x |
| NEON | 20.14ms | 2.54x |
Analysis: std::simd outperforms hand-written NEON. std::simd uses f32x8 (256-bit), while the NEON implementation uses f32x4 (128-bit). The compiler generates efficient code from the portable abstraction.
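Vector width is only part of it: the NEON loop also accumulates into a single register, so each vfmaq_f32 has to wait for the previous one. A multi-accumulator variant should recover much of the gap—a sketch of my own, not part of the benchmark:

```rust
use std::arch::aarch64::*;

// Four independent accumulators break the loop-carried dependency on the FMA.
pub fn dot_product_4acc(a: &[f32], b: &[f32]) -> f32 {
    let simd_len = a.len() - (a.len() % 16);
    unsafe {
        let mut acc0 = vdupq_n_f32(0.0);
        let mut acc1 = vdupq_n_f32(0.0);
        let mut acc2 = vdupq_n_f32(0.0);
        let mut acc3 = vdupq_n_f32(0.0);
        for i in (0..simd_len).step_by(16) {
            acc0 = vfmaq_f32(acc0, vld1q_f32(a.as_ptr().add(i)), vld1q_f32(b.as_ptr().add(i)));
            acc1 = vfmaq_f32(acc1, vld1q_f32(a.as_ptr().add(i + 4)), vld1q_f32(b.as_ptr().add(i + 4)));
            acc2 = vfmaq_f32(acc2, vld1q_f32(a.as_ptr().add(i + 8)), vld1q_f32(b.as_ptr().add(i + 8)));
            acc3 = vfmaq_f32(acc3, vld1q_f32(a.as_ptr().add(i + 12)), vld1q_f32(b.as_ptr().add(i + 12)));
        }
        let mut sum = vaddvq_f32(vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3)));
        // Scalar tail
        for i in simd_len..a.len() {
            sum += a[i] * b[i];
        }
        sum
    }
}
```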
Scenario 7: Matrix-Vector Multiplication
Task: 1024×1024 matrix times 1024-element vector.
I reused the dot product implementation row by row—keeps the code simple.
Scalar
```rust
pub fn matrix_vector_mul(matrix: &[f32], vector: &[f32], result: &mut [f32], rows: usize, cols: usize) {
    for i in 0..rows {
        let row_start = i * cols;
        result[i] = matrix[row_start..row_start + cols]
            .iter()
            .zip(vector.iter())
            .map(|(m, v)| m * v)
            .sum();
    }
}
```
std::simd
```rust
pub fn matrix_vector_mul(matrix: &[f32], vector: &[f32], result: &mut [f32], rows: usize, cols: usize) {
    for i in 0..rows {
        let row = &matrix[i * cols..(i + 1) * cols];
        result[i] = dot_product(row, vector); // reuse the SIMD dot product
    }
}
```
NEON
```rust
pub fn matrix_vector_mul(matrix: &[f32], vector: &[f32], result: &mut [f32], rows: usize, cols: usize) {
    for i in 0..rows {
        let row = &matrix[i * cols..(i + 1) * cols];
        result[i] = dot_product(row, vector); // reuse the NEON dot product
    }
}
```
Results
| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 4.48ms | 1.0x |
| std::simd | 0.69ms | 6.53x |
| NEON | 1.35ms | 3.32x |
Analysis: Regular memory access pattern, same characteristics as dot product.
Scenario 8: Range Check
Task: Verify all 10M i32 values are in [0, 100).
Early exit on failure makes this fast when data is invalid. I kept all values in range to measure worst-case (full scan).
Scalar
```rust
pub fn all_in_range(data: &[i32], min: i32, max: i32) -> bool {
    data.iter().all(|&x| x >= min && x <= max)
}
```
std::simd
```rust
pub fn all_in_range(data: &[i32], min: i32, max: i32) -> bool {
    let min_vec = i32x8::splat(min);
    let max_vec = i32x8::splat(max);
    let mut chunks = data.chunks_exact(8);
    for chunk in &mut chunks {
        let v = i32x8::from_slice(chunk);
        let ge_min = v.simd_ge(min_vec);
        let le_max = v.simd_le(max_vec);
        if !(ge_min & le_max).all() {
            return false;
        }
    }
    // Scalar check of the tail
    chunks.remainder().iter().all(|&x| x >= min && x <= max)
}
```
NEON
```rust
pub fn all_in_range(data: &[i32], min: i32, max: i32) -> bool {
    let len = data.len();
    let simd_len = len - (len % 4);
    unsafe {
        let min_vec = vdupq_n_s32(min);
        let max_vec = vdupq_n_s32(max);
        for i in (0..simd_len).step_by(4) {
            let v = vld1q_s32(data.as_ptr().add(i));
            let ge_min = vcgeq_s32(v, min_vec);
            let le_max = vcleq_s32(v, max_vec);
            let both = vandq_u32(ge_min, le_max);
            // vminvq: if any lane is 0, the min is 0
            if vminvq_u32(both) == 0 {
                return false;
            }
        }
    }
    // Scalar check of the tail
    data[simd_len..].iter().all(|&x| x >= min && x <= max)
}
```
Results
| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 24.93ms | 1.0x |
| std::simd | 8.24ms | 3.02x |
| NEON | 9.33ms | 2.67x |
Analysis: Simple comparisons on contiguous i32 data. No type conversions needed.
Scenario 9: Sorted Check
Task: Verify 10M i32 values are sorted ascending.
I used pre-sorted data to measure full-scan performance. The windows() iterator is convenient but has overhead.
Scalar
```rust
pub fn is_sorted(data: &[i32]) -> bool {
    data.windows(2).all(|w| w[0] <= w[1])
}
```
std::simd
```rust
pub fn is_sorted(data: &[i32]) -> bool {
    if data.len() < 9 {
        return data.windows(2).all(|w| w[0] <= w[1]);
    }
    // windows(9) slides one element at a time, so each element
    // is loaded and compared up to eight times
    for window in data.windows(9) {
        let current = i32x8::from_slice(&window[0..8]);
        let next = i32x8::from_slice(&window[1..9]);
        if !current.simd_le(next).all() {
            return false;
        }
    }
    true
}
```
NEON
```rust
pub fn is_sorted(data: &[i32]) -> bool {
    let mut i = 0;
    unsafe {
        while i + 4 < data.len() {
            let current = vld1q_s32(data.as_ptr().add(i));
            let next = vld1q_s32(data.as_ptr().add(i + 1));
            let le = vcleq_s32(current, next);
            if vminvq_u32(le) == 0 {
                return false;
            }
            i += 4;
        }
    }
    // Check the remaining pairs with a scalar pass
    data[i..].windows(2).all(|w| w[0] <= w[1])
}
```
Results
| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 25.14ms | 1.0x |
| std::simd | 37.18ms | 0.68x |
| NEON | 6.87ms | 3.66x |
Analysis: windows(9) slides by a single element, so each element is loaded and compared roughly eight times and every overlapping window is materialized as a slice. The NEON version steps four elements at a time with direct pointer arithmetic.
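The overlap, not the portable API, looks like the real culprit here. A strided std::simd variant that mirrors the NEON stepping—my own sketch, not part of the benchmarked code:

```rust
use std::simd::prelude::*;

pub fn is_sorted_strided(data: &[i32]) -> bool {
    let mut i = 0;
    // Compare elements i..i+8 against i+1..i+9, then jump 8 ahead,
    // so every adjacent pair is checked exactly once.
    while i + 9 <= data.len() {
        let current = i32x8::from_slice(&data[i..i + 8]);
        let next = i32x8::from_slice(&data[i + 1..i + 9]);
        if !current.simd_le(next).all() {
            return false;
        }
        i += 8;
    }
    // Scalar pass over the remaining pairs
    data[i..].windows(2).all(|w| w[0] <= w[1])
}
```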
Root Cause Analysis
After digging into the assembly, I found a clear pattern: std::simd exposes only operations that exist across all target platforms, so ARM-specific instructions cannot be represented:
| Operation | std::simd | NEON |
|---|---|---|
| Deinterleave load | scalar loop | vld3q_u8 |
| Widen i16→i32 | scalar loop | vmovl_s16 |
| Saturating narrow i32→i16 | manual clamp | vqmovn_s32 |
| Halving add | add + shift + narrow | vhaddq_s16 |
When these operations are needed, std::simd falls back to scalar loops, negating SIMD benefits and adding overhead.
NEON Instruction Reference
Naming pattern:
```
vld3q_u8
││ ││ └─ u8: element type
││ │└─── q: 128-bit register
││ └──── 3: 3-way deinterleave
│└────── ld: load
└─────── v: vector
```
| Instruction | Operation |
|---|---|
| vld1q_u8 | Load 16 bytes |
| vld3q_u8 | Load + deinterleave RGB |
| vmull_u8 | Widening multiply u8→u16 |
| vmovl_s16 | Widen i16→i32 |
| vqmovn_s32 | Saturating narrow i32→i16 |
| vfmaq_f32 | Fused multiply-add |
| vhaddq_s16 | Halving add |
| vaddvq_f32 | Horizontal sum |
Recommendations
| Data Pattern | Approach |
|---|---|
| Contiguous f32/i32 arrays | std::simd |
| Interleaved data (RGB, stereo audio) | Platform intrinsics |
| Operations requiring type conversion | Platform intrinsics |
| Cross-platform library | std::simd + scalar fallback |
| Maximum ARM performance | NEON intrinsics |
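For the cross-platform row, the usual shape is compile-time dispatch over the implementations above. A minimal sketch—the `neon`, `portable`, and `scalar` module names and the `nightly-simd` cargo feature are hypothetical stand-ins, not from my benchmark repo:

```rust
// Each hypothetical module exports the same signatures shown earlier.
#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub use crate::neon::count_byte;

#[cfg(all(not(all(target_arch = "aarch64", target_feature = "neon")), feature = "nightly-simd"))]
pub use crate::portable::count_byte;

#[cfg(all(not(all(target_arch = "aarch64", target_feature = "neon")), not(feature = "nightly-simd")))]
pub use crate::scalar::count_byte;
```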
Conclusion
My takeaway: std::simd is great for numerical workloads on contiguous data—often better than hand-written intrinsics because the compiler knows optimization tricks I don't.
But for image and audio processing, std::simd falls short. The portable abstraction cannot express instructions like vld3q_u8 or vhaddq_s16 that ARM provides specifically for these workloads.
If we're targeting ARM and working with interleaved data, NEON intrinsics remain the way to go.
References
- Rust std::simd documentation - Portable SIMD module (nightly)
- Rust std::arch::aarch64 - ARM64 intrinsics in Rust
- ARM NEON Intrinsics Reference - Official ARM intrinsics search
- ARM NEON Programmer's Guide - NEON architecture overview
Source
Full source code: github.com/Erio-Harrison/simd_benchmark