CSR matvec (SpMV)

csr_matvec computes y = A @ x where A is a CSRArray and x is a rank-1 dense vector. On Apple Silicon the Metal backend dispatches:

A scalar row kernel for short rows (one thread per row).
A threadgroup vector-reduction kernel for long rows, selected automatically from the known nnz / n_rows ratio without any host synchronisation.
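The two strategies above can be emulated on the CPU to see how they differ. The following is a minimal NumPy sketch, not the Metal code: the function names, `SIMD_WIDTH`, and `LONG_ROW_THRESHOLD` are all hypothetical, and the real cutoff used by the library is not stated here. It only illustrates that the mean row length `nnz / n_rows` is available from array shapes on the host, so picking a kernel needs no readback from the device.

```python
import numpy as np

SIMD_WIDTH = 32            # assumed lane count for the emulation only
LONG_ROW_THRESHOLD = 32.0  # hypothetical cutoff, not the library's actual value


def row_scalar(data, indices, indptr, x, row):
    """Scalar strategy: one 'thread' walks the whole row sequentially."""
    acc = 0.0
    for k in range(indptr[row], indptr[row + 1]):
        acc += data[k] * x[indices[k]]
    return acc


def row_vector(data, indices, indptr, x, row, width=SIMD_WIDTH):
    """Vector strategy: each lane sums a strided slice, then partials are reduced."""
    start, end = indptr[row], indptr[row + 1]
    partial = np.zeros(width)
    for lane in range(width):
        ks = np.arange(start + lane, end, width)  # entries handled by this lane
        partial[lane] = np.sum(data[ks] * x[indices[ks]])
    return partial.sum()


def csr_matvec_reference(data, indices, indptr, x):
    n_rows = len(indptr) - 1
    nnz = len(data)
    # Both quantities come from array shapes, so the choice is made on the host.
    mean_row_len = nnz / max(n_rows, 1)
    kernel = row_vector if mean_row_len >= LONG_ROW_THRESHOLD else row_scalar
    y = np.array([kernel(data, indices, indptr, x, r) for r in range(n_rows)])
    return y, kernel.__name__


# 3x4 matrix [[1, 0, 2, 0], [0, 0, 0, 0], [0, 3, 0, 4]] in CSR form
data = np.array([1.0, 2.0, 3.0, 4.0])
indices = np.array([0, 2, 1, 3])
indptr = np.array([0, 2, 2, 4])
x = np.array([1.0, 10.0, 100.0, 1000.0])

y, used = csr_matvec_reference(data, indices, indptr, x)
print(used, y)
```

With a mean of 4/3 entries per row, the sketch picks the scalar path and `y` equals `[201, 0, 4030]`. Both strategies give the same result for any row; they differ only in how the work within a row is split.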

All four value dtypes (float32, float16, bfloat16, complex64) and both index dtypes (int32, int64) are supported on CPU and GPU.

import mlx.core as mx
import numpy as np
import mlx_sparse as ms

ms.use_cpu()
print("Native extension:", ms.is_available())

Native extension: True

Build a medium-sized random sparse matrix

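We use mlx-sparse's native random generator with a fixed seed (rng=42) to build a reproducible CSR matrix. The NumPy generator rng created here is used later to draw the dense vector x.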

rng = np.random.default_rng(42)
A = ms.random.rand(
    4096, 4096, density=0.00025, format="csr",
    dtype=mx.float32, rng=42, index_dtype=mx.int32,
)
print(A)
print(f"density: {A.nnz / (A.shape[0] * A.shape[1]) * 100:.3f}%")

CSRArray(shape=(4096, 4096), nnz=4194, dtype=mlx.core.float32, index_dtype=mlx.core.int32, sorted_indices=True, has_canonical_format=True)
density: 0.025%

Correctness check against a dense reference

x_np = rng.standard_normal(4096).astype(np.float32)
x = mx.array(x_np)

y_ms = A @ x
mx.eval(y_ms)

y_ref = np.array(A.todense()) @ x_np
err = np.max(np.abs(np.array(y_ms) - y_ref))
print(f"max absolute error vs dense reference: {err:.2e}")
assert err < 1e-4, "Results diverge!"
print("Results match within float32 tolerance.")

max absolute error vs dense reference: 2.38e-07
Results match within float32 tolerance.
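The dense reference materialises an n x n array, which is fine at n = 4096 but grows quadratically with n. One possible alternative, sketched here in plain NumPy on a hand-built CSR triple rather than on a CSRArray, expands indptr into per-entry row ids and accumulates with np.bincount. The helper name csr_matvec_numpy is hypothetical, and the sketch handles real dtypes only, since np.bincount does not accept complex weights.

```python
import numpy as np


def csr_matvec_numpy(data, indices, indptr, x):
    """Reference y = A @ x from the CSR triple, without building a dense A."""
    n_rows = len(indptr) - 1
    # Row id of every stored entry: row r repeats (indptr[r+1] - indptr[r]) times.
    rows = np.repeat(np.arange(n_rows), np.diff(indptr))
    products = data * x[indices]
    # bincount sums the weights per row id; minlength keeps trailing empty rows.
    return np.bincount(rows, weights=products, minlength=n_rows)


# Small random matrix converted to CSR by hand (seeded for reproducibility).
demo_rng = np.random.default_rng(0)
n_rows, n_cols = 6, 5
mask = demo_rng.random((n_rows, n_cols)) < 0.3
dense = demo_rng.standard_normal((n_rows, n_cols)) * mask
row_idx, col_idx = np.nonzero(dense)  # row-major order, so rows come out sorted
values = dense[row_idx, col_idx]
row_ptr = np.concatenate(([0], np.cumsum(np.bincount(row_idx, minlength=n_rows))))
x_demo = demo_rng.standard_normal(n_cols)

y_demo = csr_matvec_numpy(values, col_idx, row_ptr, x_demo)
assert np.allclose(y_demo, dense @ x_demo)
```

To apply this to a CSRArray, one would presumably first convert A.data, A.indices, and A.indptr to NumPy arrays; that conversion is not shown here.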

All value dtypes work on GPU

As of v0.0.1b0, the Metal backend supports float32, float16, bfloat16, and complex64, all with int32 or int64 indices.

for name, mlx_dtype in [
    ("float32",   mx.float32),
    ("float16",   mx.float16),
    ("bfloat16",  mx.bfloat16),
    ("complex64", mx.complex64),
]:
    # Reuse A's sparsity pattern; only the value dtype changes.
    A_typed = ms.csr_array(
        (A.data.astype(mlx_dtype), A.indices, A.indptr),
        shape=A.shape, sorted_indices=True, canonical=True,
    )
    x_typed = x.astype(mlx_dtype)
    y_typed = A_typed @ x_typed
    mx.eval(y_typed)
    print(f"{name:<9} -> y.dtype = {y_typed.dtype}, shape {y_typed.shape}")

float32   -> y.dtype = mlx.core.float32, shape (4096,)
float16   -> y.dtype = mlx.core.float16, shape (4096,)
bfloat16  -> y.dtype = mlx.core.bfloat16, shape (4096,)
complex64 -> y.dtype = mlx.core.complex64, shape (4096,)

Timing: sparse vs dense on M5

We compare csr_matvec to mx.matmul (dense) at increasing matrix sizes. Timings are the median of 50 iterations after 5 warmup rounds.

Environment: Apple M5, 10-core GPU, macOS 26.0, MLX 0.31, mlx-sparse 0.0.1b0

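import statistics
import time

def bench(fn, warmup=5, iters=50):
    """Return the median wall-clock time of fn in milliseconds."""
    # Warm up so one-time costs are excluded from the measurement.
    for _ in range(warmup):
        mx.eval(fn())
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        mx.eval(fn())  # MLX is lazy: force evaluation before stopping the clock
        times.append(time.perf_counter() - t0)
    return statistics.median(times) * 1000  # ms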

print(f"{'shape':<15} {'nnz':<8} {'density':<9} {'sparse_ms':<11} {'dense_ms':<11} {'speedup'}")

for n, density in [(4096, 0.00025), (8192, 0.0001), (16384, 0.00003), (32768, 0.00001)]:
    A_b = ms.random.rand(
        n, n, density=density, format="csr",
        dtype=mx.float32, rng=0, index_dtype=mx.int32,
    )
    x_b = mx.array(np.random.randn(n).astype(np.float32))
    dense_b = A_b.todense()
    mx.eval(dense_b)

    t_sp = bench(lambda: A_b @ x_b)
    t_dn = bench(lambda: dense_b @ x_b)
    print(f"({n},{n}){' '*(12 - len(str(n))*2)} {A_b.nnz:<8} {density*100:.3f}%    {t_sp:.3f} ms   {t_dn:.3f} ms   {t_dn/t_sp:.1f}x")

shape           nnz      density   sparse_ms   dense_ms    speedup
(4096,4096)     4194     0.025%    0.122 ms   0.897 ms   7.4x
(8192,8192)     6711     0.010%    0.103 ms   3.843 ms   37.4x
(16384,16384)   8053     0.003%    0.128 ms   15.111 ms   118.5x
(32768,32768)   10737    0.001%    0.146 ms   694.919 ms   4748.9x

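Key insight: sparse matvec touches only the stored non-zero entries, so its cost scales roughly as O(nnz + n), while dense matvec costs O(n²) regardless of sparsity. In the table above, the sparse time stays between 0.10 and 0.15 ms as n grows from 4096 to 32768, whereas the dense time grows from about 0.9 ms to about 695 ms. Sparse is already 7.4x faster at 0.025% density, and the gap widens to several thousand times at 0.001%.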
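The same gap shows up in storage. As a rough back-of-the-envelope sketch, assuming float32 values and int32 indices as in the benchmark, the following compares the bytes a dense array needs with the bytes of the three CSR arrays (data, indices, indptr). The helper names dense_bytes and csr_bytes are hypothetical, and the (n, nnz) pairs are copied from the timing table.

```python
def dense_bytes(n, itemsize=4):
    """Bytes for an n x n dense array with the given element size."""
    return n * n * itemsize


def csr_bytes(n_rows, nnz, value_itemsize=4, index_itemsize=4):
    """Bytes for data (nnz values), indices (nnz ids) and indptr (n_rows + 1 offsets)."""
    return nnz * value_itemsize + nnz * index_itemsize + (n_rows + 1) * index_itemsize


# (n, nnz) pairs taken from the timing table
for n, nnz in [(4096, 4194), (8192, 6711), (16384, 8053), (32768, 10737)]:
    d, s = dense_bytes(n), csr_bytes(n, nnz)
    print(f"n={n:<6} dense={d / 2**20:6.0f} MiB  csr={s / 2**10:6.1f} KiB  ratio={d / s:,.0f}x")
```

The dense array grows from 64 MiB at n = 4096 to 4 GiB at n = 32768, while the CSR arrays stay in the tens to hundreds of KiB. The dense timing at n = 32768 also rises faster than n² would predict; the 4 GiB footprint may contribute to that, though this was not measured here.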