Currently supported
This page is the authoritative record of what mlx-sparse implements, what is planned, and what is out of scope. Status is updated with each release.
Warning
mlx-sparse supports macOS and Linux. Linux support is CPU-only in this
release: CUDA and ROCm are not implemented, Metal is Apple-only, and Linux
builds do not use Accelerate, BLAS, or Sparse BLAS backends.
Current version: development branch
Sparse formats

| Feature | Status | Notes |
|---|---|---|
| | Done | Immutable frozen dataclass. Allows duplicates and unsorted coordinates. |
| | Done | Immutable frozen dataclass. |
| | Done | Immutable frozen dataclass. Column-compressed dual of CSR with |
| Block CSR (BCSR) | Planned | Internal storage format for block-structured matrices. |
| ELLPACK / SELL-C-σ | Research | Internal format for regular row lengths. No public API commitment. |
| Sparse tensors (rank > 2) | Not planned | MLX’s lazy graph requires output shapes at graph-build time. General sparse tensors have dynamic shapes and are out of scope for v0.x. |
Constructors

| Feature | Status | Notes |
|---|---|---|
| | Done | Accepts MLX arrays, NumPy arrays, or Python lists. |
| | Done | Same input flexibility. |
| | Done | Explicit CSC buffers with metadata or full validation. |
| | Done | Sparse identity or shifted-diagonal matrix. Returns canonical CSR. |
| | Done | One or more diagonals at specified offsets. Returns canonical CSR. |
| | Done | Native staged conversion with optional threshold for near-zeros. Counts on the active backend, synchronizes row counts to allocate compact output buffers, then fills CSR data natively. |
| | Done | PEP 8 and NumPy-oriented aliases for dense-to-CSR conversion. |
| | Done | Converts any SciPy sparse matrix/array to canonical CSR, CSC, or COO. |
| | Done | Extension smoke test / identity copy. |
| | Done | Returns |
| | Done | Converts existing sparse, SciPy sparse, dense MLX, NumPy, or Python rank-2 array-like inputs. Existing CSR/CSC inputs are preserved unless a dtype cast is requested; dense and SciPy inputs default to CSR. |
| | Done | Public |
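
The staged dense-to-CSR conversion described in the table (count, synchronize row counts, allocate compact buffers, fill) can be sketched in plain Python. This is an illustrative sketch only: `dense_to_csr` and its `threshold` parameter are hypothetical names for this example, not the mlx-sparse API, and the real conversion runs in native kernels.

```python
def dense_to_csr(dense, threshold=0.0):
    """Count / prefix-sum / fill conversion of a dense matrix (list of rows) to CSR."""
    # Stage 1: count the entries kept in each row (|v| > threshold).
    counts = [sum(1 for v in row if abs(v) > threshold) for row in dense]

    # Stage 2: an exclusive prefix sum over the counts gives the row pointers.
    # The total nnz is only known here, which is why a staged implementation
    # has to synchronize the counts before it can allocate the output.
    indptr = [0]
    for c in counts:
        indptr.append(indptr[-1] + c)

    # Stage 3: allocate compact buffers of exactly nnz entries, then fill them.
    nnz = indptr[-1]
    indices = [0] * nnz
    data = [0.0] * nnz
    for i, row in enumerate(dense):
        k = indptr[i]
        for j, v in enumerate(row):
            if abs(v) > threshold:
                indices[k] = j
                data[k] = v
                k += 1
    return data, indices, indptr


dense = [[1.0, 0.0, 2.0],
         [0.0, 0.0, 0.0],
         [0.0, 3.0, 1e-9]]
data, indices, indptr = dense_to_csr(dense, threshold=1e-6)
```

The near-zero entry `1e-9` is dropped by the threshold, and the empty middle row shows up as two equal consecutive row pointers.
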
Conversions and structural operations

| Feature | Status | Notes |
|---|---|---|
| | Done | Native primitive (CPU and Metal). Sorts by row then column. Preserves duplicates. |
| | Done | Sorts and sums duplicates. |
| | Done | Native |
| | Done | Sorts row indices within columns and sums duplicates. |
| | Done | Native |
| | Done | Native |
| | Done | Native primitive (CPU and Metal). Sums duplicate column entries. |
| | Done | Native column-wise materialization (CPU and Metal). Sums duplicate row entries. |
| | Done | Via |
| | Done | Module-level dispatch helper. |
| | Done | Native primitive (CPU and Metal). |
| | Done | Native staged primitive (CPU and Metal). Dynamic output size requires a row-count synchronization before compact output fill. |
| | Done | Combines |
| | Done | Native CSC sort, duplicate-sum, and canonicalization primitives over compressed columns. |
| | Done | Native primitive (CPU and Metal). Returns row-sorted CSRArray. |
| | Done | |
| | Done | Hermitian (conjugate) transpose. |
| | Done | |
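
To make "sorts by row then column" and "sums duplicates" concrete, here is a minimal pure-Python sketch of COO canonicalization followed by a row-pointer build. The helper names `canonicalize_coo` and `coo_rows_to_indptr` are hypothetical and chosen for this example; the library's own primitives are native and may differ in detail (for instance in how explicit zeros are treated).

```python
def canonicalize_coo(rows, cols, vals):
    """Sort COO triplets by (row, col) and sum entries that share a coordinate."""
    # Python's sort is stable, so equal coordinates keep their input order.
    order = sorted(range(len(vals)), key=lambda k: (rows[k], cols[k]))
    out_r, out_c, out_v = [], [], []
    for k in order:
        if out_r and out_r[-1] == rows[k] and out_c[-1] == cols[k]:
            out_v[-1] += vals[k]          # duplicate coordinate: accumulate
        else:
            out_r.append(rows[k])
            out_c.append(cols[k])
            out_v.append(vals[k])
    return out_r, out_c, out_v


def coo_rows_to_indptr(sorted_rows, n_rows):
    """Build CSR row pointers from row indices via count + prefix sum."""
    indptr = [0] * (n_rows + 1)
    for r in sorted_rows:
        indptr[r + 1] += 1
    for i in range(n_rows):
        indptr[i + 1] += indptr[i]
    return indptr


rows, cols, vals = [1, 0, 1, 0], [2, 1, 2, 0], [1.0, 2.0, 3.0, 4.0]
c_rows, c_cols, c_vals = canonicalize_coo(rows, cols, vals)
c_indptr = coo_rows_to_indptr(c_rows, n_rows=2)
```

The two entries at coordinate (1, 2) collapse into one, so the output has three stored values instead of four. The output length depends on the data, which is the reason such operations need a count synchronization before the compact fill.
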
Sparse-dense arithmetic

| Feature | Status | Notes |
|---|---|---|
| | Done | CPU and Metal GPU. Scalar row kernel plus vector-reduction kernel for long rows on Metal. |
| | Done | Native CSC kernels. Forward matvec is column scatter-add; transpose matvec is segmented column reduction. |
| | Done | CPU and Metal GPU. Scalar element kernel plus vector-reduction kernel for long rows on Metal. |
| Batched dense RHS ( | Done | RHS with |
| Sparse-sparse multiplication ( | Done | Native symbolic pass, prefix-sum allocation, and numeric pass returning canonical CSR. Dynamic output size requires host synchronization. |
| Sparse-sparse multiplication ( | Done | Native coordinate-row symbolic/count pass, prefix allocation, sorted numeric fill, and zero pruning returning canonical COO. |
| Sparse-sparse multiplication ( | Done | Native compressed-column symbolic/count pass, prefix allocation, sorted numeric fill, and zero pruning returning canonical CSC. |
| Scalar multiply ( | Done | Scales stored values for COO, CSR, and CSC inputs while preserving the sparse format and structural metadata. |
| Sparse-sparse addition | Not planned | Dynamic output size. May be added as a host-side utility. |
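
The difference between the two CSC product shapes mentioned above (column scatter-add for the forward product, segmented column reduction for the transpose product) is easiest to see in a small sketch. The functions below are hypothetical illustrations in plain Python, not mlx-sparse kernels.

```python
def csc_matvec(data, indices, indptr, x, n_rows):
    """y = A @ x for CSC storage: each column scatters into the output rows."""
    y = [0.0] * n_rows
    for j in range(len(indptr) - 1):
        xj = x[j]
        for k in range(indptr[j], indptr[j + 1]):
            y[indices[k]] += data[k] * xj     # scatter-add into row indices[k]
    return y


def csc_rmatvec(data, indices, indptr, x):
    """y = A.T @ x for CSC storage: one contiguous reduction per column."""
    return [
        sum(data[k] * x[indices[k]] for k in range(indptr[j], indptr[j + 1]))
        for j in range(len(indptr) - 1)
    ]


# A = [[1, 0, 2],
#      [0, 3, 0]] stored column by column.
data, indices, indptr = [1.0, 3.0, 2.0], [0, 1, 0], [0, 1, 2, 3]
y_forward = csc_matvec(data, indices, indptr, [1.0, 1.0, 1.0], n_rows=2)
y_transpose = csc_rmatvec(data, indices, indptr, [1.0, 2.0])
```

In the forward product several columns may write to the same output row, so a parallel version needs atomics or some other conflict handling. The transpose product writes each output element exactly once from a contiguous storage segment.
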
Sparse reductions

| Feature | Status | Notes |
|---|---|---|
| COO reductions | Done | Native row/column sums, row/column L2 norms, diagonal extraction, and trace. Sums and diagonal/trace operate directly on coordinates. Norms canonicalize first when duplicates may be present so the result matches dense semantics. Non- |
| CSR reductions | Done | Native row/column sums, row norms, diagonal, and trace. Storage-aligned row reductions and long diagonal segments use threadgroup reductions on Metal; large traces use a staged partial-reduction path. |
| CSC reductions | Done | Native row/column sums, row/column L2 norms, diagonal, and trace. Column sums and column norms are storage-aligned compressed-column reductions and are the fast path for CSC. Non- |
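
The note that COO norms canonicalize first while sums do not follows from linearity: a sum over duplicates is the same whether or not they are merged, but a sum of squares is not. A minimal sketch with hypothetical helper names, assuming nothing about the library's API:

```python
import math
from collections import defaultdict


def coo_row_sums(rows, vals, n_rows):
    """Row sums can be accumulated directly, duplicates included."""
    out = [0.0] * n_rows
    for r, v in zip(rows, vals):
        out[r] += v
    return out


def coo_row_norms(rows, cols, vals, n_rows):
    """Row L2 norms: merge duplicate coordinates before squaring."""
    merged = defaultdict(float)
    for r, c, v in zip(rows, cols, vals):
        merged[(r, c)] += v
    sq = [0.0] * n_rows
    for (r, _c), v in merged.items():
        sq[r] += v * v
    return [math.sqrt(s) for s in sq]


# Coordinate (0, 1) is stored twice; the dense value there is 3 + 1 = 4.
rows, cols, vals = [0, 0, 1], [1, 1, 0], [3.0, 1.0, 2.0]
sums = coo_row_sums(rows, vals, n_rows=2)
norms = coo_row_norms(rows, cols, vals, n_rows=2)
naive_row0 = math.sqrt(3.0 ** 2 + 1.0 ** 2)   # squaring before merging
```

Here the row-0 norm with dense semantics is 4, while squaring the stored values without merging would give √10.
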
Sparse linear algebra
For a solver-centric view of CPU, Metal GPU, and Accelerate coverage, see Solvers.
| Feature | Status | Backend | Notes |
|---|---|---|---|
| | Done | CPU + GPU | Full solver runs inside a single Metal kernel on GPU. |
| | Done | CPU + GPU | Each restart’s Arnoldi step dispatches the |
| | Done | CPU + GPU | Shifted Paige-Saunders recurrence runs in native CPU or Metal kernels. Diagonal/Jacobi preconditioners are supported when SPD. |
| | Done | CPU + GPU | Lanczos step dispatches |
| | Done | CPU + GPU | Arnoldi step dispatches |
| | Done | CPU + GPU | Dedicated normal-operator Lanczos step keeps |
| | Done | CPU only | Symbolic fill-in factorisation is inherently sequential. Planned GPU path via supernodal Cholesky is out of scope for v0.x. |
| | Done | CPU + GPU | LU factorisation (partial pivoting) runs on CPU. Triangular forward/back-substitution and permutation dispatch to Metal GPU via |
| | Done | CPU + GPU | Natural-order ILU(0) setup runs on CPU and preserves the canonical CSR sparsity pattern. Application uses native CSR triangular solves for rank-1 or rank-2 right-hand sides on CPU or Metal. |
| | Optional | CPU only | Accelerate-enabled Apple builds use opaque Accelerate direct solves for supported real |
| | Done | CPU + GPU | Native CSR row-merge reductions for |
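
Several rows above rely on CSR triangular solves (factorisation `solve` steps and ILU(0) application). As a reference point, forward substitution over a lower-triangular CSR matrix looks roughly like the sketch below. `csr_lower_solve` is a hypothetical name for this example, and the sketch is sequential, whereas the native kernels are not.

```python
def csr_lower_solve(data, indices, indptr, b):
    """Solve L x = b by forward substitution, with L lower triangular in CSR."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                acc -= data[k] * x[j]     # uses entries of x already solved
            elif j == i:
                diag = data[k]
        if not diag:
            raise ZeroDivisionError(f"missing or zero diagonal in row {i}")
        x[i] = acc / diag
    return x


# L = [[2, 0],
#      [1, 4]]
x = csr_lower_solve([2.0, 1.0, 4.0], [0, 0, 1], [0, 1, 3], [2.0, 9.0])
```

Row `i` depends on every earlier row it references, which is the data dependency a parallel triangular solve has to schedule around.
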
Linalg GPU coverage notes
Sparse linalg entrypoints accept CSR, COO, and CSC inputs. CSR is the execution
format for native kernels, so COO and CSC inputs are converted once to
canonical CSR at native solver entry. This keeps the existing Metal Krylov,
triangular solve, and permutation kernels active without doing repeated CSC
scatter-add matvecs inside solver iterations. Accelerate-enabled direct solves
instead validate and normalize real float32 CSR, COO, and CSC inputs into
canonical CSC storage because Apple’s sparse direct solvers are CSC-native.
The table above uses a simplified “CPU + GPU” label. The precise breakdown is:
- CG: the entire conjugate-gradient iteration (SpMV, dot products, vector updates) runs inside a single Metal threadgroup kernel. The GPU path is fully independent of the CPU.
- GMRES / MINRES / eigsh / eigs: the expensive Krylov-subspace step (Arnoldi or Lanczos, which accounts for most of the wall time at large n) runs on GPU via the csr_arnoldi or csr_lanczos Metal kernels. Post-processing (a small dense eigensolve or least-squares solve of size ≤ restart or ≤ ncv) runs on CPU. An mx.eval() synchronisation separates the two phases; at very small n (≲ 1 000) the synchronisation overhead can exceed the GPU savings.
- Cholesky / LU / ILU(0) factorisation: row-by-row elimination with fill-in or no-fill incomplete updates runs on CPU. The resulting triangular solve (SparseCholesky.solve, SparseLU.solve, spsolve, and preconditioners.ilu0 application) dispatches the csr_triangular_solve Metal kernel and the csr_permute_vector Metal kernel for the LU row-permutation step where a permutation is present.
- svds: uses a dedicated normal-operator Lanczos step for A.T @ (A @ x). The implementation does not materialize A.T @ A and does not split the recurrence into Python-level sparse products. On Metal, the two sparse products are fused inside the native Lanczos step; the small tridiagonal eigensolve, Ritz-vector back transformation, and final singular vector assembly remain CPU work.
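
For readers unfamiliar with what "the entire conjugate-gradient iteration" bundles together, the loop below spells out the three ingredients (SpMV, dot products, vector updates) in plain Python. It is a textbook unpreconditioned CG written for this page as an illustration; the function names and the `tol`/`maxiter` parameters are assumptions, not the library's interface.

```python
def csr_matvec(data, indices, indptr, x):
    """y = A @ x for CSR storage."""
    return [
        sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i + 1]))
        for i in range(len(indptr) - 1)
    ]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cg(data, indices, indptr, b, tol=1e-10, maxiter=100):
    """Textbook conjugate gradient for a symmetric positive-definite CSR matrix."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A @ x with x = 0
    p = list(r)                 # initial search direction
    rs = dot(r, r)
    for _ in range(maxiter):
        if rs ** 0.5 <= tol:
            break
        Ap = csr_matvec(data, indices, indptr, p)                  # SpMV
        alpha = rs / dot(p, Ap)                                    # dot product
        x = [xi + alpha * pi for xi, pi in zip(x, p)]              # vector updates
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# A = [[4, 1],
#      [1, 3]], b = [1, 2]; the exact solution is [1/11, 7/11].
x = cg([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4], [1.0, 2.0])
```

Every step in the loop depends on the previous one, so running the whole iteration in one kernel avoids a host round trip per step.
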
Automatic differentiation

| Feature | Status | Notes |
|---|---|---|
| VJP w.r.t. dense | Done | Dispatches native transpose matvec. |
| JVP w.r.t. dense | Done | Reuses forward |
| VJP w.r.t. dense | Done | Dispatches native transpose matmul. |
| JVP w.r.t. dense | Done | Reuses forward |
| VJP/JVP w.r.t. sparse values ( | Done | Fixed-output data-gradient primitives for matvec and matmul on CPU and Metal GPU. |
| Complex autodiff | Done | |
| VJP/JVP w.r.t. | Not planned | Structural parameters are not differentiable variables. |
| | Done | Batched dense RHS uses native batched sparse-dense kernels. |
| VJP/JVP through batched dense RHS | Done | Native batched matvec/matmul primitives support sparse-value and dense-RHS differentiation. |
| VJP/JVP through sparse-sparse | Not planned for v0.1 | Output topology is data-dependent and returned as a sparse container. Fixed-output sparse-dense products are the differentiable path. |
| | Not planned | Batch of sparse matrices is an unusual use case. Deferred. |
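
The two matvec gradients in the table have simple closed forms. For y = A @ x with upstream cotangent g, the gradient with respect to the dense vector is A.T @ g (a transpose matvec), and the gradient with respect to each stored value A[i, j] is g[i] * x[j]. The second has the same length as the stored values, which is what makes it a fixed-output operation. A plain-Python sketch with a hypothetical function name:

```python
def csr_matvec_vjp(data, indices, indptr, x, g):
    """VJP of y = A @ x for CSR A, given the upstream cotangent g.

    Returns (grad_x, grad_data):
      grad_x    = A.T @ g, a transpose matvec
      grad_data = one value per stored entry, g[row] * x[col]
    """
    grad_x = [0.0] * len(x)
    grad_data = [0.0] * len(data)       # same length as data: fixed output size
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            grad_x[j] += data[k] * g[i]
            grad_data[k] = g[i] * x[j]
    return grad_x, grad_data


# A = [[1, 0, 2],
#      [0, 3, 0]]
data, indices, indptr = [1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
grad_x, grad_data = csr_matvec_vjp(data, indices, indptr,
                                   x=[1.0, 2.0, 3.0], g=[1.0, 10.0])
```

The index arrays never receive a gradient, consistent with structural parameters not being differentiable variables.
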
Metal GPU kernel coverage
Most sparse primitives cover the full value and index dtype matrix. A few
linalg kernels are intentionally float32-only, and dynamic-output
structural primitives synchronize counts or output structure before allocating
compact buffers.
| Kernel | Status | Notes |
|---|---|---|
| | All value and index dtypes | Scalar row kernel plus threadgroup vector reduction for long rows |
| | All value and index dtypes | Native coordinate scatter products. |
| | All value and index dtypes | Native batched coordinate scatter kernels |
| | All value and index dtypes | Fixed-output sparse-value VJP over explicit coordinates |
| | All value and index dtypes | Native batched dense-vector RHS kernel |
| | All value and index dtypes | Fixed-output sparse-value VJP primitive |
| | All value and index dtypes | |
| | All value and index dtypes | Forward |
| | All value and index dtypes | Forward |
| | All value and index dtypes | Native batched compressed-column dense RHS kernels |
| COO/CSC reductions | All value and index dtypes | Storage-aligned reductions use scalar or threadgroup vector kernels. Scatter reductions use |
| | All value and index dtypes | Fixed-output sparse-value VJP over compressed columns |
| | All value and index dtypes | Scalar element kernel plus threadgroup vector reduction for long rows |
| | All value and index dtypes | Native batched dense-matrix RHS kernel |
| | All value and index dtypes | Fixed-output sparse-value VJP primitive |
| | All value and index dtypes | |
| | All value and index dtypes | Fixed-output materialization kernel |
| | All value and index dtypes | Rank-based stable sort plus indptr build |
| | All value and index dtypes | Rank-based stable column-major sort plus indptr build |
| | All value and index dtypes | Parallel count/prefix plus deterministic fill |
| | All value and index dtypes | Native count/prefix/fill conversions. GPU fill uses atomic offsets and does not promise sorted output; call |
| | All value and index dtypes | Parallel zero-fill plus column-wise materialization |
| | All value and index dtypes | Rank-based stable per-row sort |
| | All value and index dtypes | Rank-based stable per-column sort |
| | | Full CG iteration for |
| | | Krylov step for |
| | | Krylov step for |
| | | Forward/back-substitution for |
| | | Row permutation step in |
| | | Sparse Frobenius inner products with explicit complex conjugation semantics |
| | All value and index dtypes | Staged count/prefix/fill primitive; dynamic output size requires row-count synchronization |
| | All value and index dtypes | Staged per-column count/prefix/fill primitive; dynamic output size requires column-count synchronization |
| | All value and index dtypes | Staged count/prefix/fill dense-to-CSR conversion |
| | All value and index dtypes | Optimized host path by default; experimental staged Metal path behind |
| | All value and index dtypes | Optimized host path by default; experimental staged Metal path behind |
| | All value and index dtypes | Optimized host path by default; experimental staged Metal path behind |
Known limitations

- GPU availability depends on the MLX and macOS Metal runtime.
- Dynamic-output helpers (fromdense(), canonicalize(), dense/SciPy construction, and sparse-sparse matmat) synchronize compact counts or structure to host before allocating final output buffers.
- CSC currently covers construction, conversion, canonicalization, dense materialization, reductions, dense vector/matrix products including batched dense RHS, same-format sparse-sparse matmul, one-time conversion at native linalg solver entry, and canonical CSC normalization for Accelerate-enabled opaque direct solves.
- Sparse solver, factorization, and spectral kernels are real-valued. float16 and bfloat16 inputs are promoted to float32 before solver dispatch. Sparse dot/vdot support complex64.
- Full validation (validate="full") may trigger host synchronization.
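
The synchronization requirement for dynamic-output helpers comes from not knowing the output size until a counting pass has run. The sketch below shows the generic symbolic / prefix-sum / numeric structure for a CSR-by-CSR product in plain Python. It is an illustration under stated assumptions: `csr_matmat` is a hypothetical name, and this version does not prune numerically zero results.

```python
def csr_matmat(a, b):
    """C = A @ B for CSR operands given as (data, indices, indptr) tuples."""
    a_data, a_idx, a_ptr = a
    b_data, b_idx, b_ptr = b
    n_rows = len(a_ptr) - 1

    # Symbolic pass: find the column pattern of every output row.
    patterns = []
    for i in range(n_rows):
        cols = set()
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j = a_idx[k]
            cols.update(b_idx[b_ptr[j]:b_ptr[j + 1]])
        patterns.append(sorted(cols))

    # Prefix sum over row lengths: only now is the output nnz known,
    # so only now can the compact output buffers be allocated.
    indptr = [0]
    for pattern in patterns:
        indptr.append(indptr[-1] + len(pattern))
    indices = [c for pattern in patterns for c in pattern]
    data = [0.0] * indptr[-1]

    # Numeric pass: accumulate products into the preallocated slots.
    for i in range(n_rows):
        slot = {c: indptr[i] + t for t, c in enumerate(patterns[i])}
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j = a_idx[k]
            for m in range(b_ptr[j], b_ptr[j + 1]):
                data[slot[b_idx[m]]] += a_data[k] * b_data[m]
    return data, indices, indptr


# A = [[1, 2], [0, 3]], B = [[0, 1], [4, 0]], so A @ B = [[8, 1], [12, 0]].
A = ([1.0, 2.0, 3.0], [0, 1, 1], [0, 2, 3])
B = ([1.0, 4.0], [1, 0], [0, 1, 2])
c_data, c_indices, c_indptr = csr_matmat(A, B)
```

In a lazily evaluated graph the prefix-sum total has to be read back on the host before the output arrays can be given a concrete shape, which is the synchronization point this section refers to.
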