I spent a chunk of last month trying to run a 30B-class model locally on my M2 Max. 64GB of unified memory, a stack of GPU cores, no other apps running. Should be smooth. Instead I got around 3 tokens per second, a fan that sounded like a leaf blower, and the slow creeping suspicion that I was holding it wrong.
If you've tried serious local inference on Apple Silicon, you've probably hit this. The hardware is genuinely capable. The software stack often isn't — or rather, the generic software stack isn't. This came back into focus for me when antirez (yes, the Redis guy) posted ds4, a from-scratch Metal inference engine targeting DeepSeek. The README is pretty explicit that it's a focused, learning-oriented project rather than a general framework, but seeing it made me want to write up why the focused approach keeps winning on Apple Silicon, and what you can do about slow local inference today.
The root cause: it's bandwidth, not FLOPS
Here's the thing nobody tells you when you start: during token-by-token decoding, an LLM is almost entirely memory-bandwidth-bound, not compute-bound. Every generated token requires streaming the full set of weights (or at least every weight touched by that forward pass) from memory through the GPU, plus reading and writing the KV cache.
A quick napkin calculation. Say you have a 7B parameter model in 4-bit quantization. That's roughly 4GB of weights. To generate one token, you read all 4GB once. If your effective memory bandwidth to the GPU is around 200 GB/s (well under the theoretical peak on M-series Max chips, but realistic for many workloads), the floor on per-token latency is:
4 GB / 200 GB/s = 20 ms per token => ~50 tokens/sec ceiling

If you're getting 3 tokens/sec, you're not bandwidth-limited. You're losing time somewhere else in the stack. The questions are: where, and why.
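If you want to plug your own numbers into that, the whole estimate fits in a few lines of C. A sketch only; the function name is mine and the figures are the same illustrative ones as above, not measurements:
#include <stdio.h>

// Ceiling on decode throughput if streaming the packed weights were the only cost.
static double tokens_per_sec_ceiling(double weights_gb, double bandwidth_gbps) {
    return bandwidth_gbps / weights_gb;  // 1 / (weights_gb / bandwidth_gbps)
}

int main(void) {
    // ~4 GB of 4-bit weights, ~200 GB/s effective bandwidth => ~50 tok/s
    printf("ceiling: %.0f tokens/sec\n", tokens_per_sec_ceiling(4.0, 200.0));
    return 0;
}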
Where time actually goes
When I profiled my run with Instruments and the Metal System Trace template, three things jumped out:
- Dozens of tiny kernel dispatches per generated token, with the GPU idling while the CPU encoded the next one.
- The KV cache being reallocated and blit-copied on every single step.
- Far more memory traffic per token than the packed model size, because dequantized weights were being written back out to memory.
Generic frameworks make these mistakes because they're trying to be everything to everyone. A focused inference engine for one model family can hardcode the right answers.
Step 1: fuse your kernels
The single biggest win is fusing the small operations in each transformer block into one or two big kernels. Here's the pattern I converged on, in pseudocode:
// Fused: RMSNorm -> Q/K/V projection -> RoPE
// Avoids three separate dispatches and two round trips through memory.
kernel void attn_qkv_rope(
device const half* x [[buffer(0)]], // input activations
device const uint8_t* w_qkv [[buffer(1)]], // 4-bit packed weights
device const half* scales [[buffer(2)]], // per-group scales
device half* q_out [[buffer(3)]],
device half* k_out [[buffer(4)]],
device half* v_out [[buffer(5)]],
constant Params& p [[buffer(6)]],
uint tid [[thread_position_in_grid]]) {
// 1) RMSNorm in-register, no temp buffer back to global mem
float norm = rms_norm_inline(x, tid, p);
// 2) Dequant + GEMV in the same pass: each weight tile is
// unpacked into registers and immediately consumed.
half3 qkv = dequant_gemv_q4(w_qkv, scales, norm, tid, p);
// 3) Apply rotary embeddings before the write-out.
apply_rope(qkv, tid, p);
write_split(q_out, k_out, v_out, qkv, tid);
}

Key idea: weights stay packed in 4-bit form in memory. They're unpacked into registers inside the kernel and consumed immediately. You never write a dequantized copy back to global memory. The matmul reads the small representation; the math happens on the wider one inside the SIMD units.
That single change took my throughput on a small 7B model from "painful" to "actually usable." Your numbers will vary — but the principle holds for any chip with a memory wall.
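If the Metal is hard to picture, here is the same data flow as a plain scalar C reference for one output element. The packing scheme (two 4-bit weights per byte, one scale per group of 32, values biased by 8) is an assumption for illustration, not necessarily what your quantizer emits; the point is that the unpacked weight only ever lives in a register-sized local, never in a buffer.
#include <stddef.h>
#include <stdint.h>

// Scalar reference of a group-quantized Q4 dot product for one output row.
// Assumed layout: two 4-bit weights per byte, one float scale per 32 weights.
static float dot_q4_row(const uint8_t* packed,  // n/2 bytes for this row
                        const float* scales,    // n/32 scales for this row
                        const float* x,         // input activations, length n
                        size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        uint8_t byte = packed[i / 2];
        int w = (i & 1) ? (byte >> 4) : (byte & 0x0F);  // unpack into a register
        float wf = (float)(w - 8) * scales[i / 32];     // dequant and consume immediately
        acc += wf * x[i];
    }
    return acc;
}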
Step 2: stop reallocating the KV cache
This one bit me hard. A naive implementation grows the KV tensor by allocating a bigger buffer each step and copying. On Metal that means a MTLBlitCommandEncoder round trip for every token. Don't do this.
Preallocate once, write in place:
// Preallocate KV for max_seq_len at startup.
// Writes are O(1) per token; no resize, no copy.
#include <stddef.h>   // size_t
#include <string.h>   // memcpy
typedef __fp16 half;  // host-side 16-bit float; available on Apple clang
typedef struct {
    half* k;       // [n_layers][max_seq][n_kv_heads][head_dim]
    half* v;
    int capacity;  // max_seq_len
    int length;    // current logical length; the caller bumps this once per token
} kv_cache_t;
static inline void kv_append(kv_cache_t* c,
const half* k_new,
const half* v_new,
int layer, int n_kv, int head_dim) {
// Just write to the next slot; no allocation.
size_t off = ((size_t)layer * c->capacity + c->length) * n_kv * head_dim;
memcpy(c->k + off, k_new, n_kv * head_dim * sizeof(half));
memcpy(c->v + off, v_new, n_kv * head_dim * sizeof(half));
}

If you want to support eviction or sliding windows later, add it as a logical layer on top. Keep the hot path branch-free.
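For completeness, here is roughly what the one-time allocation and the per-token advance look like on the host side, assuming the kv_cache_t above. kv_cache_init and the loop are illustrative, not lifted from ds4 or any other engine:
#include <stdlib.h>

// One allocation per tensor at startup, sized for the worst case, so the
// decode loop never touches the allocator.
static int kv_cache_init(kv_cache_t* c, int n_layers, int max_seq,
                         int n_kv_heads, int head_dim) {
    size_t elems = (size_t)n_layers * max_seq * n_kv_heads * head_dim;
    c->k = malloc(elems * sizeof(half));
    c->v = malloc(elems * sizeof(half));
    c->capacity = max_seq;
    c->length = 0;
    return (c->k && c->v) ? 0 : -1;
}

// Per-token decode loop: append for every layer, then advance the logical
// length exactly once.
//
//   for (int layer = 0; layer < n_layers; layer++)
//       kv_append(&cache, k_new[layer], v_new[layer], layer, n_kv_heads, head_dim);
//   cache.length++;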
Step 3: pick the right quantization for your hardware
Not all 4-bit schemes are equal on Metal. Group-wise quantization with a small group size (32 or 64) usually unpacks cleanly in SIMD lanes and plays nicely with the threadgroup memory you have. Block-wise schemes with larger groups save more on the scale-table side but can stall on misaligned reads.
My rough rule of thumb after migrating a few projects:
- Q4 with group size 32: best balance for M-series; fast unpack, good quality.
- Q5/Q6: noticeable quality bump, but you're trading away bandwidth — only worth it if you're already CPU-bound on dispatch.
- Q8: simple, accurate, but uses 2x the bandwidth of Q4 for a marginal quality gain. Use it to debug quantization issues, not in production.
This is the kind of tradeoff a focused engine bakes in; a generic one usually exposes all of them and lets you pick the wrong one.
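The bandwidth half of those tradeoffs is easy to sanity-check with effective bits-per-weight arithmetic. A sketch, assuming one 16-bit scale per group (a common layout, but check what your format actually stores):
// Effective storage cost per weight for a group-quantized format where each
// group of group_size weights shares one 16-bit scale.
static double bits_per_weight(int quant_bits, int group_size) {
    return quant_bits + 16.0 / group_size;
}

// Q4, group 32:   4 + 16/32  = 4.5 bits per weight
// Q4, group 128:  4 + 16/128 = 4.125 bits per weight (smaller scale table, clunkier unpack)
// Q8, group 32:   8 + 16/32  = 8.5 bits per weight -> roughly 2x the traffic of Q4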
Prevention: profile before you optimize
Before you touch a single kernel, open Instruments with Metal System Trace and look at the timeline. You're looking for:
- Long gaps between command buffer commits (CPU bottleneck — your encoding loop is too chatty).
- Many tiny encoders inside one buffer (kernel fusion opportunity).
- High occupancy but low achieved bandwidth (unaligned reads or scalar paths in your kernels).
- Memory traffic that exceeds your model size per token (you're materializing dequantized weights — fix that first). A rough baseline for what expected traffic looks like is sketched after this list.
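For that last check, it helps to know what per-token traffic should roughly look like: the packed weights once, plus the K and V caches for the current context. A sketch only; it ignores activations and anything attention-specific beyond the cache reads:
// Rough expected read traffic per generated token: the packed weights once,
// plus the K and V caches for the current context length (half = 2 bytes).
// If the trace shows a lot more than this, something dequantized is being
// written out and read back.
static double expected_gb_per_token(double packed_weights_gb,
                                    int n_layers, int context_len,
                                    int n_kv_heads, int head_dim) {
    double kv_gb = 2.0 * n_layers * context_len
                 * (double)n_kv_heads * head_dim * 2.0 / 1e9;  // K and V, 2 bytes each
    return packed_weights_gb + kv_gb;
}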
Apple's Metal Performance HUD and the official Metal Shading Language spec are your friends here. So is reading focused, single-model engines like ds4 — they tend to make the design choices explicit instead of hiding them behind abstraction.
The takeaway
Local inference on Apple Silicon isn't slow because the hardware is bad. It's slow when generic frameworks impose generic abstractions on a workload that punishes them. Fuse your kernels, keep weights packed, preallocate your KV cache, pick a quantization that maps well to SIMD, and profile before you guess. You'll get most of the way to what a hand-tuned engine achieves — and you'll understand your stack a lot better when something inevitably regresses.
