Authon Blog
debugging · 6 min read

How to Run a 400B Parameter LLM on a Phone (Yes, Really)

A 400B LLM ran on an iPhone 17 Pro. Here's how flash offloading and aggressive quantization make the impossible possible.

Alan West
Authon Team

A few days ago, a demo started making the rounds showing an iPhone 17 Pro running a 400B parameter large language model. Not a cloud API call. Not a clever proxy. An actual 400B model doing inference on the device.

My first reaction was the same as yours: "That's impossible. Where does the memory come from?"

Turns out, it's not impossible — it's just really, really clever engineering. And the techniques behind it are worth understanding, because they solve a problem that's about to hit a lot of us: how do you run models that are way bigger than your available RAM?

The Core Problem: Your Model Doesn't Fit

Let's do some napkin math. A 400B parameter model in FP16 needs roughly 800GB of memory. Even at 4-bit quantization, you're looking at around 200GB. The iPhone 17 Pro has 12GB of RAM.

So we're off by a factor of ~16x. That's not a rounding error. That's a "go home, this is impossible" gap.

Except it's not. The trick is that you don't need the entire model in memory at once. You only need the layers that are actively computing. Everything else can live on storage and get swapped in on demand.

The Solution: Layer-by-Layer Streaming From Flash

The technique that makes this work is sometimes called flash offloading or layer streaming. Here's the concept:

```python
# Pseudocode for layer-streaming inference
hidden_state = embed(input_tokens)  # activations are small; only weights stream

for layer in model.layers:
    # Load only this layer's weights from flash/SSD into RAM
    weights = load_from_storage(layer.weight_path)

    # Run the forward pass for just this layer
    hidden_state = layer.forward(hidden_state, weights)

    # Free the memory immediately; we're done with this layer
    del weights
    release_memory()
```

Instead of loading 200GB into memory, you load a gigabyte or two at a time (one transformer layer, or a slice of one), compute, discard, and move to the next. The total memory footprint stays roughly constant regardless of model size.
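Continuing the napkin math: with an assumed layer count (400B-class models have on the order of 100-130 transformer layers; 120 here is a placeholder, not any real model's spec), the per-layer streaming footprint is easy to estimate:

```python
# Back-of-envelope: how much weight data streams per layer.
# The layer count is an assumption for illustration.
params = 400e9    # 400B parameters
bits = 4          # 4-bit quantization
layers = 120      # assumed transformer layer count

total_gb = params * bits / 8 / 1e9   # 200 GB of quantized weights
per_layer_gb = total_gb / layers
print(f"{per_layer_gb:.2f} GB per layer")  # → 1.67 GB per layer
```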

The catch? It's slow if your storage is slow. This is where modern hardware actually helps.

Why This Works Now (And Didn't Before)

The iPhone 17 Pro's NVMe flash storage can push sequential read speeds north of 6-7 GB/s. That changes the math dramatically:

  • 200GB model at 6 GB/s = ~33 seconds to stream the full model for one forward pass
  • With 120+ transformer layers, each layer streams independently
  • Prefetching the next layer while computing the current one hides most of the latency
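That "prefetch the next layer while computing the current one" step is classic double buffering. A minimal sketch of the idea, with `load_from_storage` and the layer objects standing in for real storage I/O and real compute (both are stand-ins, not a real API):

```python
# Double-buffered layer streaming: while layer N computes, a background
# thread is already reading layer N+1's weights from storage.
from concurrent.futures import ThreadPoolExecutor

def stream_forward(layers, hidden_state, load_from_storage):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off the first load before the loop starts
        pending = pool.submit(load_from_storage, layers[0])
        for i, layer in enumerate(layers):
            weights = pending.result()  # block until the prefetch finishes
            if i + 1 < len(layers):
                # Start reading the next layer's weights immediately...
                pending = pool.submit(load_from_storage, layers[i + 1])
            # ...so this compute overlaps with that I/O
            hidden_state = layer.forward(hidden_state, weights)
    return hidden_state
```

With reads at 6 GB/s and per-layer compute in the same ballpark, the overlap hides most of the read latency; the loop is bound by whichever of the two is slower.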

You're not going to get ChatGPT-speed responses. But you can get a token every few seconds, which is... actually usable for a lot of tasks.

Aggressive Quantization Is the Other Half

Flash offloading alone isn't enough. You also need to squeeze the model down as much as possible. The state of the art here has gotten surprisingly good:

```python
# Example: quantizing a model to 2-4 bits with grouped quantization
import torch

def quantize_tensor_grouped(tensor, bits=4, group_size=128):
    """Quantize weights in groups for better accuracy retention."""
    # Reshape into groups (assumes numel is divisible by group_size)
    orig_shape = tensor.shape
    tensor = tensor.reshape(-1, group_size)

    # Compute per-group scale and zero point
    t_min = tensor.min(dim=1, keepdim=True).values
    t_max = tensor.max(dim=1, keepdim=True).values

    scale = (t_max - t_min) / (2**bits - 1)
    scale = torch.clamp(scale, min=1e-8)  # avoid divide-by-zero on constant groups
    zero_point = t_min

    # Quantize to unsigned integers in [0, 2**bits - 1]
    quantized = torch.round((tensor - zero_point) / scale).clamp(0, 2**bits - 1)

    return quantized.to(torch.uint8), scale, zero_point, orig_shape
```

Grouped quantization (computing scale/zero-point per group of 128 weights instead of per-tensor) keeps quality surprisingly high even at 3-4 bits. A 400B model at 3-bit quantization comes down to roughly 150GB — still huge, but manageable for flash streaming.
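One way to see what the per-group scales buy you is to run the round trip. Here's a sketch of a matching dequantizer plus a sanity check (the quantize step is restated inline so the snippet is self-contained); the worst-case reconstruction error is half a quantization step, scale / 2, per weight:

```python
import torch

def dequantize_tensor_grouped(quantized, scale, zero_point, orig_shape):
    """Invert the per-group affine mapping and restore the original shape."""
    return (quantized.float() * scale + zero_point).reshape(orig_shape)

# Round-trip sanity check at 4 bits, group size 128
w = torch.randn(256, 128)
groups = w.reshape(-1, 128)
t_min = groups.min(dim=1, keepdim=True).values
t_max = groups.max(dim=1, keepdim=True).values
scale = (t_max - t_min) / 15            # 2**4 - 1 quantization levels
q = torch.round((groups - t_min) / scale).clamp(0, 15).to(torch.uint8)

recon = dequantize_tensor_grouped(q, scale, t_min, w.shape)
assert (w - recon).abs().max() <= scale.max() / 2 + 1e-5  # half-step error bound
```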

The Apple Neural Engine Angle

The demo that went viral used the ANEMLL framework, which is specifically designed to run LLM inference on Apple's Neural Engine (ANE). This matters because most on-device inference frameworks target the GPU, but Apple's ANE is actually better suited for certain operations:

  • Higher throughput for matrix multiplications at lower precision
  • Lower power consumption than running the same ops on the GPU
  • Dedicated hardware that doesn't compete with the rest of the system

The ANE has historically been underutilized for LLMs because it's picky about tensor shapes and memory layouts. Frameworks like ANEMLL and coremltools handle that translation.

```python
# Getting started with on-device conversion on Apple hardware.
# coremltools exposes a Python API rather than a CLI; `traced_model`
# here is a torch.jit.trace'd module you supply, and the input shape
# is an example, not a requirement.
import coremltools as ct

# Step 1: Convert your (already quantized) model to Core ML format
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=(1, 128))],  # your token-ID input shape
    compute_units=ct.ComputeUnit.ALL,  # lets Core ML schedule ANE + GPU + CPU
)
mlmodel.save("model.mlpackage")

# Step 2: Split the model into per-layer chunks for streaming.
# There's no standard tool for this step; frameworks like ANEMLL ship
# their own chunking scripts that export one layer per .mlpackage.
```

Practical Takeaways for Developers

Okay, so should you go ship a 400B model in your iOS app? Probably not. But here's what actually matters:

What's realistic today

  • 7B-13B models run comfortably on recent iPhones and Android flagships at usable speeds (10-20 tokens/sec)
  • 70B models are feasible with flash offloading on devices with fast NVMe, with slower but acceptable speeds
  • 400B models are a proof of concept: a cool demo, not production-ready
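The "70B is feasible, 400B is a demo" split falls out of the same streaming arithmetic: if each decoded token re-reads the full weight set from storage, read bandwidth caps your token rate. A rough estimator (illustrative numbers, not benchmarks):

```python
# I/O-bound decoding estimate: assume every token streams the whole model once.
def streamed_tokens_per_sec(params_billion, bits, read_gb_per_sec):
    model_gb = params_billion * bits / 8   # quantized weight size in GB
    return read_gb_per_sec / model_gb

# 70B at 4-bit over a 6 GB/s read path:
print(round(1 / streamed_tokens_per_sec(70, 4, 6), 1))  # → 5.8 seconds per token
```

Prefetching and caching hot layers in RAM can beat this floor, but it's the right order of magnitude, and it shows why 200GB-class models land in "token every half-minute" territory.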

How to actually do it

  • Start with a pre-quantized model. Don't quantize yourself unless you have to. Grab GGUF or similar pre-quantized weights.
  • Use an established inference engine. llama.cpp supports Metal on Apple devices and memory-maps model weights, so they page in from storage on demand. Don't reinvent the wheel.
  • Profile your memory budget. Your app needs RAM too. Don't allocate everything to the model.
  • Test on real devices. Simulators lie about memory pressure and thermal throttling.
The gotchas nobody mentions

  • Thermal throttling is real. Running a large model will heat up the device and the OS will throttle your compute. Your first token might come in 2 seconds; your 50th token might take 5.
  • Background kills. iOS will terminate your app if it holds too much memory while backgrounded. You need to handle this gracefully.
  • Battery drain. Users will notice. A sustained inference session can easily drain 1% battery per minute.

Where This Is Heading

The 400B-on-a-phone demo isn't really about running 400B models on phones. It's proof that the boundary between "cloud model" and "on-device model" is blurring fast.

Two years ago, running a 7B model on a phone was the impressive demo. Now it's 400B. The hardware is getting faster, the quantization is getting smarter, and the inference engines are getting more efficient.

For those of us building apps, the practical implication is clear: start learning on-device inference now. The models that fit comfortably on a phone today are already good enough for a huge number of tasks: summarization, code completion, chat, classification. And they run with zero network latency, zero API costs, and full privacy.

The future of LLM deployment isn't cloud OR device. It's both, with smart routing between them. And the phone side of that equation just got a lot more interesting.
