If you've ever tried to fine-tune a modern language model on a consumer GPU, you know the pain. You download the model, load it up, write your training script, hit enter — and immediately get slapped with a CUDA out-of-memory error. It's the most demoralizing wall in local AI development.
The good news: fine-tuning Gemma 4 locally on just 8GB of VRAM is now achievable, thanks to quantized LoRA techniques and tooling improvements from the community. Let me walk you through the exact problem and how to get past it.
Why Fine-Tuning Blows Up Your VRAM
Full fine-tuning a large language model means loading the entire model weights, the optimizer states, the gradients, and the activations all into GPU memory simultaneously. For a model with billions of parameters, you're looking at tens of gigabytes before you even start a forward pass.
Here's the rough math for a model in fp16:
- Model weights: ~2 bytes per parameter
- Adam optimizer states: ~8 bytes per parameter (fp32 momentum + variance)
- Gradients: ~2 bytes per parameter
- Activations: varies, but substantial
So full fine-tuning a 12B-parameter model needs roughly 12B params × 12 bytes/param ≈ 144GB just for weights, optimizer states, and gradients. Your RTX 4060 with 8GB is laughing at you.
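For concreteness, here's that back-of-the-envelope estimate as a few lines of Python (activations excluded; this is a rough sketch, not a precise profiler):

```python
# fp16 full fine-tuning with Adam: the per-parameter byte budget
params = 12e9
bytes_per_param = 2 + 8 + 2  # weights + optimizer states + gradients
total_gb = params * bytes_per_param / 1e9
print(f"{total_gb:.0f} GB")  # → 144 GB
```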
The Fix: QLoRA + 4-bit Quantization
The breakthrough that makes this possible is QLoRA (Quantized Low-Rank Adaptation). Instead of updating every weight in the model, you:
- Quantize the base model to 4-bit and freeze it
- Attach small trainable low-rank adapter matrices (LoRA) to key layers
- Backpropagate through the frozen 4-bit weights, updating only the adapters
This takes your memory requirement from "buy a data center" to "my gaming GPU can handle this."
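Rough numbers for the QLoRA side of the ledger, assuming ~50M trainable adapter parameters (an assumption for illustration; your exact count depends on rank and target modules):

```python
# QLoRA memory sketch: frozen 4-bit base + full-precision state for adapters only
params, trainable = 12e9, 50e6
base_4bit = params * 0.5                 # frozen weights at 4 bits each
adapter_state = trainable * (2 + 8 + 2)  # only adapters carry grads + Adam state
total_gb = (base_4bit + adapter_state) / 1e9
print(f"{total_gb:.1f} GB")  # ≈ 6.6 GB before activations and CUDA overhead
```

Suddenly 8GB of VRAM is in play, with headroom left for activations if you keep sequence lengths and batch sizes modest.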
Step-by-Step: Setting Up Local Fine-Tuning
Prerequisites
You'll need:
- A GPU with at least 8GB VRAM (RTX 3060 12GB, RTX 4060 8GB, etc.)
- Python 3.10+
- CUDA 12.1+ installed
- At minimum 16GB of system RAM (32GB recommended)
Install the Dependencies
```bash
# Create a clean virtual environment first — trust me on this
python -m venv gemma-ft
source gemma-ft/bin/activate

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install the core fine-tuning libraries
pip install transformers datasets accelerate peft bitsandbytes trl
```

The `bitsandbytes` library handles the 4-bit quantization. `peft` provides the LoRA implementation. `trl` gives us a clean training interface with `SFTTrainer`.
Load the Model in 4-bit
Here's where the magic happens. We load Gemma 4 quantized to 4-bit NormalFloat precision:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

model_id = "google/gemma-4-12b"  # check Hugging Face for exact model ID

# 4-bit quantization config — NF4 is the sweet spot for quality vs memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,         # nested quantization saves ~0.4GB
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training — handles gradient checkpointing quirks
model = prepare_model_for_kbit_training(model)
```

That `bnb_4bit_use_double_quant=True` flag is doing more than you'd think. It applies a second round of quantization to the quantization constants themselves, saving roughly 0.4GB on a 12B model. When you're fighting for every megabyte, that matters.
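If you're curious where that saving comes from, here's the block-size arithmetic, a sketch following the scheme described in the QLoRA paper (one fp32 scale per 64-weight block, double-quantized to 8-bit scales with a second-level fp32 constant per 256 blocks); the exact figure depends on block sizes and dtype choices:

```python
# Quantization-constant overhead, in bits per parameter
single = 32 / 64                   # one fp32 scale per 64-weight block
double = 8 / 64 + 32 / (64 * 256)  # 8-bit scales + fp32 second-level constant
print(f"{single - double:.2f} bits/param saved")  # → 0.37 bits/param saved
```

At 12B parameters that works out to roughly half a gigabyte, in the same ballpark as the figure above.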
Configure LoRA Adapters
```python
lora_config = LoraConfig(
    r=16,           # rank — 8-32 is the usual range
    lora_alpha=32,  # scaling factor, typically 2x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expect something like: trainable params: ~50M || all params: ~12B || 0.4%
```

You're only training about 0.4% of the total parameters. That's the key insight — you don't need to touch every weight to meaningfully change model behavior.
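Where does that ~50M come from? Each adapted linear layer of shape d_in × d_out gains two small trainable matrices, A (d_in × r) and B (r × d_out), while the base weight stays frozen. A quick sketch with a hypothetical hidden size (check the model's config.json for the real dimensions):

```python
def lora_param_count(d_in, d_out, r=16):
    # A: d_in x r plus B: r x d_out; both trainable, base weight frozen
    return r * (d_in + d_out)

hidden = 4096  # hypothetical dimension for illustration only
print(lora_param_count(hidden, hidden))  # one square projection → 131072 params
```

Multiply that across seven target modules per layer and a few dozen layers and you land in the tens of millions, a rounding error next to 12B.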
Train It
```python
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./gemma4-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,  # keep this at 1 for 8GB
    gradient_accumulation_steps=8,  # effective batch size = 8
    gradient_checkpointing=True,    # trades compute for memory
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    optim="paged_adamw_8bit",       # 8-bit optimizer saves more VRAM
    max_grad_norm=0.3,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=your_dataset,  # your formatted dataset here
    tokenizer=tokenizer,
    args=training_args,
    max_seq_length=1024,  # reduce if still OOM
)

trainer.train()
```

A couple of things to call out:

- `gradient_checkpointing=True` is non-negotiable on 8GB. It recomputes activations during the backward pass instead of storing them, trading ~30% more compute time for massive memory savings.
- `paged_adamw_8bit` pages optimizer state out to CPU RAM when GPU memory gets tight. It's slower but prevents OOM crashes.
- `per_device_train_batch_size=1` with `gradient_accumulation_steps=8` gives you an effective batch size of 8 without needing the VRAM for 8 samples.
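The accumulation bookkeeping is simple enough to show in miniature. This is a plain-Python sketch of what the Trainer does for you, not its actual code:

```python
# Gradient accumulation in miniature: grads pile up across micro-batches,
# and the optimizer steps once per `accum` of them
accum = 8
grad_sum, optimizer_steps = 0.0, 0
for i, g in enumerate([1.0] * 16):  # pretend per-sample gradients
    grad_sum += g / accum           # loss scaled by 1/accum so grads average
    if (i + 1) % accum == 0:
        optimizer_steps += 1        # optimizer.step()
        grad_sum = 0.0              # optimizer.zero_grad()
print(optimizer_steps)  # → 2
```

Same average gradient as a batch of 8, but only one sample's activations in memory at a time.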
Common Pitfalls and Bug Fixes
The bitsandbytes CUDA Version Mismatch
This trips up almost everyone. If you see errors about CUDA libraries not being found:
```bash
# Check what CUDA setup bitsandbytes actually found
python -m bitsandbytes

# Often fixed by setting the library path explicitly
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
```

Gradient Checkpointing + LoRA Conflict
Some versions of peft and transformers have a bug where gradient checkpointing breaks with LoRA if use_reentrant isn't set correctly. If you get errors about tensors not requiring grad:
```python
# Add this to your TrainingArguments
gradient_checkpointing_kwargs={"use_reentrant": False}
```

This one burned me for two hours. The error message gives you zero indication that this is the fix.
OOM Despite Doing Everything Right
If you're still hitting OOM:
- Drop `max_seq_length` to 512 or even 256
- Reduce LoRA rank `r` from 16 to 8
- Make sure no other process is eating GPU memory (`nvidia-smi` is your friend)
- Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to reduce fragmentation
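That last allocator flag is just an environment variable; set it in the shell before launching your training script, since it must be in place before torch initializes CUDA:

```bash
# Reduce fragmentation-driven OOMs; export before starting training
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "$PYTORCH_CUDA_ALLOC_CONF"
```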
Faster Alternative: Unsloth
The community has built dedicated tooling for exactly this use case. Unsloth is an open-source library that optimizes the fine-tuning pipeline with custom CUDA kernels, reportedly achieving 2x faster training and ~60% less memory usage compared to vanilla Hugging Face + PEFT setups. It's worth looking into if you're serious about local fine-tuning on constrained hardware — their GitHub repo has Gemma-specific notebooks.
Is the Quality Actually Good?
Honest answer: it depends on your task. QLoRA fine-tuning is excellent for:
- Domain adaptation (teaching the model your company's terminology or codebase)
- Instruction tuning on a specific format
- Classification tasks where you need consistent structured output
It's less ideal for fundamentally changing the model's knowledge or capabilities. You're adjusting behavior, not rebuilding the brain.
Prevention: Don't Waste Training Runs
Before you burn hours on a fine-tune:
- Start with a small dataset test (100 samples, 1 epoch) to make sure training actually converges
- Monitor your loss curves — if loss flatlines immediately, your learning rate is too low or your data formatting is wrong
- Validate your chat template — Gemma models are picky about special tokens. A malformed template will train the model on garbage
- Save checkpoints frequently — a power blip at epoch 2.9 of 3 is a special kind of pain
The barrier to local fine-tuning keeps dropping. A few years ago you needed a multi-GPU server. Now an 8GB consumer card gets the job done. That's a pretty remarkable shift, and it means more developers can experiment with customized models without cloud bills or API rate limits.
