
How to Fine-Tune Gemma 4 on a GPU With Only 8GB of VRAM

Step-by-step guide to fine-tuning Gemma 4 on a consumer GPU with just 8GB VRAM using QLoRA, 4-bit quantization, and practical tips to avoid common pitfalls.

Alan West
Authon Team

If you've ever tried to fine-tune a modern language model on a consumer GPU, you know the pain. You download the model, load it up, write your training script, hit enter — and immediately get slapped with a CUDA out-of-memory error. It's the most demoralizing wall in local AI development.

The good news: fine-tuning Gemma 4 locally on just 8GB of VRAM is now achievable, thanks to quantized LoRA techniques and tooling improvements from the community. Let me walk you through the exact problem and how to get past it.

Why Fine-Tuning Blows Up Your VRAM

Full fine-tuning of a large language model means loading the entire model weights, the optimizer states, the gradients, and the activations into GPU memory simultaneously. For a model with billions of parameters, you're looking at tens of gigabytes before you even start a forward pass.

Here's the rough math for a model in fp16:

  • Model weights: ~2 bytes per parameter
  • Adam optimizer states: ~8 bytes per parameter (fp32 momentum and variance)
  • Gradients: ~2 bytes per parameter
  • Activations: varies, but substantial

So full fine-tuning a 12B-parameter model needs roughly 12 billion × 12 bytes ≈ 144GB just for weights, optimizer states, and gradients. Your RTX 4060 with 8GB is laughing at you.
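
If you want to sanity-check that arithmetic yourself, here's the back-of-the-envelope version in Python (a rough estimator only; activation memory and framework overhead come on top):

python
# Rough VRAM estimate for full fine-tuning in fp16 with Adam
params = 12e9                        # 12B parameters

weights = params * 2                 # fp16 weights: 2 bytes each
optimizer = params * 8               # Adam momentum + variance in fp32: 8 bytes
gradients = params * 2               # fp16 gradients: 2 bytes

total_gb = (weights + optimizer + gradients) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~144 GB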

The Fix: QLoRA + 4-bit Quantization

The breakthrough that makes this possible is QLoRA (Quantized Low-Rank Adaptation). Instead of updating every weight in the model, you:

  • Quantize the base model to 4-bit precision (shrinks memory footprint dramatically)
  • Freeze those quantized weights
  • Attach small trainable adapter matrices (LoRA) on top
  • Train only the adapters in a higher precision

This takes your memory requirement from "buy a data center" to "my gaming GPU can handle this."
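
If the adapter idea feels abstract, here's the core of it in a few lines of plain PyTorch. This is a toy illustration of the low-rank math, not how peft implements it, and the dimensions are made up:

python
import torch

d, r = 4096, 16                 # hidden size (illustrative) and LoRA rank
W = torch.randn(d, d)           # frozen base weight (4-bit quantized in practice)
A = torch.randn(r, d) * 0.01    # trainable down-projection
B = torch.zeros(d, r)           # trainable up-projection, zero-init so the
                                # adapter starts as a no-op

alpha = 32
x = torch.randn(1, d)
# Forward pass: frozen base path plus the low-rank update, scaled by alpha/r
y = x @ W.T + (x @ A.T @ B.T) * (alpha / r)

# In LoRA, only A and B are trained: 2*d*r params instead of d*d
print(f"trainable: {2 * d * r:,} vs full: {d * d:,}")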

Step-by-Step: Setting Up Local Fine-Tuning

Prerequisites

You'll need:

  • A GPU with at least 8GB VRAM (RTX 3060 12GB, RTX 4060 8GB, etc.)
  • Python 3.10+
  • CUDA 12.1+ installed
  • At least 16GB of system RAM (32GB recommended)

Install the Dependencies

bash
# Create a clean virtual environment first — trust me on this
python -m venv gemma-ft
source gemma-ft/bin/activate

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install the core fine-tuning libraries
pip install transformers datasets accelerate peft bitsandbytes trl

The bitsandbytes library handles the 4-bit quantization. peft provides the LoRA implementation. trl gives us a clean training interface with SFTTrainer.
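
Before going further, it's worth a ten-second check that PyTorch actually sees your GPU and supports bf16 (a small sanity-check script using standard torch APIs):

python
import torch

# Fail fast if the CUDA install is broken
assert torch.cuda.is_available(), "PyTorch can't see the GPU; check your CUDA install"

props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")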

Load the Model in 4-bit

Here's where the magic happens. We load Gemma 4 quantized to 4-bit NormalFloat precision:

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

model_id = "google/gemma-4-12b"  # check Hugging Face for exact model ID

# 4-bit quantization config — NF4 is the sweet spot for quality vs memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,  # nested quantization saves ~0.4GB
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training — handles gradient checkpointing quirks
model = prepare_model_for_kbit_training(model)

That bnb_4bit_use_double_quant=True flag is doing more than you'd think. It applies a second round of quantization to the quantization constants themselves, saving roughly 0.4GB on a 12B model. When you're fighting for every megabyte, that matters.
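
You can confirm what the quantized model actually occupies before training starts. transformers models expose get_memory_footprint(), which is handy here; if the number looks closer to the fp16 size, your quantization config didn't take effect:

python
# Verify the 4-bit load actually shrank the model
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
print(f"CUDA memory reserved: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")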

Configure LoRA Adapters

python
lora_config = LoraConfig(
    r=16,                          # rank — 8-32 is the usual range
    lora_alpha=32,                 # scaling factor, typically 2x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expect something like: trainable params: ~50M || all params: ~12B || trainable%: ~0.4

You're only training about 0.4% of the total parameters. That's the key insight — you don't need to touch every weight to meaningfully change model behavior.
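
To see where that number comes from: each adapted weight matrix of shape (out, in) gains two small matrices of shapes (r, in) and (out, r), so an adapter costs r × (in + out) parameters per matrix. A rough tally follows, with the caveat that the layer count and dimensions below are illustrative placeholders, not Gemma 4's actual config:

python
# Illustrative adapter-size arithmetic (dims are placeholders, not real Gemma 4 values)
r = 16
hidden, intermediate, n_layers = 4096, 14336, 48

# q/k/v/o projections, treated as square here for simplicity (GQA shrinks k/v)
attn = 4 * r * (hidden + hidden)
# gate/up/down projections between hidden and intermediate sizes
mlp = 3 * r * (hidden + intermediate)

total = n_layers * (attn + mlp)
print(f"~{total / 1e6:.0f}M trainable adapter params")  # tens of millions, not billions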

Train It

python
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./gemma4-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,    # keep this at 1 for 8GB
    gradient_accumulation_steps=8,    # effective batch size = 8
    gradient_checkpointing=True,      # trades compute for memory
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    optim="paged_adamw_8bit",         # 8-bit optimizer saves more VRAM
    max_grad_norm=0.3,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=your_dataset,       # your formatted dataset here
    tokenizer=tokenizer,
    args=training_args,
    max_seq_length=1024,              # reduce if still OOM; newer trl moves this into SFTConfig
)

trainer.train()

A couple of things to call out:

  • gradient_checkpointing=True is non-negotiable on 8GB. It recomputes activations during the backward pass instead of storing them, trading ~30% more compute time for massive memory savings.
  • paged_adamw_8bit pages optimizer states out to CPU RAM when GPU memory gets tight. It's slower but prevents OOM crashes.
  • per_device_train_batch_size=1 with gradient_accumulation_steps=8 gives you an effective batch size of 8 without needing the VRAM for 8 samples.
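
One thing the script above glosses over is your_dataset. SFTTrainer expects either ready-to-train text or a chat structure it can template. Here's a minimal sketch of one way to build it; the field names and example content are invented, and depending on your trl version you may also need to pass dataset_text_field="text" to SFTTrainer:

python
from datasets import Dataset

# Hypothetical raw examples; swap in your own data
raw = [
    {"prompt": "What does error code E402 mean in our system?",
     "answer": "E402 indicates a failed payment authorization."},
]

def to_text(example):
    # Render with the tokenizer's chat template so Gemma's special tokens
    # (<start_of_turn> and friends) land in exactly the right places
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

your_dataset = Dataset.from_list(raw).map(to_text)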

Common Pitfalls and Bug Fixes

The bitsandbytes CUDA Version Mismatch

This trips up almost everyone. If you see errors about CUDA libraries not being found:

bash
# Run the built-in diagnostic to see what CUDA setup bitsandbytes actually found
python -m bitsandbytes

# Often fixed by setting the library path explicitly
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

Gradient Checkpointing + LoRA Conflict

Some versions of peft and transformers have a bug where gradient checkpointing breaks with LoRA if use_reentrant isn't set correctly. If you get errors about tensors not requiring grad:

python
# Add this to your TrainingArguments
training_args = TrainingArguments(
    # ... everything else as before ...
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

This one burned me for two hours. The error message gives you zero indication that this is the fix.

OOM Despite Doing Everything Right

If you're still hitting OOM:

  • Drop max_seq_length to 512 or even 256
  • Reduce the LoRA rank r from 16 to 8
  • Make sure no other process is eating GPU memory (nvidia-smi is your friend)
  • Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation (see the snippet below)
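
For the last two items, this is what it looks like in practice; the env var has to be set before the training process starts:

bash
# Reduce allocator fragmentation; set this before launching your training script
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# List every process currently holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv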

Faster Alternative: Unsloth

The community has built dedicated tooling for exactly this use case. Unsloth is an open-source library that optimizes the fine-tuning pipeline with custom CUDA kernels, reportedly achieving 2x faster training and ~60% less memory usage compared to vanilla Hugging Face + PEFT setups. It's worth looking into if you're serious about local fine-tuning on constrained hardware — their GitHub repo has Gemma-specific notebooks.

Is the Quality Actually Good?

Honest answer: it depends on your task. QLoRA fine-tuning is excellent for:

  • Domain adaptation (teaching the model your company's terminology or codebase)
  • Instruction tuning on a specific format
  • Classification tasks where you need consistent structured output

It's less ideal for fundamentally changing the model's knowledge or capabilities. You're adjusting behavior, not rebuilding the brain.

Prevention: Don't Waste Training Runs

Before you burn hours on a fine-tune:

  • Start with a small dataset test (100 samples, 1 epoch) to make sure training actually converges (see the sketch after this list)
  • Monitor your loss curves — if loss flatlines immediately, your learning rate is too low or your data formatting is wrong
  • Validate your chat template — Gemma models are picky about special tokens. A malformed template will train the model on garbage
  • Save checkpoints frequently — a power blip at epoch 2.9 of 3 is a special kind of pain
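
For that first item, a smoke test can be as simple as slicing the dataset and running a single cheap epoch. A sketch reusing the objects defined earlier:

python
# Quick convergence check: 100 samples, 1 epoch, log every step
smoke_args = TrainingArguments(
    output_dir="./gemma4-smoke-test",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=1,                  # watch the loss at every step
    optim="paged_adamw_8bit",
)

smoke_trainer = SFTTrainer(
    model=model,
    train_dataset=your_dataset.select(range(100)),  # first 100 samples
    tokenizer=tokenizer,
    args=smoke_args,
    max_seq_length=512,
)
smoke_trainer.train()   # loss should visibly decrease; if it flatlines, debug first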

The barrier to local fine-tuning keeps dropping. A few years ago you needed a multi-GPU server. Now an 8GB consumer card gets the job done. That's a pretty remarkable shift, and it means more developers can experiment with customized models without cloud bills or API rate limits.
