Fine-Tuning Llama-4-3B for Privacy-First Edge Apps [2026]
Most AI workloads still route every inference through a third-party API—shipping your users' data to a remote server they don't control. Llama-4-3B changes that equation. At 3 billion parameters, it fits comfortably on a laptop GPU, a Jetson Orin, or a high-end smartphone SoC. Fine-tune it on your domain data, quantize it, and you have a capable, privacy-preserving model that never phones home.
This tutorial walks through the complete pipeline: pulling the base model, QLoRA fine-tuning, and packaging a GGUF binary ready for Ollama—entirely on a single machine with zero cloud dependency.
Prerequisites
- Hardware: NVIDIA GPU with ≥8 GB VRAM (16 GB recommended). Apple Silicon (M3+) can run LoRA fine-tuning in bf16 via the MPS backend, but the 4-bit bitsandbytes path used below requires CUDA
- OS: Ubuntu 22.04+ or macOS 14+
- Python: 3.11 or 3.12
- CUDA: 12.1+ (Linux / Windows GPU path)
- Disk: ~20 GB free — model weights + dataset + GGUF output
- Accounts: Hugging Face account with a read token; Llama-4 access granted at hf.co/meta-llama
- Ollama: v0.4+ installed locally for the deployment step
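The checklist above can be verified with a short preflight script before installing anything. This is a minimal sketch; the thresholds mirror the requirements listed (Python 3.11+, ~20 GB free disk) and are parameters you can adjust:

```python
import shutil
import sys

def preflight(min_python=(3, 11), min_free_gb=20, path="."):
    """Return a list of human-readable problems; an empty list means good to go."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {sys.version_info[0]}.{sys.version_info[1]} "
            f"< required {min_python[0]}.{min_python[1]}"
        )
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free, need ~{min_free_gb} GB")
    return problems

if __name__ == "__main__":
    issues = preflight()
    print("OK" if not issues else "\n".join(issues))
```

GPU visibility is checked separately in Step 1 once torch is installed.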
Step 1 — Environment Setup
Create an isolated environment and pin key library versions. The bitsandbytes / transformers compatibility matrix is notoriously brittle—these pins are tested together.
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
# torch comes from the cu121 index; everything else from PyPI
# (--index-url applies to the whole pip invocation, so keep them separate)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install \
transformers==4.47.0 \
peft==0.13.0 \
trl==0.12.0 \
bitsandbytes==0.44.1 \
datasets==3.2.0 \
accelerate==1.1.0 \
huggingface_hub
huggingface-cli login # paste your HF read token
Verify CUDA is visible before proceeding:
python -c "import torch; print(torch.cuda.get_device_name(0))"
# Expected: NVIDIA RTX 4090 (or your device name)
Step 2 — Load & Quantize the Base Model
QLoRA loads the base weights in 4-bit NF4 precision, freezes them, and trains only a small set of low-rank adapter matrices. This reduces peak VRAM from ~6 GB (bf16) to ~2.3 GB for Llama-4-3B, leaving headroom for activations and optimizer states.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
MODEL_ID = "meta-llama/Llama-4-3B-Instruct"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # nested quant saves ~0.4 GB extra
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
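A back-of-the-envelope check of the ~6 GB vs. ~2.3 GB figures above. The weight math is exact (2 bytes per bf16 weight, 4 bits per NF4 weight); the overhead term covering quantization constants, higher-precision embeddings, and runtime buffers is an assumption:

```python
params = 3.0e9  # Llama-4-3B parameter count

bf16_gb = params * 2 / 1e9    # 2 bytes per weight
nf4_gb = params * 0.5 / 1e9   # 4 bits per weight
# quantization constants, non-quantized layers, and buffers sit on top of
# raw weight storage (assumed figure, reduced further by double quant)
overhead_gb = 0.8

print(f"bf16 weights: {bf16_gb:.1f} GB")                          # 6.0 GB
print(f"nf4 weights + overhead: {nf4_gb + overhead_gb:.1f} GB")   # 2.3 GB
```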
Step 3 — Prepare Your Training Dataset
Format your domain data as instruction-response pairs using the ChatML template that Llama-4-Instruct was trained with. If your data contains PII, proprietary IDs, or sensitive customer records, sanitize it first. Our Data Masking Tool can pseudonymize names, emails, and numeric identifiers before they ever enter the training pipeline—keeping your compliance team happy without altering data structure or token distribution.
from datasets import Dataset
raw = [
{"instruction": "Summarize this support ticket.",
"response": "The user reports a login failure after the 2.3.1 update..."},
# ... your domain examples
]
def format_chatml(example):
return {
"text": (
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
f"<|im_start|>user\n{example['instruction']}<|im_end|>\n"
f"<|im_start|>assistant\n{example['response']}<|im_end|>"
)
}
dataset = Dataset.from_list(raw).map(format_chatml)
dataset = dataset.train_test_split(test_size=0.1)
Aim for 200–500 high-quality examples for task specialization. More than 2,000 rarely improves a 3B model without also tuning the learning rate schedule—quality beats quantity at this scale.
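Before committing to a run, a quick sanity pass over the formatted dataset catches the two most common problems: too few examples and formatted texts that exceed your planned max_seq_length. A minimal sketch; the 4-characters-per-token heuristic is a rough assumption, not the real tokenizer:

```python
def dataset_report(examples, max_seq_length=2048, chars_per_token=4):
    """examples: list of dicts with a 'text' field, as produced by format_chatml."""
    lengths = [len(e["text"]) // chars_per_token for e in examples]
    too_long = sum(1 for n in lengths if n > max_seq_length)
    return {
        "count": len(examples),
        "approx_max_tokens": max(lengths),
        "over_limit": too_long,
    }

report = dataset_report([{"text": "x" * 8192}, {"text": "y" * 400}])
print(report)
```

Examples over the limit are silently truncated during training, so either trim them or raise max_seq_length.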
Step 4 — Configure QLoRA
Inject LoRA adapters into the attention projection layers only. Targeting q_proj and v_proj is the standard starting point; adding k_proj and o_proj helps for complex reasoning tasks at a slight VRAM cost.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 3,008,751,616 || trainable%: 0.2265
Only 0.23% of parameters are updated—this is the core efficiency win of QLoRA. The frozen base weights stay in 4-bit; adapters train in bf16.
Step 5 — Run the Fine-Tuning Loop
SFTTrainer from TRL handles sequence packing, loss masking on prompt tokens, and gradient checkpointing in one call. The settings below target a single 24 GB GPU; halve per_device_train_batch_size for 8 GB cards.
from trl import SFTTrainer, SFTConfig
training_args = SFTConfig(
output_dir="./llama4-3b-finetuned",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8, # effective batch = 16
warmup_steps=50,
learning_rate=2e-4,
bf16=True,
logging_steps=10,
save_strategy="epoch",
optim="paged_adamw_8bit",
max_seq_length=2048,
dataset_text_field="text",
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
tokenizer=tokenizer,
)
trainer.train()
A 500-example run completes in ~18 minutes on an RTX 4090, or ~55 minutes on an RTX 3080. Watch the training loss curve: it should fall from roughly 2.4 → below 1.1 by epoch 2 for a well-structured dataset. Stagnation above 1.5 usually signals a template mismatch—check your ChatML delimiters.
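That stagnation check can be automated against the trainer's log history (trainer.state.log_history is a list of dicts carrying a "loss" key at each logging step); the 1.5 threshold mirrors the rule of thumb above:

```python
def loss_stagnated(log_history, threshold=1.5):
    """True if training loss never dropped below `threshold`,
    which usually signals a chat-template mismatch."""
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    return bool(losses) and min(losses) >= threshold

# Example: a healthy run vs. a stagnating one
healthy = [{"loss": 2.4}, {"loss": 1.6}, {"loss": 1.05}]
stuck = [{"loss": 2.4}, {"loss": 1.9}, {"loss": 1.8}]
print(loss_stagnated(healthy), loss_stagnated(stuck))  # False True
```

Call it as loss_stagnated(trainer.state.log_history) after trainer.train() returns.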
Step 6 — Export to GGUF & Deploy with Ollama
Merge the LoRA adapter back into the base weights, then convert to GGUF Q4_K_M—the sweet spot between quality and file size at ~1.9 GB on disk.
# 1. Merge adapter weights into base model
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "./llama4-3b-finetuned/checkpoint-final")
merged = merged.merge_and_unload()
merged.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
# 2. Convert to GGUF f16 (requires llama.cpp cloned alongside;
#    convert_hf_to_gguf.py emits f16/bf16/q8_0, not K-quants)
python llama.cpp/convert_hf_to_gguf.py ./merged-model \
--outfile llama4-3b-f16.gguf \
--outtype f16
# 3. Quantize to Q4_K_M with llama-quantize (path assumes a cmake build)
llama.cpp/build/bin/llama-quantize llama4-3b-f16.gguf llama4-3b-custom.gguf Q4_K_M
# 4. Import into Ollama
echo 'FROM ./llama4-3b-custom.gguf' > Modelfile
ollama create llama4-3b-custom -f Modelfile
# 5. Run a quick inference test
ollama run llama4-3b-custom "Summarize this support ticket: [paste sample]"
Key Takeaway: Why Q4_K_M?
Q4_K_M uses a mixed-precision scheme—most layers at 4-bit, attention and feed-forward gate weights at 6-bit—that recovers the majority of quality lost in naive INT4 quantization. On standard benchmarks it scores within 2–4% of the fp16 reference while cutting inference memory by 60%. For edge deployments where 8 GB is the ceiling, this tradeoff is almost always the right call. Avoid Q2_K for instruction-following tasks; the coherence degradation at that depth is clearly perceptible.
Verification & Expected Output
After ollama run, you should see responses within 1–2 seconds TTFT on a modern CPU, or sub-300 ms on GPU. Run these sanity checks:
# Throughput check — add --verbose flag
ollama run llama4-3b-custom --verbose "What is 17 multiplied by 23?"
# Expected:
# eval rate: 35–42 tokens/s (Apple M3 Max, CPU)
# eval rate: 140–160 tokens/s (RTX 4090, GPU)
# Confirm zero external egress during inference
sudo tcpdump -i any port 443 &
ollama run llama4-3b-custom "Describe our return policy."
# Expected: zero HTTPS packets to external hosts captured
If output looks coherent and domain-appropriate, the adapter merged correctly. If responses revert to generic Llama behavior, the checkpoint path in the merge step is likely wrong—open ./llama4-3b-finetuned/checkpoint-final/adapter_config.json and confirm base_model_name_or_path matches your MODEL_ID exactly.
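That adapter_config.json check can be scripted. This sketch assumes only that the file contains a base_model_name_or_path key, which PEFT writes when saving an adapter:

```python
import json

def check_adapter(config_path, expected_base):
    """Raise if the adapter was trained against a different base model."""
    with open(config_path) as f:
        cfg = json.load(f)
    actual = cfg.get("base_model_name_or_path")
    if actual != expected_base:
        raise ValueError(
            f"adapter was trained on {actual!r}, expected {expected_base!r}"
        )
    return True

# check_adapter("./llama4-3b-finetuned/checkpoint-final/adapter_config.json",
#               "meta-llama/Llama-4-3B-Instruct")
```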
Troubleshooting Top-3
1. CUDA Out of Memory during training
Reduce per_device_train_batch_size to 1 and increase gradient_accumulation_steps to 16 to keep effective batch size stable. Also enable gradient_checkpointing=True in SFTConfig—it trades compute for memory and typically saves 30–40% VRAM at the cost of a ~20% slower step time. If you're still hitting the wall, drop max_seq_length from 2048 to 1024; most domain fine-tuning tasks don't need long context.
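Collected as a set of SFTConfig overrides for an 8 GB card (a sketch; merge these into the Step 5 config, and note the effective batch size stays at 16):

```python
low_vram_overrides = dict(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # 1 x 16 keeps effective batch = 16
    gradient_checkpointing=True,     # ~30-40% less VRAM, ~20% slower steps
    max_seq_length=1024,             # halve context if still hitting OOM
)
# training_args = SFTConfig(output_dir="./llama4-3b-finetuned",
#                           **low_vram_overrides, ...)
```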
2. bitsandbytes import error on Linux
Run ldconfig after CUDA installation and verify libcuda.so.1 is on LD_LIBRARY_PATH. The most common cause is a system CUDA install shadowing the pip-installed runtime. Export BNB_CUDA_VERSION=121 to force correct detection. On Ubuntu, also confirm the nvidia-cuda-toolkit package version matches your driver.
3. Model responds in wrong style or language after merge
The ChatML template must exactly match what the instruct model expects. Verify your format_chatml function produces <|im_start|> / <|im_end|> delimiters—not the older [INST] / [/INST] format used by Llama 2. A template mismatch here is the leading cause of quality regression post-merge. Print three formatted examples before training and compare against the model card's expected format.
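A quick guard you can run on the output of format_chatml before training; the delimiters are taken from the template in Step 3:

```python
def check_template(formatted_text):
    """Raise if the text is missing ChatML delimiters or carries Llama-2-style ones."""
    required = ("<|im_start|>", "<|im_end|>")
    forbidden = ("[INST]", "[/INST]")
    for token in required:
        if token not in formatted_text:
            raise ValueError(f"missing ChatML delimiter: {token}")
    for token in forbidden:
        if token in formatted_text:
            raise ValueError(f"found Llama-2-style delimiter: {token}")
    return True

sample = ("<|im_start|>user\nHello<|im_end|>\n"
          "<|im_start|>assistant\nHi!<|im_end|>")
print(check_template(sample))  # True
```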
What's Next
You now have a domain-specific, fully local Llama-4-3B model running with zero cloud dependency. Engineers typically extend this pipeline in three directions:
- Structured output: Add a JSON grammar file to Ollama (--format json) to enforce typed responses—critical for tool-calling and agentic pipelines where downstream code parses model output.
- Multi-adapter serving: Use PEFT's load_adapter to hot-swap domain adapters at inference time without reloading the 3B base. One model binary, many specializations, minimal VRAM overhead.
- Continuous fine-tuning: Pipe new labeled examples weekly through the same pipeline. Version adapter checkpoints in Git alongside application code so rollbacks are trivial.
- Benchmarking: Run lm-evaluation-harness on a held-out domain test set before and after fine-tuning to quantify improvement with real numbers, not vibes.
The tooling around small model orchestration has matured to the point where what once required a dedicated ML platform team can now ship in a single sprint—and stay entirely within your security perimeter. The remaining barrier is data quality, not infrastructure.