Open-Source LLM Fine-Tuning with LoRA & QLoRA [2026]
Bottom Line
Use LoRA when the base model already fits in memory and you want the simplest path. Use QLoRA when memory is the bottleneck: the training flow is slightly stricter, but the VRAM savings are substantial.
Key Takeaways
- LoRA updates only adapter weights; the base model stays frozen.
- QLoRA combines 4-bit quantization with LoRA adapters to cut VRAM further.
- A practical 2026 stack is transformers 5.5.4, peft 0.19.1, trl 1.1.0, and bitsandbytes 0.49.2.
- For QLoRA training, use NF4, bf16 compute, and target_modules="all-linear".
- Your fastest sanity check is a saved adapter plus a short generation that follows the fine-tuning style.
Fine-tuning an open-source LLM no longer means updating billions of weights or renting oversized GPU nodes. As of April 28, 2026, the practical path is to keep the base model frozen and train compact adapters with LoRA or QLoRA. The two methods share the same basic idea, but they differ sharply in memory pressure, setup constraints, and the kind of hardware they unlock. This guide walks through both with current Hugging Face APIs and a minimal training workflow.
| Dimension | LoRA | QLoRA | Edge |
|---|---|---|---|
| Base model load | Standard precision load | 4-bit quantized load | QLoRA |
| VRAM usage | Lower than full fine-tuning | Lowest of the two | QLoRA |
| Training simplicity | Simpler | More moving parts | LoRA |
| Throughput | Usually faster | Often slower due to quantization overhead | LoRA |
| Best fit | When the model already fits | When memory is the bottleneck | It depends |
Prerequisites
Before you start
- Python 3.10+. Official docs for transformers, peft, trl, and bitsandbytes all support current Python 3.10+ workflows.
- PyTorch 2.4+ is the current baseline documented by transformers.
- bitsandbytes is the key dependency for QLoRA-style 4-bit loading. Current docs list support across NVIDIA GPUs, Intel XPU, Intel Gaudi, and CPU, with NVIDIA CUDA support spanning 11.8 through 13.0.
- A supervised dataset in JSONL with one `text` field per row is the simplest input for SFTTrainer.
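Hand-writing JSONL invites escaping mistakes with embedded newlines and quotes. One way to avoid that is to generate the file with the standard library; the rows below are placeholders for your own data:

```python
import json

# Hypothetical instruction/response pairs; replace with your own data.
rows = [
    {"text": "### Instruction:\nExplain idempotency in REST APIs.\n"
             "### Response:\nRepeating the same request has the same effect as sending it once."},
    {"text": "### Instruction:\nSummarize what a feature flag does.\n"
             "### Response:\nA feature flag toggles behavior without a redeploy."},
]

# One JSON object per line is exactly the shape load_dataset("json", ...) expects.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Because `json.dumps` handles the escaping, every row is guaranteed to parse back cleanly.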
Choose LoRA or QLoRA
Choose LoRA when:
- Your base model already fits in GPU memory without quantization.
- You want the shortest path from dataset to adapter checkpoint.
- You care more about cleaner debugging and speed than absolute memory efficiency.
Choose QLoRA when:
- Your target model is too large for standard loading on your hardware.
- You are fine-tuning on a single smaller GPU and need aggressive memory savings.
- You want broad adapter coverage by targeting all linear layers instead of only attention projections.
A good rule of thumb is simple: start with LoRA if it fits, switch to QLoRA if it does not. You will spend less time debugging quantization edge cases, and you can still keep training costs low because the backbone stays frozen.
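To see why quantization changes the calculus, a back-of-envelope sketch of weight memory alone helps. This ignores activations, gradients, optimizer state, and quantization metadata, so treat it as a lower bound, not a sizing tool:

```python
def approx_weight_gib(params_billions: float, bits_per_param: float) -> float:
    """Approximate GiB needed just to hold the model weights."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1024**3

# A 7B-parameter model, weights only:
print(round(approx_weight_gib(7, 16), 1))  # bf16 load: 13.0 GiB
print(round(approx_weight_gib(7, 4), 1))   # 4-bit load: 3.3 GiB
```

The 4x gap on weights alone is what lets QLoRA fit models on consumer GPUs that a standard-precision load would reject outright.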
Step-by-Step Tutorial
Step 1: Create a clean environment and install the stack
```bash
python -m venv .venv
source .venv/bin/activate
pip install -U "transformers[torch]" peft trl bitsandbytes accelerate datasets
```

This matches the officially documented package flow.
Step 2: Prepare a minimal training file
```json
{"text":"### Instruction:\nExplain idempotency in REST APIs.\n### Response:\nIdempotency means repeating the same request has the same effect as sending it once."}
{"text":"### Instruction:\nSummarize what a feature flag does.\n### Response:\nA feature flag lets you enable or disable behavior without redeploying code."}
```

Save that as `train.jsonl`. The datasets loader can ingest local JSON or JSONL directly with `load_dataset("json", ...)`.
Step 3: Write one script that can run either LoRA or QLoRA
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"
DATA_PATH = "train.jsonl"
OUTPUT_DIR = "artifacts/lora-demo"
USE_QLORA = True

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

if USE_QLORA:
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=quantization_config,
        dtype="auto",
    )
    model = prepare_model_for_kbit_training(model)
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
        target_modules="all-linear",
    )
else:
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
        target_modules=["q_proj", "v_proj"],
    )

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

dataset = load_dataset("json", data_files=DATA_PATH, split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    processing_class=tokenizer,
    args=SFTConfig(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
        num_train_epochs=1,
        logging_steps=10,
        save_steps=100,
        max_length=1024,
        gradient_checkpointing=True,
        report_to="none",
    ),
)

trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
```

Two details matter here. First, official PEFT guidance exposes `prepare_model_for_kbit_training()` for k-bit training prep. Second, official LoRA guidance explicitly recommends `target_modules="all-linear"` for QLoRA-style coverage.
Step 4: Run training
```bash
python train_lora.py
```

For small smoke tests, keep the run intentionally short. You are verifying the pipeline, not chasing benchmark quality yet.
Step 5: Load the adapter and test one prompt
```python
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "artifacts/lora-demo",
    device_map="auto",
    dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("artifacts/lora-demo")

prompt = "### Instruction:\nExplain idempotency in REST APIs.\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the adapter learned anything at all, the output should stay on topic and preserve your response style better than the untouched base model.
Verification and Expected Output
Your first pass should answer three questions: did the model wrap correctly, did training finish, and did the saved adapter actually load?
- Adapter wrap: you should see a `trainable params:` line after `model.print_trainable_parameters()`.
- Training loop: SFTTrainer should emit step logs and finish with a train runtime and train loss summary.
- Saved artifact: the output directory should contain adapter files plus tokenizer artifacts.
```text
trainable params: ... || all params: ... || trainable%: ...
{'train_runtime': ..., 'train_loss': ..., 'epoch': 1.0}
```

For a practical quality check, compare one or two generations from the base model against the adapter-tuned model using the same prompt format. You are looking for behavioral alignment with your dataset, not perfection after a single epoch.
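The `trainable params:` figure can be cross-checked by hand: LoRA factorizes each weight update as B @ A, so a targeted linear layer of shape (d_out, d_in) at rank r adds r*d_in + d_out*r parameters. A quick sketch with a hypothetical layer width:

```python
def lora_params_per_linear(d_in: int, d_out: int, r: int) -> int:
    # LoRA learns the update as B @ A, where A is (r, d_in) and B is (d_out, r).
    return r * d_in + d_out * r

# Hypothetical 4096-wide attention projection at rank 16:
print(lora_params_per_linear(4096, 4096, 16))  # 131072
```

Summed over every targeted layer, this total should roughly match the reported trainable count, which is why the trainable percentage stays well under 1% of the full model.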
Top 3 Troubleshooting Fixes
Out-of-memory during QLoRA loading or training
- Confirm that `USE_QLORA` is actually enabled.
- Use NF4 plus double quant as shown above.
- Lower `max_length`, lower the batch size, or increase `gradient_accumulation_steps`.
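The batch-size trade-off in that last fix is plain arithmetic: gradient accumulation keeps the effective optimizer-step batch constant while only one micro-batch is resident in memory at a time. A minimal sketch:

```python
def effective_batch_size(per_device: int, accum_steps: int, num_devices: int = 1) -> int:
    """Batch size seen by each optimizer step: micro-batches are
    accumulated before the weights are updated."""
    return per_device * accum_steps * num_devices

# The tutorial config: per-device batch 1 with 8 accumulation steps on one GPU.
print(effective_batch_size(1, 8))  # 8
```

So halving the per-device batch and doubling `gradient_accumulation_steps` trades wall-clock time for memory without changing the gradient statistics per update.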
Training runs, but outputs look unchanged
- Check that your dataset rows all contain the expected `text` field.
- Make sure your prompt format at inference matches the prompt format seen during training.
- Run more than a toy number of steps before judging quality.
Model loads, but adapter generation fails or behaves oddly
- Save and reload from the adapter output directory, not from the original base model path.
- Ensure tokenizer files were saved alongside the adapter.
- For QLoRA, keep your inference dtype and tokenizer setup consistent with training.
What's Next
- Increase dataset size before increasing model size. Data quality usually beats brute-force parameter growth.
- Try LoRA first on a model that fits, then switch the same script to QLoRA only if memory forces the move.
- Add a lightweight evaluation harness with a fixed prompt set so you can compare adapter revisions consistently.
- When the workflow is stable, move on to packing, longer contexts, or distributed setups such as FSDP-QLoRA.
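The evaluation-harness idea can start very small: a fixed prompt list run through any generation callable, so two adapter revisions can be diffed side by side. A minimal sketch, where `generate_fn` is a placeholder for whatever wraps your model's `generate` call:

```python
from typing import Callable, Iterable

def run_fixed_prompts(generate_fn: Callable[[str], str],
                      prompts: Iterable[str]) -> dict[str, str]:
    """Map each fixed prompt to its generation so adapter
    revisions can be compared on identical inputs."""
    return {p: generate_fn(p) for p in prompts}

# Usage with a stand-in generator; swap in your adapter-backed model.
fake = lambda p: f"echo: {p}"
results = run_fixed_prompts(fake, ["prompt A", "prompt B"])
print(results["prompt A"])  # echo: prompt A
```

Persist each run's dict (for example as JSON keyed by adapter revision) and regressions become visible as simple text diffs.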
The important engineering lesson is that LoRA and QLoRA are not exotic research tricks anymore. They are the default operational path for adapting open-source LLMs without paying the full fine-tuning tax.
Frequently Asked Questions
What is the difference between LoRA and QLoRA in practice?
LoRA loads the base model at standard precision and trains small adapter matrices while the backbone stays frozen. QLoRA loads the same base model in 4-bit NF4 and trains the same kind of adapters on top, trading some throughput for much lower VRAM.
When should I use target_modules="all-linear"?
Use `target_modules="all-linear"` when you want QLoRA-style coverage across the transformer's linear layers. Official PEFT guidance recommends this because it is more portable than hand-listing module names that differ across architectures.
Do I need bitsandbytes for standard LoRA fine-tuning?
No. bitsandbytes is the dependency for QLoRA-style 4-bit loading. A standard LoRA run on a model that already fits in memory works with transformers, peft, and trl alone.
Why does my fine-tuned model ignore the new behavior I trained?
The usual causes are an inference prompt format that does not match training, a run too short to matter, or dataset rows missing the expected field. SFTTrainer expects each row to carry a single field named `text` for a simple SFTTrainer workflow.