
Open-Source LLM Fine-Tuning with LoRA & QLoRA [2026]

Dillip Chowdary
Tech Entrepreneur & Innovator · April 28, 2026 · 8 min read

Bottom Line

Use LoRA when the base model already fits in memory and you want the simplest path. Use QLoRA when memory is the bottleneck: the training flow is slightly stricter, but the VRAM savings are substantial.

Key Takeaways

  • LoRA updates only adapter weights; the base model stays frozen.
  • QLoRA combines 4-bit quantization with LoRA adapters to cut VRAM further.
  • A practical 2026 stack is transformers 5.5.4, peft 0.19.1, trl 1.1.0, and bitsandbytes 0.49.2.
  • For QLoRA training, use NF4, bf16 compute, and target_modules="all-linear".
  • Your fastest sanity check is a saved adapter plus a short generation that follows the fine-tuning style.

Fine-tuning an open-source LLM no longer means updating billions of weights or renting oversized GPU nodes. As of April 28, 2026, the practical path is to keep the base model frozen and train compact adapters with LoRA or QLoRA. The two methods share the same basic idea, but they differ sharply in memory pressure, setup constraints, and the kind of hardware they unlock. This guide walks through both with current Hugging Face APIs and a minimal training workflow.

Dimension           | LoRA                        | QLoRA                                      | Edge
------------------- | --------------------------- | ------------------------------------------ | ----------
Base model load     | Standard precision load     | 4-bit quantized load                       | QLoRA
VRAM usage          | Lower than full fine-tuning | Lowest of the two                          | QLoRA
Training simplicity | Simpler                     | More moving parts                          | LoRA
Throughput          | Usually faster              | Often slower due to quantization overhead  | LoRA
Best fit            | When the model already fits | When memory is the bottleneck              | It depends
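To make the VRAM rows concrete, here is a rough back-of-envelope estimate. It is a sketch only: the 3B parameter count is an illustrative assumption, it counts base weights alone, and it ignores activations, gradients, optimizer state, and framework overhead.

    # Approximate weight memory for a 3B-parameter base model (illustration only).
    params = 3_000_000_000

    bf16_bytes = params * 2          # 2 bytes per weight in bf16/fp16
    nf4_bytes = params * 0.5 * 1.1   # ~4 bits per weight plus rough ~10% quantization metadata

    print(f"bf16 load: ~{bf16_bytes / 1e9:.1f} GB")
    print(f"NF4 load:  ~{nf4_bytes / 1e9:.1f} GB")  # roughly a quarter of the bf16 figure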

Prerequisites


Watch out: adapter tuning is cheap enough to encourage sloppy dataset handling. If your training data contains tickets, emails, API keys, or customer IDs, redact it first with TechBytes' Data Masking Tool.
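If you do not have a masking tool wired into your pipeline yet, even a crude pre-pass helps. A minimal sketch, assuming the train.jsonl format used later in this guide; the two patterns are illustrative, not exhaustive:

    import json
    import re

    # Illustrative patterns only -- real redaction needs a dedicated tool and review.
    PATTERNS = [
        (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
        (re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{16,}\b"), "<API_KEY>"),
    ]

    with open("train.jsonl") as src, open("train.redacted.jsonl", "w") as dst:
        for line in src:
            row = json.loads(line)
            for pattern, placeholder in PATTERNS:
                row["text"] = pattern.sub(placeholder, row["text"])
            dst.write(json.dumps(row) + "\n")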

Before you start

  • Python 3.10 or newer. The official docs for transformers, peft, trl, and bitsandbytes all target current 3.10+ workflows; the version check after this list confirms what your environment actually has.
  • PyTorch 2.4+ is the current baseline documented by transformers.
  • bitsandbytes is the key dependency for QLoRA-style 4-bit loading. Current docs list support across NVIDIA GPUs, Intel XPU, Intel Gaudi, and CPU, with NVIDIA CUDA support spanning 11.8 through 13.0.
  • A supervised dataset in JSONL with one text field per row is the simplest input for SFTTrainer.
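Before moving on, confirm the installed stack and GPU visibility with a quick check:

    import torch
    import transformers, peft, trl, bitsandbytes

    # Print installed versions plus GPU availability before any training run.
    for name, module in [("transformers", transformers), ("peft", peft),
                         ("trl", trl), ("bitsandbytes", bitsandbytes), ("torch", torch)]:
        print(f"{name:>13}: {module.__version__}")
    print(f"{'CUDA':>13}: {torch.cuda.is_available()}")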

Choose LoRA or QLoRA

Choose LoRA when:

  • Your base model already fits in GPU memory without quantization.
  • You want the shortest path from dataset to adapter checkpoint.
  • You care more about cleaner debugging and speed than absolute memory efficiency.

Choose QLoRA when:

  • Your target model is too large for standard loading on your hardware.
  • You are fine-tuning on a single smaller GPU and need aggressive memory savings.
  • You want broad adapter coverage by targeting all linear layers instead of only attention projections.

A good rule of thumb is simple: start with LoRA if it fits, switch to QLoRA if it does not. You will spend less time debugging quantization edge cases, and you can still keep training costs low because the backbone stays frozen.
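If you want a numeric first filter behind that rule of thumb, the sketch below compares the bf16 weight footprint against total GPU memory. It assumes a single NVIDIA GPU and ignores activations, gradients, and optimizer state, so treat a "fits" answer as tentative:

    import torch

    def bf16_weights_fit(num_params: int, headroom: float = 0.6) -> bool:
        """Crude check: do bf16 weights fit within a fraction of GPU memory?"""
        total = torch.cuda.get_device_properties(0).total_memory
        return num_params * 2 <= total * headroom  # 2 bytes per bf16 weight

    # Example: a 3B-parameter model on the current GPU.
    print("LoRA likely fits" if bf16_weights_fit(3_000_000_000) else "Consider QLoRA")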

Step-by-Step Tutorial

  1. Step 1: Create a clean environment and install the stack

    python -m venv .venv
    source .venv/bin/activate
    pip install -U "transformers[torch]" peft trl bitsandbytes accelerate datasets

    This matches the officially documented package flow. If you paste or tweak the training script heavily, run it through the TechBytes Code Formatter before saving it into your repo.

  2. Step 2: Prepare a minimal training file

    {"text":"### Instruction:\nExplain idempotency in REST APIs.\n### Response:\nIdempotency means repeating the same request has the same effect as sending it once."}
    {"text":"### Instruction:\nSummarize what a feature flag does.\n### Response:\nA feature flag lets you enable or disable behavior without redeploying code."}

    Save that as train.jsonl. The datasets loader can ingest local JSON or JSONL directly with load_dataset("json", ...).
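
    Before training, one sanity pass over the file catches most formatting drift early. A small check, assuming the text field shown above:

    from datasets import load_dataset

    # Fail fast if the file does not parse or rows are missing the "text" field.
    dataset = load_dataset("json", data_files="train.jsonl", split="train")
    assert "text" in dataset.column_names, dataset.column_names
    assert all(isinstance(t, str) and t.strip() for t in dataset["text"])
    print(f"{len(dataset)} rows look usable")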

  3. Step 3: Write one script, train_lora.py, that can run either LoRA or QLoRA

    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from trl import SFTConfig, SFTTrainer
    from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
    
    MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"
    DATA_PATH = "train.jsonl"
    OUTPUT_DIR = "artifacts/lora-demo"
    USE_QLORA = True  # flip to False for plain LoRA on a standard-precision load
    
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    if USE_QLORA:
        # NF4 4-bit load with double quantization; bf16 compute keeps matmuls stable.
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            quantization_config=quantization_config,
            dtype="auto",
        )
        # Readies the quantized model for training (fp32 norms, input gradients enabled).
        model = prepare_model_for_kbit_training(model)
        peft_config = LoraConfig(
            r=16,
            lora_alpha=32,
            lora_dropout=0.05,
            bias="none",
            task_type=TaskType.CAUSAL_LM,
            target_modules="all-linear",
        )
    else:
        model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
        peft_config = LoraConfig(
            r=16,
            lora_alpha=32,
            lora_dropout=0.05,
            bias="none",
            task_type=TaskType.CAUSAL_LM,
            target_modules=["q_proj", "v_proj"],
        )
    
    model = get_peft_model(model, peft_config)  # wrap the frozen base with trainable adapters
    model.print_trainable_parameters()          # only the adapter weights should be trainable
    
    dataset = load_dataset("json", data_files=DATA_PATH, split="train")
    
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        processing_class=tokenizer,
        args=SFTConfig(
            output_dir=OUTPUT_DIR,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=8,
            learning_rate=1e-4,
            num_train_epochs=1,
            logging_steps=10,
            save_steps=100,
            max_length=1024,
            gradient_checkpointing=True,
            report_to="none",
        ),
    )
    
    trainer.train()
    trainer.save_model(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)

    Two details matter here. First, PEFT exposes prepare_model_for_kbit_training() to get a quantized model ready for adapter training. Second, official LoRA guidance explicitly recommends target_modules="all-linear" for QLoRA-style coverage, because hand-listed module names differ across architectures.

  4. Step 4: Run training

    python train_lora.py

    For small smoke tests, keep the run intentionally short. You are verifying the pipeline, not chasing benchmark quality yet.
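
    One low-risk way to keep the run short is to cap optimizer steps instead of reworking epochs. max_steps comes from the underlying TrainingArguments, so SFTConfig accepts it; a smoke-test variant of the config:

    from trl import SFTConfig

    # max_steps overrides num_train_epochs, so the loop stops after 20 steps
    # regardless of dataset size -- enough to verify the pipeline end to end.
    smoke_args = SFTConfig(
        output_dir="artifacts/smoke-test",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        max_steps=20,
        logging_steps=1,
        report_to="none",
    )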

  5. Step 5: Load the adapter and test one prompt

    from transformers import AutoTokenizer
    from peft import AutoPeftModelForCausalLM
    
    # Loads the frozen base model and applies the saved adapter in one call.
    model = AutoPeftModelForCausalLM.from_pretrained(
        "artifacts/lora-demo",
        device_map="auto",
        dtype="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("artifacts/lora-demo")
    
    prompt = "### Instruction:\nExplain idempotency in REST APIs.\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs.to(model.device), max_new_tokens=80)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

    If the adapter learned anything at all, the output should stay on topic and preserve your response style better than the untouched base model.

Pro tip: do one tiny end-to-end run before touching rank, alpha, epochs, or packing. Most failed fine-tuning projects are broken by environment drift or data formatting, not by hyperparameter choice.

Verification and Expected Output

Your first pass should answer three questions: did the model wrap correctly, did training finish, and did the saved adapter actually load?

  • Adapter wrap: you should see a trainable params: line after model.print_trainable_parameters().
  • Training loop: SFTTrainer should emit step logs and finish with a train runtime and train loss summary.
  • Saved artifact: the output directory should contain adapter files plus tokenizer artifacts.
You should see log lines shaped like this:

    trainable params: ... || all params: ... || trainable%: ...
    {'train_runtime': ..., 'train_loss': ..., 'epoch': 1.0}

For a practical quality check, compare one or two generations from the base model against the adapter-tuned model using the same prompt format. You are looking for behavioral alignment with your dataset, not perfection after a single epoch.
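A minimal side-by-side harness, assuming the model ID and output directory from the training script above (note that it loads two models, so it needs roughly double the inference memory):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import AutoPeftModelForCausalLM

    PROMPT = "### Instruction:\nExplain idempotency in REST APIs.\n### Response:\n"

    def generate(model, tokenizer):
        inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=80)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", device_map="auto")
    base_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
    tuned = AutoPeftModelForCausalLM.from_pretrained("artifacts/lora-demo", device_map="auto")
    tuned_tok = AutoTokenizer.from_pretrained("artifacts/lora-demo")

    print("--- base ---\n" + generate(base, base_tok))
    print("--- tuned ---\n" + generate(tuned, tuned_tok))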

Top 3 Troubleshooting Fixes

  1. Out-of-memory during QLoRA loading or training

    • Confirm that USE_QLORA is actually enabled.
    • Use NF4 plus double quant as shown above.
    • Lower max_length, lower batch size, or increase gradient_accumulation_steps.
  2. Training runs, but outputs look unchanged

    • Check that your dataset rows all contain the expected text field.
    • Make sure your prompt format at inference matches the prompt format seen during training.
    • Run more than a toy number of steps before judging quality.
  3. Model loads, but adapter generation fails or behaves oddly

    • Save and reload from the adapter output directory, not from the original base model path.
    • Ensure tokenizer files were saved alongside the adapter.
    • For QLoRA, keep your inference dtype and tokenizer setup consistent with training.
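For the third fix, a quick way to confirm that the expected artifacts actually landed on disk. The file names below follow current PEFT and tokenizer conventions; adjust the path to your output directory:

    from pathlib import Path

    out = Path("artifacts/lora-demo")
    for name in ["adapter_config.json", "adapter_model.safetensors", "tokenizer_config.json"]:
        print(f"{name:<28} {'ok' if (out / name).exists() else 'MISSING'}")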

What's Next

  • Increase dataset size before increasing model size. Data quality usually beats brute-force parameter growth.
  • Try LoRA first on a model that fits, then switch the same script to QLoRA only if memory forces the move.
  • Add a lightweight evaluation harness with a fixed prompt set so you can compare adapter revisions consistently.
  • When the workflow is stable, move on to packing, longer contexts, or distributed setups such as FSDP-QLoRA.

The important engineering lesson is that LoRA and QLoRA are not exotic research tricks anymore. They are the default operational path for adapting open-source LLMs without paying the full fine-tuning tax.

Frequently Asked Questions

What is the difference between LoRA and QLoRA in practice?
LoRA keeps the base model frozen and trains small adapter matrices, but the base model is still loaded in its normal precision. QLoRA adds 4-bit quantization for the frozen base model, which cuts memory further while still training LoRA adapters.
When should I use target_modules="all-linear"?
Use target_modules="all-linear" when you want QLoRA-style coverage across the transformer's linear layers. Official PEFT guidance recommends this because it is more portable than hand-listing module names that differ across architectures.
Do I need bitsandbytes for standard LoRA fine-tuning?
No. bitsandbytes is primarily needed for 8-bit or 4-bit quantized loading, which is what makes QLoRA practical. Plain LoRA can run without it if the full base model already fits in memory.
Why does my fine-tuned model ignore the new behavior I trained?
The two most common causes are dataset formatting drift and too little effective training signal. Make sure your inference prompt structure matches training, and verify that your dataset uses the field expected by the trainer, usually text for a simple SFTTrainer workflow.
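
Related to the target_modules question above, you can enumerate candidate module names directly from a loaded model instead of guessing. A sketch; the names vary by architecture:

    import torch.nn as nn
    from transformers import AutoModelForCausalLM

    # List distinct linear-layer names so you can hand-pick target_modules if needed.
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
    names = {name.split(".")[-1] for name, module in model.named_modules()
             if isinstance(module, nn.Linear)}
    print(sorted(names))  # typically includes q_proj, k_proj, v_proj, o_proj, and MLP projections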
