Mastering LoRA — Fine-tune a 7B Model on a Single Notebook

What if you could fine-tune a 7-billion-parameter model on a single GPU?
Just two years ago, LLM fine-tuning required 8x A100 GPUs and hundreds of gigabytes of memory — a luxury reserved for big tech companies. LoRA (Low-Rank Adaptation) changed the game entirely. For a 7B model, it cuts the trainable parameters to roughly 0.2% of the total while achieving performance on par with full fine-tuning.
In this series, we walk through the entire pipeline — LoRA, QLoRA, evaluation, and deployment — using Qwen 2.5 7B as our target model.
- Part 1 (this post): LoRA theory + first fine-tune
- Part 2: QLoRA + Korean dataset construction
- Part 3: Evaluation + deployment + practical tips
Why Fine-tune at All?
General-purpose LLMs like GPT-4, Claude, and Qwen are decent at everything. But "decent" is not enough for specialized domains.
Prompt engineering can cover some ground, but it has clear limits:
- Token cost: You need to send a lengthy system prompt with every request
- Consistency: The longer the prompt, the higher the chance the model misses instructions
- Knowledge ceiling: Prompting alone cannot inject knowledge the model simply does not have
Fine-tuning modifies the model's weights directly, addressing all three issues at the root.
The Problem with Full Fine-tuning: Memory
How much does it take to fully fine-tune a 7B model?
Model parameters: 7B × 4 bytes (FP32) = 28 GB
Optimizer states (Adam): 7B × 8 bytes = 56 GB
Gradients: 7B × 4 bytes = 28 GB
Activations: ~20-40 GB (varies with batch size)
──────────────────────────────
Total VRAM required: ~130-150 GB
That demands at least two A100 80GB cards. Cloud cost: $6-8 per hour.
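The arithmetic above can be sketched in a few lines (a back-of-the-envelope estimate assuming FP32 weights and standard Adam; the activation range is a rough batch-size-dependent guess, not a formula):

```python
# Back-of-the-envelope VRAM estimate for full fine-tuning (FP32 + Adam).
def full_ft_vram_gb(n_params: float) -> dict:
    gb = 1e9  # decimal GB, matching the "7B x 4 bytes = 28 GB" figures above
    weights = n_params * 4 / gb    # FP32 parameters
    optimizer = n_params * 8 / gb  # Adam: two FP32 states per parameter
    gradients = n_params * 4 / gb  # FP32 gradients
    activations = (20, 40)         # rough range, varies with batch size
    fixed = weights + optimizer + gradients
    return {
        "weights": weights,
        "optimizer": optimizer,
        "gradients": gradients,
        "total_range": (fixed + activations[0], fixed + activations[1]),
    }

est = full_ft_vram_gb(7e9)
print(est["total_range"])  # (132.0, 152.0) -> the ~130-150 GB above
```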
The key question here: do we really need to update all 7B parameters?
LoRA: The Core Idea
LoRA's central insight is surprisingly simple:
The weight update $\Delta W$ produced by fine-tuning has low rank.
Mathematically:
$$W' = W_0 + \Delta W = W_0 + BA$$
Where:
- $W_0$: pre-trained original weights (frozen, not trained)
- $B \in \mathbb{R}^{d \times r}$: small matrix
- $A \in \mathbb{R}^{r \times k}$: small matrix
- $r$: rank (typically 8–64, much smaller than the original dimension $d$)
For example, in an attention layer where $d = k = 4096$, with rank $r = 16$:
Instead of training the full $4096 \times 4096 = 16.7$M parameters, we train only $r(d + k) = 131$K. A 99.2% reduction.
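The parameter counts work out as follows (a quick sanity check, not library code):

```python
# LoRA parameter count vs. full fine-tuning for one 4096 x 4096 weight matrix.
d, k, r = 4096, 4096, 16

full_params = d * k        # train the whole d x k matrix
lora_params = r * (d + k)  # train B (d x r) and A (r x k) instead

print(full_params)                    # 16777216 (~16.7M)
print(lora_params)                    # 131072   (~131K)
print(1 - lora_params / full_params)  # 0.9921875 -> 99.2% reduction
```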
Why Does This Work?
Let's build some intuition. A pre-trained LLM already knows the fundamental structure of language. Fine-tuning is about *nudging* that existing knowledge, not learning from scratch.
Think of it this way: a fluent English speaker learning medical terminology does not need to relearn the entire language. They just need to pick up new vocabulary and expression patterns. That "delta" is low-rank.
Key LoRA Hyperparameters
The ratio lora_alpha / r is the effective scaling factor: the layer output is computed as $W_0 x + \frac{\alpha}{r} BAx$. With lora_alpha=32 and r=16, the scale is 2, meaning LoRA's contribution is amplified by 2×.
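A schematic, scalar version of how that scale enters the forward pass (the stand-in values here are illustrative, not real activations):

```python
# How the LoRA scale combines the frozen output with the low-rank correction.
# Real implementations compute W0 @ x + (alpha / r) * (B @ A) @ x per layer.
lora_alpha, r = 32, 16
scaling = lora_alpha / r
print(scaling)  # 2.0

frozen_out = 1.00  # stand-in for W0 @ x (frozen path)
lora_out = 0.05    # stand-in for (B @ A) @ x (trained path)
combined = frozen_out + scaling * lora_out
print(combined)    # 1.1
```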
Hands-on: LoRA Fine-tuning Qwen 2.5 7B
Time to turn theory into code. The full code is available in the accompanying Jupyter notebook.
Environment Setup
!pip install -q transformers peft datasets accelerate bitsandbytes trl

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer
Loading the Model
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
We load Qwen 2.5 7B Instruct in bfloat16. This alone uses roughly 14 GB of VRAM.
Configuring LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,628,554,240 || trainable%: 0.1787
Out of 7.6B total parameters, only 13.6 million (0.18%) are trainable.
Note that target_modules includes not just attention layers (q/k/v/o_proj) but also FFN layers (gate/up/down_proj). Recent research shows that including FFN layers yields better performance.
Preparing the Dataset
For this walkthrough we use a subset of Open-Orca/OpenOrca. In production, swap this for your own domain-specific data.
dataset = load_dataset("Open-Orca/OpenOrca", split="train[:5000]")
def format_chat(example):
    messages = [
        {"role": "system", "content": example["system_prompt"]},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": text}

dataset = dataset.map(format_chat)
We apply Qwen's chat template to convert examples into the <|im_start|>system\n...<|im_end|> format.
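To make the target format concrete, here is a hand-built sketch of the ChatML layout that apply_chat_template produces for Qwen-style models (an illustration of the format only; always use the tokenizer's template in practice):

```python
# Hand-rolled ChatML rendering, mirroring what apply_chat_template emits
# for Qwen-style models (illustrative sketch, not a tokenizer replacement).
def render_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

text = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(text)
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
```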
Training
training_args = TrainingArguments(
    output_dir="./qwen25-lora",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_8bit",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=1024,
)
trainer.train()
Key configuration notes:
- gradient_accumulation_steps=8: batch 2 × 8 = effective batch size of 16
- gradient_checkpointing=True: trades compute for memory (~30% VRAM savings)
- optim="adamw_8bit": 8-bit Adam shrinks optimizer-state memory to roughly a quarter of FP32 Adam's
- bf16=True: BFloat16 training (supported on A100 / RTX 3090+)
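The arithmetic behind those settings (rough estimates; 8-bit Adam here is modeled as two 1-byte states per parameter versus two 4-byte FP32 states):

```python
# Effective batch size from gradient accumulation.
per_device_batch, grad_accum = 2, 8
effective_batch = per_device_batch * grad_accum
print(effective_batch)  # 16

# Optimizer-state memory for the 13.6M trainable LoRA parameters.
trainable = 13_631_488
fp32_adam_mb = trainable * 8 / 1e6  # two FP32 states per parameter
int8_adam_mb = trainable * 2 / 1e6  # two 8-bit states per parameter
print(round(fp32_adam_mb))  # 109 MB
print(round(int8_adam_mb))  # 27 MB
```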
Saving and Inference
# Save only the LoRA adapter (not the base model)
model.save_pretrained("./qwen25-lora-adapter")
# Inference
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./qwen25-lora-adapter")
messages = [
    {"role": "user", "content": "Explain quantum mechanics to a 5-year-old"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The saved adapter weighs roughly 52 MB — just 0.4% of the base model (14 GB).
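That 52 MB figure checks out against the trainable-parameter count printed earlier (assuming the adapter is stored in FP32, i.e. 4 bytes per parameter):

```python
# Adapter size sanity check: 13,631,488 trainable params stored in FP32.
trainable_params = 13_631_488
bytes_per_param = 4  # FP32 storage; a bf16 adapter would be ~26 MB
size_mib = trainable_params * bytes_per_param / (1024 ** 2)
print(size_mib)  # 52.0
```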
Memory Comparison: Full FT vs LoRA
Measured on the same Qwen 2.5 7B with the same dataset, LoRA wins by a wide margin in efficiency: a fraction of the VRAM and cost, while retaining roughly 98% of full fine-tuning's inference quality. There is no reason to spend 10× the cost for a ~2% quality gap.
Frequently Asked Questions
Q: Is a higher rank always better?
No. The difference between rank=8 and rank=64 is marginal for most tasks. In fact, excessively high rank increases the risk of overfitting. As a rule of thumb:
- Simple tasks (tone adjustment, format compliance): r=8
- Medium tasks (domain adaptation): r=16–32
- Complex tasks (injecting new knowledge): r=32–64
Q: Which layers should I apply LoRA to?
Applying LoRA to all attention + FFN layers is the safest bet. The original paper used only q_proj and v_proj, but subsequent studies have shown that targeting more layers consistently improves performance.
Q: Does LoRA degrade the base model's capabilities?
LoRA never touches the original weights. Remove the adapter, and you get the original model back. This is one of LoRA's greatest strengths.
Coming Up Next: QLoRA + Korean
In Part 1 we covered LoRA theory and a basic fine-tuning run. But even LoRA on Qwen 2.5 7B pushes a 24 GB RTX 3090 to its limits.
In Part 2 we tackle QLoRA (4-bit quantization + LoRA) to fine-tune a 7B model on a T4 with just 16 GB of VRAM. We will also build a Korean-language dataset and use it to measurably improve the model's Korean performance.