Mastering LoRA — Fine-tune a 7B Model on a Single Notebook

What if you could fine-tune a 7-billion-parameter model on a single GPU?
Just two years ago, LLM fine-tuning required 8x A100 GPUs and hundreds of gigabytes of memory — a luxury reserved for big tech companies. LoRA (Low-Rank Adaptation) changed the game entirely. For a 7B model, it cuts the trainable parameters to roughly 0.2% of the total while achieving performance on par with full fine-tuning.
In this series, we walk through the entire pipeline — LoRA, QLoRA, evaluation, and deployment — using Qwen 2.5 7B as our target model.
- Part 1 (this post): LoRA theory + first fine-tune
- Part 2: QLoRA + Korean dataset construction
- Part 3: Evaluation + deployment + practical tips
Why Fine-tune at All?
General-purpose LLMs like GPT-4, Claude, and Qwen are decent at everything. But "decent" is not enough for specialized domains.
Prompt engineering can cover some ground, but it has clear limits:
- Token cost: You need to send a lengthy system prompt with every request
- Consistency: The longer the prompt, the higher the chance the model misses instructions
- Knowledge ceiling: Prompting alone cannot inject knowledge the model simply does not have
Fine-tuning modifies the model's weights directly, addressing all three issues at the root.
The Problem with Full Fine-tuning: Memory
How much does it take to fully fine-tune a 7B model?
Model parameters: 7B × 4 bytes (FP32) = 28 GB
Optimizer states (Adam): 7B × 8 bytes = 56 GB
Gradients: 7B × 4 bytes = 28 GB
Activations: ~20-40 GB (varies with batch size)
──────────────────────────────
Total VRAM required: ~130-150 GB
That demands at least two A100 80GB cards. Cloud cost: $6-8 per hour.
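The arithmetic above can be sketched in a few lines (a back-of-the-envelope estimate assuming FP32 weights and standard Adam; the activation range is a rough batch-size-dependent guess, not a formula):

```python
# Back-of-the-envelope VRAM estimate for full fine-tuning (FP32 + Adam).
def full_ft_vram_gb(n_params: float) -> dict:
    gb = 1e9  # decimal GB, matching the "7B x 4 bytes = 28 GB" figures above
    weights = n_params * 4 / gb    # FP32 parameters
    optimizer = n_params * 8 / gb  # Adam: two FP32 states per parameter
    gradients = n_params * 4 / gb  # FP32 gradients
    activations = (20, 40)         # rough range, varies with batch size
    fixed = weights + optimizer + gradients
    return {
        "weights": weights,
        "optimizer": optimizer,
        "gradients": gradients,
        "total_range": (fixed + activations[0], fixed + activations[1]),
    }

est = full_ft_vram_gb(7e9)
print(est["total_range"])  # (132.0, 152.0) -> the ~130-150 GB above
```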
The key question here: do we really need to update all 7B parameters?
LoRA: The Core Idea
LoRA's central insight is surprisingly simple:
The weight update $\Delta W$ produced by fine-tuning has low rank.
Mathematically:
$$W' = W_0 + \Delta W = W_0 + BA$$
Where:
- $W_0$: pre-trained original weights (frozen, not trained)
- $B \in \mathbb{R}^{d \times r}$: small matrix
- $A \in \mathbb{R}^{r \times k}$: small matrix
- $r$: rank (typically 8–64, much smaller than the original dimension $d$)
For example, in an attention layer where $d = k = 4096$, with rank $r = 16$:
Instead of training the full $4096 \times 4096 = 16.7$M parameters, we train only $r(d + k) = 131$K. A 99.2% reduction.
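The parameter counts work out as follows (a quick sanity check, not library code):

```python
# LoRA parameter count vs. full fine-tuning for one 4096 x 4096 weight matrix.
d, k, r = 4096, 4096, 16

full_params = d * k        # train the whole d x k matrix
lora_params = r * (d + k)  # train B (d x r) and A (r x k) instead

print(full_params)                    # 16777216 (~16.7M)
print(lora_params)                    # 131072   (~131K)
print(1 - lora_params / full_params)  # 0.9921875 -> 99.2% reduction
```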
Why Does This Work?
Let's build some intuition. A pre-trained LLM already knows the fundamental structure of language. Fine-tuning is about *nudging* that existing knowledge, not learning from scratch.
Think of it this way: a fluent English speaker learning medical terminology does not need to relearn the entire language. They just need to pick up new vocabulary and expression patterns. That "delta" is low-rank.
Key LoRA Hyperparameters
The ratio lora_alpha / r is the effective scaling factor: the layer output is computed as $W_0 x + \frac{\alpha}{r} BAx$. With lora_alpha=32 and r=16, the scale is 2, meaning LoRA's contribution is amplified by 2×.
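A schematic, scalar version of how that scale enters the forward pass (the stand-in values here are illustrative, not real activations):

```python
# How the LoRA scale combines the frozen output with the low-rank correction.
# Real implementations compute W0 @ x + (alpha / r) * (B @ A) @ x per layer.
lora_alpha, r = 32, 16
scaling = lora_alpha / r
print(scaling)  # 2.0

frozen_out = 1.00  # stand-in for W0 @ x (frozen path)
lora_out = 0.05    # stand-in for (B @ A) @ x (trained path)
combined = frozen_out + scaling * lora_out
print(combined)    # 1.1
```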
Hands-on: LoRA Fine-tuning Qwen 2.5 7B
Time to turn theory into code. The full code is available in the accompanying Jupyter notebook.
Environment Setup
!pip install -q transformers peft datasets accelerate bitsandbytes trl

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer
Loading the Model
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
We load Qwen 2.5 7B Instruct in bfloat16. This alone uses roughly 14 GB of VRAM.
Configuring LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,628,554,240 || trainable%: 0.1787
Out of 7.6B total parameters, only 13.6 million (0.18%) are trainable.
Note that target_modules includes not just attention layers (q/k/v/o_proj) but also FFN layers (gate/up/down_proj). Recent research shows that including FFN layers yields better performance.
Preparing the Dataset
For this walkthrough we use a subset of Open-Orca/OpenOrca. In production, swap this for your own domain-specific data.
dataset = load_dataset("Open-Orca/OpenOrca", split="train[:5000]")
def format_chat(example):
    messages = [
        {"role": "system", "content": example["system_prompt"]},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": text}

dataset = dataset.map(format_chat)
We apply Qwen's chat template to convert examples into the <|im_start|>system\n...<|im_end|> format.
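To make the target format concrete, here is a hand-built sketch of the ChatML layout that apply_chat_template produces for Qwen-style models (an illustration of the format only; always use the tokenizer's template in practice):

```python
# Hand-rolled ChatML rendering, mirroring what apply_chat_template emits
# for Qwen-style models (illustrative sketch, not a tokenizer replacement).
def render_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

text = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(text)
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
```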
Training
training_args = TrainingArguments(
    output_dir="./qwen25-lora",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    warmup_ratio=0.1,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_8bit",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=1024,
)
trainer.train()
Key configuration notes:
- gradient_accumulation_steps=8: batch 2 × 8 = effective batch size of 16
- gradient_checkpointing=True: trades compute for memory (~30% VRAM savings)
- optim="adamw_8bit": 8-bit Adam shrinks optimizer-state memory to roughly a quarter of FP32 Adam's
- bf16=True: BFloat16 training (supported on A100 / RTX 3090+)
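The arithmetic behind those settings (rough estimates; 8-bit Adam here is modeled as two 1-byte states per parameter versus two 4-byte FP32 states):

```python
# Effective batch size from gradient accumulation.
per_device_batch, grad_accum = 2, 8
effective_batch = per_device_batch * grad_accum
print(effective_batch)  # 16

# Optimizer-state memory for the 13.6M trainable LoRA parameters.
trainable = 13_631_488
fp32_adam_mb = trainable * 8 / 1e6  # two FP32 states per parameter
int8_adam_mb = trainable * 2 / 1e6  # two 8-bit states per parameter
print(round(fp32_adam_mb))  # 109 MB
print(round(int8_adam_mb))  # 27 MB
```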
Saving and Inference
# Save only the LoRA adapter (not the base model)
model.save_pretrained("./qwen25-lora-adapter")
# Inference
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./qwen25-lora-adapter")
messages = [
    {"role": "user", "content": "Explain quantum mechanics to a 5-year-old"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The saved adapter weighs roughly 52 MB — just 0.4% of the base model (14 GB).
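That 52 MB figure checks out against the trainable-parameter count printed earlier (assuming the adapter is stored in FP32, i.e. 4 bytes per parameter):

```python
# Adapter size sanity check: 13,631,488 trainable params stored in FP32.
trainable_params = 13_631_488
bytes_per_param = 4  # FP32 storage; a bf16 adapter would be ~26 MB
size_mib = trainable_params * bytes_per_param / (1024 ** 2)
print(size_mib)  # 52.0
```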
Memory Comparison: Full FT vs LoRA
Measured on the same Qwen 2.5 7B with the same dataset, LoRA wins by a wide margin in efficiency: a fraction of the VRAM and cost, while retaining roughly 98% of full fine-tuning's inference quality. There is no reason to spend 10× the cost for a ~2% quality gap.
Frequently Asked Questions
Q: Is a higher rank always better?
No. The difference between rank=8 and rank=64 is marginal for most tasks. In fact, excessively high rank increases the risk of overfitting. As a rule of thumb:
- Simple tasks (tone adjustment, format compliance): r=8
- Medium tasks (domain adaptation): r=16–32
- Complex tasks (injecting new knowledge): r=32–64
Q: Which layers should I apply LoRA to?
Applying LoRA to all attention + FFN layers is the safest bet. The original paper used only q_proj and v_proj, but subsequent studies have shown that targeting more layers consistently improves performance.
Q: Does LoRA degrade the base model's capabilities?
LoRA never touches the original weights. Remove the adapter, and you get the original model back. This is one of LoRA's greatest strengths.
Coming Up Next: QLoRA + Korean
In Part 1 we covered LoRA theory and a basic fine-tuning run. But even LoRA on Qwen 2.5 7B pushes a 24 GB RTX 3090 to its limits.
In Part 2 we tackle QLoRA (4-bit quantization + LoRA) to fine-tune a 7B model on a T4 with just 16 GB of VRAM. We will also build a Korean-language dataset and use it to measurably improve the model's Korean performance.