
Mastering LoRA — Fine-tune a 7B Model on a Single Notebook

From LoRA theory to Qwen 2.5 7B model setup. 99.8% parameter reduction and 86% memory savings vs full fine-tuning, explained with code.

What if you could fine-tune a 7-billion-parameter model on a single GPU?

Just two years ago, LLM fine-tuning required 8x A100 GPUs and hundreds of gigabytes of memory — a luxury reserved for big tech companies. LoRA (Low-Rank Adaptation) changed the game entirely. For a 7B model it cuts trainable parameters to under 0.2% of the total while achieving performance on par with full fine-tuning.

In this series, we walk through the entire pipeline — LoRA, QLoRA, evaluation, and deployment — using Qwen 2.5 7B as our target model.

Why Fine-tune at All?

General-purpose LLMs like GPT-4, Claude, and Qwen are decent at everything. But "decent" is not enough for specialized domains.

| Scenario | General LLM | Fine-tuned Model |
|---|---|---|
| Legal document summarization | Generic summary, imprecise legal terms | Accurate summaries following case law format |
| Korean customer support | Awkward phrasing, unnatural honorifics | Natural, fluent Korean polite speech |
| Medical chart analysis | General knowledge level | Specialized medical terminology + diagnostic patterns |
| Code review | Generic feedback | Specific feedback aligned with team conventions |

Prompt engineering can cover some ground, but it has clear limits:

  1. Token cost: You need to send a lengthy system prompt with every request
  2. Consistency: The longer the prompt, the higher the chance the model misses instructions
  3. Knowledge ceiling: Prompting alone cannot inject knowledge the model simply does not have

Fine-tuning modifies the model's weights directly, addressing all three issues at the root.

The Problem with Full Fine-tuning: Memory

How much does it take to fully fine-tune a 7B model?

```
Model parameters: 7B × 4 bytes (FP32) = 28 GB
Optimizer states (Adam): 7B × 8 bytes = 56 GB
Gradients: 7B × 4 bytes = 28 GB
Activations: ~20-40 GB (varies with batch size)
──────────────────────────────
Total VRAM required: ~130-150 GB
```
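The same arithmetic as a quick sanity check in Python (the activation term is a rough, batch- and sequence-length-dependent assumption):

```python
# Back-of-the-envelope VRAM estimate for fully fine-tuning a 7B model.
params = 7e9
weights = params * 4 / 1e9     # FP32 weights          -> 28 GB
optimizer = params * 8 / 1e9   # Adam moment estimates -> 56 GB
gradients = params * 4 / 1e9   # gradients             -> 28 GB

for activations in (20, 40):   # rough activation range in GB
    total = weights + optimizer + gradients + activations
    print(f"activations ~{activations} GB -> total ~{total:.0f} GB")
# activations ~20 GB -> total ~132 GB
# activations ~40 GB -> total ~152 GB
```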

That demands two A100 80GB cards. Cloud cost: $6-8 per hour.

The key question here: do we really need to update all 7B parameters?

LoRA: The Core Idea

LoRA's central insight is surprisingly simple:

The weight update $\Delta W$ produced by fine-tuning has low rank.

Mathematically:

$$W' = W_0 + \Delta W = W_0 + BA$$

Where:

  • $W_0$: pre-trained original weights (frozen, not trained)
  • $B \in \mathbb{R}^{d \times r}$: small matrix
  • $A \in \mathbb{R}^{r \times k}$: small matrix
  • $r$: rank (typically 8–64, much smaller than the original dimension $d$)

For example, in an attention layer where $d = 4096$ and $k = 4096$:

| | Parameter Count | Ratio |
|---|---|---|
| Original $W$ | 4096 × 4096 = 16.7M | 100% |
| LoRA ($r=16$) | 4096 × 16 + 16 × 4096 = 131K | 0.78% |

Instead of training 16.7M parameters, we train only 131K. A 99.2% reduction.
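To make that bookkeeping concrete, here is a minimal, illustrative LoRA-style linear layer in plain PyTorch. The class name is ours and this is a sketch of the idea, not the PEFT implementation used later; the initialization (random A, zero B) and the alpha/r scaling follow the original paper's convention:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base weight W0 plus a trainable low-rank update scaled by alpha/r."""

    def __init__(self, d: int, k: int, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = nn.Linear(k, d, bias=False)           # W0 (frozen)
        self.base.weight.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # small random init
        self.B = nn.Parameter(torch.zeros(d, r))          # zero init, so delta W starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W0 x + (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d=4096, k=4096, r=16, alpha=32)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"{trainable:,}")  # 131,072 -> the 131K from the table above
```

Because B is initialized to zero, the adapted layer behaves exactly like the base layer at step 0; training only ever moves the small A and B matrices while $W_0$ stays frozen.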

Why Does This Work?

Let's build some intuition. A pre-trained LLM already knows the fundamental structure of language. Fine-tuning is about *nudging* that existing knowledge, not learning from scratch.

Think of it this way: a fluent English speaker learning medical terminology does not need to relearn the entire language. They just need to pick up new vocabulary and expression patterns. That "delta" is low-rank.

Key LoRA Hyperparameters

| Parameter | Meaning | Recommended Range | Description |
|---|---|---|---|
| r (rank) | Decomposition dimension | 8–64 | Higher = more expressive, more memory |
| lora_alpha | Scaling factor | 1–2× r | Indirectly controls learning rate |
| target_modules | Layers to adapt | q_proj, v_proj | Attention layers are most effective |
| lora_dropout | Dropout rate | 0.05–0.1 | Prevents overfitting |

lora_alpha / r is the effective scaling ratio. With alpha=32, r=16, the scale is 2, meaning LoRA's contribution is amplified by 2×.
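In other words, the scale enters the adapted forward pass the way the original LoRA paper defines it:

$$h = W_0 x + \frac{\alpha}{r} B A x$$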

Hands-on: LoRA Fine-tuning Qwen 2.5 7B

Time to turn theory into code. The full code is available in the accompanying Jupyter notebook.

Environment Setup

```python
!pip install -q transformers peft datasets accelerate bitsandbytes trl
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer
```

Loading the Model

```python
model_name = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

We load Qwen 2.5 7B Instruct in bfloat16. This alone uses roughly 14 GB of VRAM.
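If you want to verify that figure on your own hardware, a quick check (this assumes a single CUDA device):

```python
# Rough check of how much VRAM the bf16 weights occupy right after loading.
print(f"{torch.cuda.memory_allocated() / 1024**3:.1f} GiB allocated")
```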

Configuring LoRA

```python
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 7,628,554,240 || trainable%: 0.1787
```

Out of 7.6B total parameters, only 13.6 million (0.18%) are trainable.

Note that target_modules includes not just attention layers (q/k/v/o_proj) but also FFN layers (gate/up/down_proj). Recent research shows that including FFN layers yields better performance.
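If you adapt a different checkpoint and are unsure what its projection layers are called, you can list the module names before calling get_peft_model. This is a quick, illustrative inspection; the "proj" substring filter is just a convenience:

```python
# Run on the freshly loaded base model (before get_peft_model) to see which
# Linear projection layers are available as target_modules candidates.
proj_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear) and "proj" in name
}
print(sorted(proj_names))
# For Qwen 2.5: ['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj']
```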

That's the core of LoRA setup. Only 0.18% of 7.6B parameters are trainable — that's all there is to it.

The next steps, sketched in code right after this list, are:

  1. Dataset preparation — format data to match Qwen's chat template
  2. Training — run actual training with SFTTrainer (~25 min on RTX 3090)
  3. Save the adapter — one 52 MB adapter file, done
  4. Inference — load base model + adapter → ready to use
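To give a feel for what those steps look like, here is a condensed, illustrative sketch. The dataset name is a placeholder, the hyperparameters are not tuned, and the exact SFTTrainer/SFTConfig argument names vary between trl versions; Part 2 walks through the real pipeline:

```python
from trl import SFTConfig  # recent trl versions; older ones take TrainingArguments

# 1. Dataset preparation: assumes each row has a "messages" list in chat format,
#    rendered with Qwen's chat template into a single "text" field.
raw = load_dataset("your-org/your-sft-dataset", split="train")  # placeholder dataset
dataset = raw.map(
    lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)}
)

# 2. Training: only the 13.6M LoRA parameters receive gradients.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen25-7b-lora",
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()

# 3. Save only the adapter weights (tens of MB, not the full model).
trainer.model.save_pretrained("qwen25-7b-lora-adapter")

# 4. Inference: reload the base model and attach the adapter (see the FAQ below).
```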

In Part 2, we run this entire pipeline with QLoRA (4-bit quantization). It works on a T4 with just 16 GB of VRAM, and we build a Korean dataset from scratch to measurably improve the model's Korean performance.

Memory Comparison: Full FT vs LoRA

Measured on the same Qwen 2.5 7B with the same dataset.

| | Full Fine-tuning | LoRA (r=16) | Reduction |
|---|---|---|---|
| Trainable parameters | 7.6B | 13.6M | 99.8% |
| VRAM usage | ~130 GB | ~18 GB | 86% |
| Required GPU | A100 80GB × 2 | RTX 3090 × 1 | |
| Adapter size | 14 GB (full model) | 52 MB | 99.6% |
| Training time (5K samples) | ~2 hrs (8×A100) | ~25 min (1×A100) | |
| Inference quality | Baseline | ~98% of Full FT | |

LoRA wins by a wide margin in efficiency. Pay special attention to the "~98% inference quality" row. There is no reason to spend 10× the cost for a 2% performance gap.

Frequently Asked Questions

Q: Is a higher rank always better?

No. The difference between rank=8 and rank=64 is marginal for most tasks. In fact, excessively high rank increases the risk of overfitting. As a rule of thumb:

  • Simple tasks (tone adjustment, format compliance): r=8
  • Medium tasks (domain adaptation): r=16–32
  • Complex tasks (injecting new knowledge): r=32–64

Q: Which layers should I apply LoRA to?

Applying LoRA to all attention + FFN layers is the safest bet. The original paper used only q_proj and v_proj, but subsequent studies have shown that targeting more layers consistently improves performance.

Q: Does LoRA degrade the base model's capabilities?

LoRA never touches the original weights. Remove the adapter, and you get the original model back. This is one of LoRA's greatest strengths.
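A short sketch of what that looks like with PEFT, assuming the adapter directory from the training sketch above:

```python
from peft import PeftModel

# Reload the untouched base model.
base = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the LoRA adapter; the base weights themselves are never modified.
tuned = PeftModel.from_pretrained(base, "qwen25-7b-lora-adapter")

# Temporarily run the original model again:
with tuned.disable_adapter():
    ...  # outputs here come from the plain Qwen 2.5 weights

# Or fold the adapter into the weights for deployment (merging is covered in Part 3):
merged = tuned.merge_and_unload()
```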

Next Up: QLoRA + Custom Dataset + Full Training Pipeline

Part 1 covered LoRA theory and configuration. Part 2 is where the real work happens:

  • QLoRA — 4-bit quantization to fine-tune 7B on a T4 with just 16 GB VRAM
  • Building a custom dataset — formatting domain data into chat template format
  • Training + Wandb monitoring — interpreting loss curves, detecting overfitting
  • Save → Inference → Before/After comparison — verifying actual performance gains

Part 3 covers evaluation (Perplexity, KoBEST benchmark), LoRA weight merging, and deployment with vLLM/Ollama.
