QLoRA + Custom Dataset — Fine-tune 7B on a Single T4 GPU

In Part 1, we covered the theory behind LoRA and fine-tuned Qwen 2.5 7B. That required about 18GB of VRAM on an RTX 3090 (24GB). In this post, we use QLoRA to bring that down to a single T4 with 16GB, and build a Korean-language dataset to meaningfully improve the model's Korean response quality.
Series: Part 1: LoRA Theory | Part 2 (this post) | Part 3: Eval + Deploy
QLoRA: Breaking Through the Memory Barrier
If LoRA reduced trainable parameters by 99.8%, QLoRA goes further and reduces the memory footprint of the model itself.
Quantize model weights to 4-bit for storage, but train the LoRA adapters in 16-bit precision.
Think of it this way: store every book in the library as a summary (4-bit), but write your new notes (LoRA) at full resolution (16-bit). The QLoRA paper (Dettmers et al., 2023) introduced three key techniques.
Three Core Techniques in QLoRA
1. 4-bit NormalFloat (NF4)
Pretrained model weights follow a normal distribution — values cluster around zero, and extreme values are rare. Standard 4-bit quantization maps values at uniform intervals, which is a poor fit for this distribution.
NF4 places quantization levels more densely around zero, matching the normal distribution. As a result, the information loss is less than half that of standard INT4.
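To see why a zero-clustered codebook helps, here is a minimal sketch. It is not the actual NF4 construction (which derives its 16 levels from normal quantiles); instead it uses Lloyd-Max iterations, a classic way to fit a codebook to a distribution, and compares the reconstruction error against a uniform 4-bit grid on Gaussian data:

```python
import random
import statistics

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def nearest(levels, x):
    """Snap x to the closest quantization level."""
    return min(levels, key=lambda q: abs(q - x))

def mse(levels):
    """Mean squared reconstruction error over the sample."""
    return statistics.fmean((w - nearest(levels, w)) ** 2 for w in weights)

# Uniform 4-bit grid: 16 evenly spaced levels over the data range.
lo, hi = min(weights), max(weights)
uniform = [lo + (hi - lo) * i / 15 for i in range(16)]

# Lloyd-Max refinement: move each level to the mean of the weights
# assigned to it. For bell-shaped data the levels drift toward zero,
# the same effect NF4 bakes in via normal quantiles.
levels = list(uniform)
for _ in range(10):
    buckets = {q: [] for q in levels}
    for w in weights:
        buckets[nearest(levels, w)].append(w)
    levels = sorted(statistics.fmean(b) if b else q
                    for q, b in buckets.items())

print(f"uniform grid MSE : {mse(uniform):.5f}")
print(f"zero-clustered MSE: {mse(levels):.5f}")
```

The tuned codebook always reconstructs the Gaussian weights with lower error than the uniform grid, which is the intuition behind NF4's quantile-based levels.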
2. Double Quantization
Quantization requires an FP32 scaling constant for each block (64 weights). For a 7B model, these constants alone take up roughly 0.5GB. Double Quantization re-quantizes these constants to 8-bit, bringing that down to about 0.13GB. On a 16GB GPU, this 370MB difference can determine whether training fits in memory or not.
3. Paged Optimizers
When GPU memory runs out during training, you get an OOM error. Paged Optimizers leverage NVIDIA unified memory to automatically page optimizer states out to CPU RAM — the same principle as virtual memory in an OS.
Memory Comparison: LoRA vs QLoRA
Measured with the same Qwen 2.5 7B and same settings (r=16, batch=2, seq_len=1024).
LoRA required an RTX 3090, but QLoRA runs on even a free-tier Colab T4.
Building a Korean Dataset
To turn a model into a Korean-language specialist, you need high-quality Korean instruction data. A curated set of 1,000 examples will outperform 10,000 sloppy ones.
Data Format: instruction-input-output
This is the most standard format for SFT (Supervised Fine-Tuning).
{
"instruction": "다음 문장을 존댓말로 바꿔주세요.",
"input": "이거 빨리 해.",
"output": "이것을 빨리 해주시겠어요?"
}
The input field is optional. When it is absent, the model works with the instruction alone:
{
"instruction": "대한민국의 수도는 어디인가요?",
"input": "",
"output": "대한민국의 수도는 서울특별시입니다."
}
Note: The examples above are in Korean. The first asks to convert a casual sentence into polite/formal speech. The second asks "What is the capital of South Korea?" with the answer "The capital of South Korea is Seoul."
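Before training, it is worth validating every record against this schema. A minimal sketch (field names follow the examples above; validate_example is an illustrative helper, not part of any library):

```python
# Minimal validator for the instruction-input-output format.
def validate_example(ex: dict) -> bool:
    """instruction and output are required non-empty strings;
    input is optional and may be an empty string."""
    if not isinstance(ex.get("instruction"), str) or not ex["instruction"].strip():
        return False
    if not isinstance(ex.get("output"), str) or not ex["output"].strip():
        return False
    if "input" in ex and not isinstance(ex["input"], str):
        return False
    return True

ok = {"instruction": "대한민국의 수도는 어디인가요?", "input": "",
      "output": "대한민국의 수도는 서울특별시입니다."}
bad = {"instruction": "", "output": "..."}
print(validate_example(ok))   # True
print(validate_example(bad))  # False
```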
Public Korean Datasets
In this post, we use kyujinpy/KOR-OpenOrca-Platypus-v3 for its balance of versatility and size.
Data Quality Guidelines
Apply these criteria when building or filtering your data.
- Diversity: Include at least 10 categories — summarization, translation, coding, analysis, creative writing, math, common sense, etc. A skewed dataset produces a biased model
- Response length: If your data only contains short answers, the model will only give short answers. Mix various lengths, aiming for an average of 150–300 tokens
- Natural language quality: Remove machine-translation artifacts, correct literal translations, and maintain consistent formality/register throughout
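The length and deduplication guidelines can be applied mechanically. A sketch of such a filter (the thresholds here are illustrative assumptions, not values from this post):

```python
import re

def quality_filter(examples: list[dict]) -> list[dict]:
    """Drop too-short/too-long outputs and duplicate instructions
    (illustrative thresholds)."""
    seen = set()
    kept = []
    for ex in examples:
        out = ex["output"].strip()
        # Length: drop one-word answers and extreme outliers.
        if len(out) < 10 or len(out) > 4000:
            continue
        # Dedup on whitespace-normalized instruction text.
        key = re.sub(r"\s+", " ", ex["instruction"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept

data = [
    {"instruction": "Summarize A", "output": "A short but complete summary."},
    {"instruction": "Summarize A", "output": "Duplicate instruction, dropped."},
    {"instruction": "Translate B", "output": "No."},
]
print(len(quality_filter(data)))  # 1
```

Category diversity is harder to automate and usually needs a classifier or manual tagging pass.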
QLoRA Hands-On Code
Full code that runs on a T4 16GB.
Setup
!pip install -q transformers peft datasets accelerate bitsandbytes trl wandb
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from datasets import load_dataset
from trl import SFTTrainer
4-bit Quantization Config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Enable 4-bit quantization
bnb_4bit_quant_type="nf4", # Use NormalFloat4
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16 (the T4 lacks BF16 — use torch.float16 there)
bnb_4bit_use_double_quant=True, # Double Quantization
)
The inline comments explain each parameter. The key idea is: storage in 4-bit, computation in 16-bit.
Load the Model
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
# Prepare model for QLoRA (gradient checkpointing + cast input/output layers to FP32)
model = prepare_model_for_kbit_training(model)
prepare_model_for_kbit_training() handles gradient checkpointing, keeps input/output layers in FP32, and freezes the quantized layers — all in one call. At this point, VRAM usage is around 4.2GB. Compare that to 14GB in Part 1.
LoRA Config
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 3,947,921,408 || trainable%: 0.3452
Same LoRA config as Part 1. The total parameter count reads 3.9B instead of 7.6B because the 4-bit weights are stored packed (two values per byte), so each quantized parameter is counted as half.
Load and Preprocess the Korean Dataset
dataset = load_dataset("kyujinpy/KOR-OpenOrca-Platypus-v3", split="train")
dataset = dataset.shuffle(seed=42).select(range(3000))  # Sample 3,000 examples
def format_korean_chat(example):
"""Convert Korean instruction data to Qwen chat format"""
# Korean system prompt: "You are an AI assistant fluent in Korean.
# Please respond in natural and accurate Korean."
system_msg = "당신은 한국어에 능통한 AI 어시스턴트입니다. 자연스럽고 정확한 한국어로 답변하세요."
user_content = example["instruction"]
if example.get("input") and example["input"].strip():
user_content += f"\n\n{example['input']}"
messages = [
{"role": "system", "content": system_msg},
{"role": "user", "content": user_content},
{"role": "assistant", "content": example["output"]},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
return {"text": text}
dataset = dataset.map(format_korean_chat)
Training
training_args = TrainingArguments(
output_dir="./qwen25-qlora-korean",
num_train_epochs=2,
per_device_train_batch_size=2,
gradient_accumulation_steps=8, # Effective batch = 2 × 8 = 16
learning_rate=2e-4,
warmup_ratio=0.1,
weight_decay=0.01,
logging_steps=10,
save_strategy="steps",
save_steps=200,
save_total_limit=3,
bf16=True, # NOTE: the T4 does not support BF16 — on a T4, replace this with fp16=True
gradient_checkpointing=True,
optim="paged_adamw_8bit", # Use Paged Optimizer
lr_scheduler_type="cosine",
report_to="wandb", # Wandb integration
run_name="qwen25-qlora-korean-3k",
max_grad_norm=0.3, # Gradient clipping
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
dataset_text_field="text",
max_seq_length=1024,
packing=False,
)
trainer.train()
Compared with Part 1, the key changes are the 4-bit BitsAndBytesConfig, the paged_adamw_8bit optimizer, gradient checkpointing, and the tighter max_grad_norm.
On a T4 16GB, training on 3,000 examples for 2 epochs takes about 2 hours 30 minutes.
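That run corresponds to roughly 376 optimizer steps, which you can sanity-check with a quick calculation (a sketch; the Trainer's exact count can differ by a step depending on dataloader details, and the loss log later in the post ends around step 375):

```python
import math

# Estimate total optimizer steps for this run.
examples, epochs = 3000, 2
per_device_bs, grad_accum = 2, 8

batches_per_epoch = math.ceil(examples / per_device_bs)      # 1500 micro-batches
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)  # 188 optimizer steps
total_steps = steps_per_epoch * epochs
print(total_steps)  # 376
```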
Save and Inference
# Save the LoRA adapter
model.save_pretrained("./qwen25-qlora-korean-adapter")
tokenizer.save_pretrained("./qwen25-qlora-korean-adapter")
# Inference
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
model_name, quantization_config=bnb_config, device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./qwen25-qlora-korean-adapter")
# Korean system prompt: "You are an AI assistant fluent in Korean."
# User prompt: "Please explain blockchain technology to a non-technical person."
messages = [
{"role": "system", "content": "당신은 한국어에 능통한 AI 어시스턴트입니다."},
{"role": "user", "content": "블록체인 기술을 비전공자에게 설명해주세요."},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
LoRA vs QLoRA: Comprehensive Comparison
QLoRA is 30% slower because every operation requires dequantizing from 4-bit to BF16. But trading 30% speed for 44% memory savings is a good deal. The quality gap (98% vs 96%) is based on benchmarks — for Korean instruction-following specifically, data quality has a far greater impact than quantization loss.
Monitoring Training with Wandb
If the loss isn't decreasing or spikes unexpectedly, you need to adjust hyperparameters. Connecting Wandb lets you monitor in real time.
import wandb
wandb.init(
project="qlora-korean-finetuning",
name="qwen25-7b-korean-3k",
config={
"model": "Qwen/Qwen2.5-7B-Instruct",
"quantization": "NF4",
"lora_r": 16,
"lora_alpha": 32,
"dataset_size": 3000,
"epochs": 2,
"learning_rate": 2e-4,
}
)
Since we set report_to="wandb", calling trainer.train() automatically logs train/loss, train/learning_rate, and train/grad_norm.
Typical loss pattern for 3,000 examples over 2 epochs:
Step 10: loss=2.45, lr=4.0e-05 ← warmup phase
Step 50: loss=1.82, lr=1.8e-04 ← warmup complete, rapid descent
Step 100: loss=1.35, lr=2.0e-04 ← peak learning rate reached
Step 200: loss=1.08, lr=1.7e-04 ← cosine decay begins
Step 300: loss=0.92, lr=1.2e-04 ← steady decline
Step 375: loss=0.85, lr=5.0e-05 ← near end of training
A loss dropping below 1.0 signals that the model is learning Korean patterns well. If it drops below 0.5, suspect overfitting. Call wandb.finish() after training completes.
Troubleshooting
Coming Up Next: Evaluation + Deployment
In Part 3:
- Korean benchmark evaluation: Measure performance with KoBEST, KLUE, and custom evaluation sets
- vLLM deployment: Serve a quantized model + LoRA adapter in production
- Adapter merging: Merge the base model and LoRA adapter into one for optimized inference
- Practical tips: A guide to finding the optimal combination of learning rate, rank, and dataset size