QLoRA + Custom Dataset — Fine-tune 7B on a Single T4 GPU

In Part 1, we covered the theory behind LoRA and fine-tuned Qwen 2.5 7B. That required about 18GB of VRAM on an RTX 3090 (24GB). In this post, we use QLoRA to bring that down to a single T4 with 16GB, and build a Korean-language dataset to meaningfully improve the model's Korean response quality.
Series: Part 1: LoRA Theory | Part 2 (this post) | Part 3: Eval + Deploy
QLoRA: Breaking Through the Memory Barrier
If LoRA reduced trainable parameters by 99.8%, QLoRA goes further and reduces the memory footprint of the model itself.
Quantize model weights to 4-bit for storage, but train the LoRA adapters in 16-bit precision.
Think of it this way: store every book in the library as a summary (4-bit), but write your new notes (LoRA) at full resolution (16-bit). The QLoRA paper (Dettmers et al., 2023) introduced three key techniques.
Three Core Techniques in QLoRA
1. 4-bit NormalFloat (NF4)
Pretrained model weights follow a normal distribution — values cluster around zero, and extreme values are rare. Standard 4-bit quantization maps values at uniform intervals, which is a poor fit for this distribution.
NF4 places quantization levels more densely around zero, matching the normal distribution. As a result, the information loss is less than half that of standard INT4.
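To see why a zero-clustered codebook helps, here is a minimal sketch. It is not the actual NF4 construction (which derives its 16 levels from normal quantiles); instead it uses Lloyd-Max iterations, a classic way to fit a codebook to a distribution, and compares the reconstruction error against a uniform 4-bit grid on Gaussian data:

```python
import random
import statistics

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def nearest(levels, x):
    """Snap x to the closest quantization level."""
    return min(levels, key=lambda q: abs(q - x))

def mse(levels):
    """Mean squared reconstruction error over the sample."""
    return statistics.fmean((w - nearest(levels, w)) ** 2 for w in weights)

# Uniform 4-bit grid: 16 evenly spaced levels over the data range.
lo, hi = min(weights), max(weights)
uniform = [lo + (hi - lo) * i / 15 for i in range(16)]

# Lloyd-Max refinement: move each level to the mean of the weights
# assigned to it. For bell-shaped data the levels drift toward zero,
# the same effect NF4 bakes in via normal quantiles.
levels = list(uniform)
for _ in range(10):
    buckets = {q: [] for q in levels}
    for w in weights:
        buckets[nearest(levels, w)].append(w)
    levels = sorted(statistics.fmean(b) if b else q
                    for q, b in buckets.items())

print(f"uniform grid MSE : {mse(uniform):.5f}")
print(f"zero-clustered MSE: {mse(levels):.5f}")
```

The tuned codebook always reconstructs the Gaussian weights with lower error than the uniform grid, which is the intuition behind NF4's quantile-based levels.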
2. Double Quantization
Quantization requires an FP32 scaling constant for each block (64 weights). For a 7B model, these constants alone take up roughly 0.5GB. Double Quantization re-quantizes these constants to 8-bit, bringing that down to about 0.13GB. On a 16GB GPU, this 370MB difference can determine whether training fits in memory or not.
3. Paged Optimizers
When GPU memory runs out during training, you get an OOM error. Paged Optimizers leverage NVIDIA unified memory to automatically page optimizer states out to CPU RAM — the same principle as virtual memory in an OS.
Memory Comparison: LoRA vs QLoRA
Measured with the same Qwen 2.5 7B and same settings (r=16, batch=2, seq_len=1024).
LoRA required an RTX 3090, but QLoRA runs on even a free-tier Colab T4.
Building a Korean Dataset
To turn a model into a Korean-language specialist, you need high-quality Korean instruction data. A curated set of 1,000 examples will outperform 10,000 sloppy ones.
Data Format: instruction-input-output
This is the most standard format for SFT (Supervised Fine-Tuning).
{
"instruction": "다음 문장을 존댓말로 바꿔주세요.",
"input": "이거 빨리 해.",
"output": "이것을 빨리 해주시겠어요?"
}
The input field is optional. When it is absent, the model works with the instruction alone:
{
"instruction": "대한민국의 수도는 어디인가요?",
"input": "",
"output": "대한민국의 수도는 서울특별시입니다."
}
Note: The examples above are in Korean. The first asks to convert a casual sentence into polite/formal speech. The second asks "What is the capital of South Korea?" with the answer "The capital of South Korea is Seoul."
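Before training, it is worth validating every record against this schema. A minimal sketch (field names follow the examples above; validate_example is an illustrative helper, not part of any library):

```python
# Minimal validator for the instruction-input-output format.
def validate_example(ex: dict) -> bool:
    """instruction and output are required non-empty strings;
    input is optional and may be an empty string."""
    if not isinstance(ex.get("instruction"), str) or not ex["instruction"].strip():
        return False
    if not isinstance(ex.get("output"), str) or not ex["output"].strip():
        return False
    if "input" in ex and not isinstance(ex["input"], str):
        return False
    return True

ok = {"instruction": "대한민국의 수도는 어디인가요?", "input": "",
      "output": "대한민국의 수도는 서울특별시입니다."}
bad = {"instruction": "", "output": "..."}
print(validate_example(ok))   # True
print(validate_example(bad))  # False
```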
Public Korean Datasets
In this post, we use kyujinpy/KOR-OpenOrca-Platypus-v3 for its balance of versatility and size.
Data Quality Guidelines
Apply these criteria when building or filtering your data.
- Diversity: Include at least 10 categories — summarization, translation, coding, analysis, creative writing, math, common sense, etc. A skewed dataset produces a biased model
- Response length: If your data only contains short answers, the model will only give short answers. Mix various lengths, aiming for an average of 150–300 tokens
- Natural language quality: Remove machine-translation artifacts, correct literal translations, and maintain consistent formality/register throughout
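The length and deduplication guidelines can be applied mechanically. A sketch of such a filter (the thresholds here are illustrative assumptions, not values from this post):

```python
import re

def quality_filter(examples: list[dict]) -> list[dict]:
    """Drop too-short/too-long outputs and duplicate instructions
    (illustrative thresholds)."""
    seen = set()
    kept = []
    for ex in examples:
        out = ex["output"].strip()
        # Length: drop one-word answers and extreme outliers.
        if len(out) < 10 or len(out) > 4000:
            continue
        # Dedup on whitespace-normalized instruction text.
        key = re.sub(r"\s+", " ", ex["instruction"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept

data = [
    {"instruction": "Summarize A", "output": "A short but complete summary."},
    {"instruction": "Summarize A", "output": "Duplicate instruction, dropped."},
    {"instruction": "Translate B", "output": "No."},
]
print(len(quality_filter(data)))  # 1
```

Category diversity is harder to automate and usually needs a classifier or manual tagging pass.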
QLoRA Hands-On Code
Full code that runs on a T4 16GB.
Setup
!pip install -q transformers peft datasets accelerate bitsandbytes trl wandb
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from datasets import load_dataset
from trl import SFTTrainer
4-bit Quantization Config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Enable 4-bit quantization
bnb_4bit_quant_type="nf4", # Use NormalFloat4
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16 (the T4 lacks BF16 — use torch.float16 there)
bnb_4bit_use_double_quant=True, # Double Quantization
)
The inline comments explain each parameter. The key idea is: storage in 4-bit, computation in 16-bit.
Load the Model
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
# Prepare model for QLoRA (gradient checkpointing + cast input/output layers to FP32)
model = prepare_model_for_kbit_training(model)
prepare_model_for_kbit_training() handles gradient checkpointing, keeps input/output layers in FP32, and freezes the quantized layers — all in one call. At this point, VRAM usage is around 4.2GB. Compare that to 14GB in Part 1.
LoRA Config
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 3,947,921,408 || trainable%: 0.3452
Same LoRA config as Part 1. The total parameter count reads 3.9B instead of 7.6B because the 4-bit weights are stored packed (two values per byte), so each quantized parameter is counted as half.
Load and Preprocess the Korean Dataset
dataset = load_dataset("kyujinpy/KOR-OpenOrca-Platypus-v3", split="train")
dataset = dataset.shuffle(seed=42).select(range(3000))  # Sample 3,000 examples
def format_korean_chat(example):
"""Convert Korean instruction data to Qwen chat format"""
# Korean system prompt: "You are an AI assistant fluent in Korean.
# Please respond in natural and accurate Korean."
system_msg = "당신은 한국어에 능통한 AI 어시스턴트입니다. 자연스럽고 정확한 한국어로 답변하세요."
user_content = example["instruction"]
if example.get("input") and example["input"].strip():
user_content += f"\n\n{example['input']}"
messages = [
{"role": "system", "content": system_msg},
{"role": "user", "content": user_content},
{"role": "assistant", "content": example["output"]},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
return {"text": text}
dataset = dataset.map(format_korean_chat)
Training
training_args = TrainingArguments(
output_dir="./qwen25-qlora-korean",
num_train_epochs=2,
per_device_train_batch_size=2,
gradient_accumulation_steps=8, # Effective batch = 2 × 8 = 16
learning_rate=2e-4,
warmup_ratio=0.1,
weight_decay=0.01,
logging_steps=10,
save_strategy="steps",
save_steps=200,
save_total_limit=3,
bf16=True, # NOTE: the T4 does not support BF16 — on a T4, replace this with fp16=True
gradient_checkpointing=True,
optim="paged_adamw_8bit", # Use Paged Optimizer
lr_scheduler_type="cosine",
report_to="wandb", # Wandb integration
run_name="qwen25-qlora-korean-3k",
max_grad_norm=0.3, # Gradient clipping
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
dataset_text_field="text",
max_seq_length=1024,
packing=False,
)
trainer.train()
Compared with Part 1, the key changes are the 4-bit BitsAndBytesConfig, the paged_adamw_8bit optimizer, gradient checkpointing, and the tighter max_grad_norm.
On a T4 16GB, training on 3,000 examples for 2 epochs takes about 2 hours 30 minutes.
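That run corresponds to roughly 376 optimizer steps, which you can sanity-check with a quick calculation (a sketch; the Trainer's exact count can differ by a step depending on dataloader details, and the loss log later in the post ends around step 375):

```python
import math

# Estimate total optimizer steps for this run.
examples, epochs = 3000, 2
per_device_bs, grad_accum = 2, 8

batches_per_epoch = math.ceil(examples / per_device_bs)      # 1500 micro-batches
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)  # 188 optimizer steps
total_steps = steps_per_epoch * epochs
print(total_steps)  # 376
```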
Save and Inference
# Save the LoRA adapter
model.save_pretrained("./qwen25-qlora-korean-adapter")
tokenizer.save_pretrained("./qwen25-qlora-korean-adapter")
# Inference
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
model_name, quantization_config=bnb_config, device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./qwen25-qlora-korean-adapter")
# Korean system prompt: "You are an AI assistant fluent in Korean."
# User prompt: "Please explain blockchain technology to a non-technical person."
messages = [
{"role": "system", "content": "당신은 한국어에 능통한 AI 어시스턴트입니다."},
{"role": "user", "content": "블록체인 기술을 비전공자에게 설명해주세요."},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
LoRA vs QLoRA: Comprehensive Comparison
QLoRA is 30% slower because every operation requires dequantizing from 4-bit to BF16. But trading 30% speed for 44% memory savings is a good deal. The quality gap (98% vs 96%) is based on benchmarks — for Korean instruction-following specifically, data quality has a far greater impact than quantization loss.
Monitoring Training with Wandb
If the loss isn't decreasing or spikes unexpectedly, you need to adjust hyperparameters. Connecting Wandb lets you monitor in real time.
import wandb
wandb.init(
project="qlora-korean-finetuning",
name="qwen25-7b-korean-3k",
config={
"model": "Qwen/Qwen2.5-7B-Instruct",
"quantization": "NF4",
"lora_r": 16,
"lora_alpha": 32,
"dataset_size": 3000,
"epochs": 2,
"learning_rate": 2e-4,
}
)
Since we set report_to="wandb", calling trainer.train() automatically logs train/loss, train/learning_rate, and train/grad_norm.
Typical loss pattern for 3,000 examples over 2 epochs:
Step 10: loss=2.45, lr=4.0e-05 ← warmup phase
Step 50: loss=1.82, lr=1.8e-04 ← warmup complete, rapid descent
Step 100: loss=1.35, lr=2.0e-04 ← peak learning rate reached
Step 200: loss=1.08, lr=1.7e-04 ← cosine decay begins
Step 300: loss=0.92, lr=1.2e-04 ← steady decline
Step 375: loss=0.85, lr=5.0e-05 ← near end of training
A loss dropping below 1.0 signals that the model is learning Korean patterns well. If it drops below 0.5, suspect overfitting. Call wandb.finish() after training completes.
Troubleshooting
Coming Up Next: Evaluation + Deployment
In Part 3:
- Korean benchmark evaluation: Measure performance with KoBEST, KLUE, and custom evaluation sets
- vLLM deployment: Serve a quantized model + LoRA adapter in production
- Adapter merging: Merge the base model and LoRA adapter into one for optimized inference
- Practical tips: A guide to finding the optimal combination of learning rate, rank, and dataset size