QLoRA + Custom Dataset — Fine-tune 7B on a Single T4 GPU
Fine-tune a 7B model on a 16GB T4 with QLoRA: dataset construction, training execution, Wandb monitoring, and a before/after comparison.

In Part 1, we covered the theory behind LoRA and fine-tuned Qwen 2.5 7B. That required about 18GB of VRAM on an RTX 3090 (24GB). In this post, we use QLoRA to bring that down to a single T4 with 16GB, and build a Korean-language dataset to meaningfully improve the model's Korean response quality.
Series: Part 1: LoRA Theory | Part 2 (this post) | Part 3: Eval + Deploy
QLoRA: Breaking Through the Memory Barrier
Where LoRA reduced the number of trainable parameters by 99.8%, QLoRA goes further and shrinks the memory footprint of the base model itself: the frozen base weights are loaded in 4-bit precision, while the LoRA adapters are trained on top in higher precision.
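As a minimal sketch of what that looks like in practice, assuming the standard Hugging Face transformers + bitsandbytes stack, the model is loaded with a 4-bit NF4 quantization config before any adapters are attached. The Qwen/Qwen2.5-7B-Instruct model ID is an assumption here (matching the Qwen 2.5 7B from Part 1), and float16 is chosen as the compute dtype because the T4 has no bfloat16 support:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization: the base model's weights are stored in 4-bit,
# so a 7B model takes roughly 4-5GB of VRAM instead of ~14GB in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, the QLoRA data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,  # T4 lacks bfloat16, so compute in fp16
)

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed HF model ID for Qwen 2.5 7B
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

From there, LoRA adapters are added on top (e.g. via peft) and trained in fp16 while the quantized base weights stay frozen, which is what makes the whole run fit on a single 16GB T4.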