SDFT: Learning Without Forgetting via Self-Distillation
No complex RL needed: models teach themselves new skills while preserving existing capabilities.
TL;DR
- Problem: Traditional SFT causes catastrophic forgetting when learning new tasks
- Solution: SDFT (Self-Distillation Fine-Tuning)
- Key insight: Model generates its own training signals conditioned on demonstrations (On-policy)
- Result: Sequential skill accumulation without performance degradation
1. Why Does SFT Cause Forgetting?
The Limitation of Supervised Fine-Tuning
The fundamental issue with SFT:
Training data distribution ≠ Model's current output distribution
Off-policy vs On-policy
Off-policy (SFT):
Model output: "The cat sat on the..."
Ground truth: "A feline rested upon the..."
→ Forces learning regardless of model's natural output
On-policy (SDFT):
Model output: "The cat sat on the mat."
Training signal: Generated from model's own output
→ Improves while maintaining current distribution
2. SDFT: Self-Distillation Fine-Tuning
Core Idea
Show the model demonstrations, then let it generate its own training data based on that knowledge.
How It Works
1. Provide Demonstrations
[example input] → [example output]
2. Model generates conditionally
π(y|x, demonstrations)
3. Self-learning from generated data
Model output → Training signal
Mathematical Formulation
Traditional SFT objective:
$$\mathcal{L}_\text{SFT} = -\mathbb{E}_{(x,y)\sim\mathcal{D}}[\log \pi_\theta(y|x)]$$
SDFT objective:
$$\mathcal{L}_\text{SDFT} = -\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot|x,\,\text{demos})}[\log \pi_\theta(y|x)]$$
Key difference: $y$ is generated by the model itself, not a fixed ground truth.
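To make the demonstration conditioning concrete, here is a minimal sketch of the sampling step $y \sim \pi_\theta(\cdot|x, \text{demos})$. It assumes a Hugging Face causal LM and a simple "input -> output" demonstration format; the model name and prompt layout are placeholders for illustration, not the paper's exact recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sample_self_distillation_target(x, demonstrations, max_new_tokens=64):
    """Sample y ~ pi_theta(. | x, demos) by placing the demonstrations in-context."""
    prompt = "\n".join(f"{d_in} -> {d_out}" for d_in, d_out in demonstrations)
    prompt += f"\n{x} -> "
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            do_sample=True,      # on-policy: sample from the model itself
            temperature=1.0,
            max_new_tokens=max_new_tokens,
        )
    # Keep only the newly generated continuation y (drop the demo-conditioned prompt)
    y_ids = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(y_ids, skip_special_tokens=True)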
3. Why SDFT Works
The Importance of Distribution Preservation
SFT: P_model → P_new_data (destroys existing distribution)
SDFT: P_model → P_model + Δ (preserves while improving)
Advantages of Self-Distillation
Comparison with RL
Reinforcement learning is also on-policy, but:
- RL: Requires reward function (difficult to design)
- SDFT: No reward function needed (just demonstrations)
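One way to see the distribution-preservation claim in practice (my own illustrative check, not an experiment from the paper) is to measure how far a fine-tuned model's next-token distributions drift from the base model on held-out prompts:

import torch
import torch.nn.functional as F

def average_token_kl(base_model, tuned_model, tokenizer, prompts):
    """Mean KL( pi_base || pi_tuned ) over next-token distributions on held-out prompts."""
    kls = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
        with torch.no_grad():
            base_logits = base_model(ids).logits    # [1, seq_len, vocab]
            tuned_logits = tuned_model(ids).logits
        base_logp = F.log_softmax(base_logits, dim=-1)
        tuned_logp = F.log_softmax(tuned_logits, dim=-1)
        # KL(base || tuned) per token, averaged over the sequence
        kl = (base_logp.exp() * (base_logp - tuned_logp)).sum(-1).mean()
        kls.append(kl.item())
    return sum(kls) / len(kls)

A large average KL indicates the fine-tuned model has moved far from its original distribution (the SFT failure mode); a small value indicates the kind of distribution preservation SDFT aims for.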
4. Experimental Results
Continual Learning Performance
Key findings from the paper:
Sequential Learning Scenario
Task sequence: Task A → Task B → Task C
SFT results:
- Task A: 90% → 60% → 40% (continuous decline)
- Task B: — → 85% → 55%
- Task C: — → — → 80%
SDFT results:
- Task A: 90% → 88% → 86% (nearly maintained)
- Task B: — → 85% → 83%
- Task C: — → — → 82%
5. Implementation Concept
Pseudo-code
import torch

def sdft_training_step(model, input_x, demonstrations):
    # 1. Sample an output from the current model, conditioned on the
    #    demonstrations in-context. No gradients flow through sampling.
    with torch.no_grad():
        y_generated = model.generate(
            input_x,
            context=demonstrations,   # demonstration conditioning
            temperature=1.0,
        )

    # 2. Train on the self-generated output (on-policy):
    #    maximize log pi_theta(y | x), without the demonstrations in the context.
    loss = -model.log_prob(y_generated, given=input_x)
    return loss
Key Components
- Demonstration conditioning: Show examples to the model
- Self-generation: Model produces its own outputs
- On-policy learning: Train on self-generated outputs
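For readers who want something executable, the sketch below is one possible concrete instantiation of the pseudo-code and the three components above. It assumes a Hugging Face causal LM, an "input -> output" prompt format, and illustrative hyperparameters; none of these choices come from the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative value

def concrete_sdft_step(x, demonstrations):
    # Demonstration conditioning + self-generation: sample y ~ pi_theta(. | x, demos)
    demo_text = "\n".join(f"{d_in} -> {d_out}" for d_in, d_out in demonstrations)
    demo_ids = tokenizer(f"{demo_text}\n{x} -> ", return_tensors="pt")["input_ids"]
    with torch.no_grad():
        gen = model.generate(demo_ids, do_sample=True, temperature=1.0, max_new_tokens=64)
    y_ids = gen[:, demo_ids.shape[1]:]               # keep only the generated answer

    # On-policy learning: maximize log pi_theta(y | x) WITHOUT the demonstrations
    x_ids = tokenizer(f"{x} -> ", return_tensors="pt")["input_ids"]
    input_ids = torch.cat([x_ids, y_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : x_ids.shape[1]] = -100               # score only the answer tokens
    loss = model(input_ids, labels=labels).loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()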
6. Limitations and Considerations
Current Limitations
Practical Considerations
- Demonstration selection: Need representative and diverse examples
- Generation diversity: Adjust the sampling temperature for varied outputs (see the sketch below)
- Learning rate: Too high a learning rate still causes forgetting
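As an illustration of the diversity knob (again a sketch, not the paper's recipe), one can sample several candidate self-distillation targets per input and vary the temperature; `prompt_with_demos` is assumed to be built as in the earlier sketches.

import torch

def sample_candidates(model, tokenizer, prompt_with_demos, temperature=1.0, k=4):
    # Higher temperature -> more varied self-distillation targets; lower -> near-greedy.
    ids = tokenizer(prompt_with_demos, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        outs = model.generate(
            ids,
            do_sample=True,
            temperature=temperature,   # diversity knob
            num_return_sequences=k,    # several candidates per input
            max_new_tokens=64,
        )
    return [tokenizer.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in outs]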
7. Practical Implications
When to Use SDFT?
✅ Good fit:
- Continuous model updates required
- Domain adaptation while preserving general capabilities
- Reward function design is difficult
❌ Not suitable:
- Learning completely new capabilities the model has no related knowledge to build on
- Fast one-shot adaptation needed
Future Outlook
SDFT could become a core technique for continuous LLM updates.
- Model updates: No full retraining needed when adding knowledge
- Domain adaptation: Maintain general abilities when specializing
- Safety: Preserve safety training results
Key Takeaways
- SFT trains off-policy on fixed targets, pulling the model away from its own output distribution and causing catastrophic forgetting
- SDFT shows the model demonstrations and trains it on outputs it generates itself, keeping learning on-policy
- Because the training distribution stays close to the model's own, new skills accumulate with little degradation of existing ones
- Unlike RL, SDFT needs no reward function, only demonstrations
References
- Paper: Self-Distillation Enables Continual Learning
- Project Page: https://self-distillation.github.io/SDFT.html
- Authors: Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal (MIT CSAIL)