SDFT: Learning Without Forgetting via Self-Distillation
No complex RL needed: models teach themselves new skills while preserving existing capabilities.
TL;DR
- Problem: Traditional SFT causes catastrophic forgetting when learning new tasks
- Solution: SDFT (Self-Distillation Fine-Tuning)
- Key insight: Model generates its own training signals conditioned on demonstrations (On-policy)
- Result: Sequential skill accumulation without performance degradation
1. Why Does SFT Cause Forgetting?
The Limitation of Supervised Fine-Tuning
The fundamental issue with SFT:
Training data distribution ≠ Model's current output distribution
Off-policy vs On-policy
Off-policy (SFT):
Model output: "The cat sat on the..."
Ground truth: "A feline rested upon the..."
→ Forces learning regardless of model's natural output
On-policy (SDFT):
Model output: "The cat sat on the mat."
Training signal: Generated from model's own output
→ Improves while maintaining current distribution
2. SDFT: Self-Distillation Fine-Tuning
Core Idea
Show the model demonstrations, then let it generate its own training data based on that knowledge.
How It Works
1. Provide Demonstrations
[example input] → [example output]
2. Model generates conditionally
π(y|x, demonstrations)
3. Self-learning from generated data
Model output → Training signal
Mathematical Formulation
Traditional SFT objective:
$$\mathcal{L}_\text{SFT} = -\mathbb{E}_{(x,y)\sim\mathcal{D}}[\log \pi_\theta(y|x)]$$
SDFT objective:
$$\mathcal{L}_\text{SDFT} = -\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot|x,\,\text{demos})}[\log \pi_\theta(y|x)]$$
Key difference: $y$ is generated by the model itself, not a fixed ground truth.
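To make the demonstration conditioning concrete, here is a minimal sketch of the sampling step $y \sim \pi_\theta(\cdot|x, \text{demos})$. It assumes a Hugging Face causal LM and a simple "input -> output" demonstration format; the model name and prompt layout are placeholders for illustration, not the paper's exact recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sample_self_distillation_target(x, demonstrations, max_new_tokens=64):
    """Sample y ~ pi_theta(. | x, demos) by placing the demonstrations in-context."""
    prompt = "\n".join(f"{d_in} -> {d_out}" for d_in, d_out in demonstrations)
    prompt += f"\n{x} -> "
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            do_sample=True,      # on-policy: sample from the model itself
            temperature=1.0,
            max_new_tokens=max_new_tokens,
        )
    # Keep only the newly generated continuation y (drop the demo-conditioned prompt)
    y_ids = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(y_ids, skip_special_tokens=True)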
3. Why SDFT Works
The Importance of Distribution Preservation
SFT: P_model → P_new_data (destroys existing distribution)
SDFT: P_model → P_model + Δ (preserves while improving)
Advantages of Self-Distillation
Comparison with RL
Reinforcement learning is also on-policy, but:
- RL: Requires reward function (difficult to design)
- SDFT: No reward function needed (just demonstrations)
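One way to see the distribution-preservation claim in practice (my own illustrative check, not an experiment from the paper) is to measure how far a fine-tuned model's next-token distributions drift from the base model on held-out prompts:

import torch
import torch.nn.functional as F

def average_token_kl(base_model, tuned_model, tokenizer, prompts):
    """Mean KL( pi_base || pi_tuned ) over next-token distributions on held-out prompts."""
    kls = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
        with torch.no_grad():
            base_logits = base_model(ids).logits    # [1, seq_len, vocab]
            tuned_logits = tuned_model(ids).logits
        base_logp = F.log_softmax(base_logits, dim=-1)
        tuned_logp = F.log_softmax(tuned_logits, dim=-1)
        # KL(base || tuned) per token, averaged over the sequence
        kl = (base_logp.exp() * (base_logp - tuned_logp)).sum(-1).mean()
        kls.append(kl.item())
    return sum(kls) / len(kls)

A large average KL indicates the fine-tuned model has moved far from its original distribution (the SFT failure mode); a small value indicates the kind of distribution preservation SDFT aims for.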
4. Experimental Results
Continual Learning Performance
Key findings from the paper:
Sequential Learning Scenario
Task sequence: Task A → Task B → Task C
SFT results:
- Task A: 90% → 60% → 40% (continuous decline)
- Task B: — → 85% → 55%
- Task C: — → — → 80%
SDFT results:
- Task A: 90% → 88% → 86% (nearly maintained)
- Task B: — → 85% → 83%
- Task C: — → — → 82%
5. Implementation Concept
Pseudo-code
import torch

def sdft_training_step(model, input_x, demonstrations):
    # 1. Sample an output from the current model, conditioned on the
    #    demonstrations in-context. No gradients flow through sampling.
    with torch.no_grad():
        y_generated = model.generate(
            input_x,
            context=demonstrations,   # demonstration conditioning
            temperature=1.0,
        )

    # 2. Train on the self-generated output (on-policy):
    #    maximize log pi_theta(y | x), without the demonstrations in the context.
    loss = -model.log_prob(y_generated, given=input_x)
    return loss
Key Components
- Demonstration conditioning: Show examples to the model
- Self-generation: Model produces its own outputs
- On-policy learning: Train on self-generated outputs
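For readers who want something executable, the sketch below is one possible concrete instantiation of the pseudo-code and the three components above. It assumes a Hugging Face causal LM, an "input -> output" prompt format, and illustrative hyperparameters; none of these choices come from the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative value

def concrete_sdft_step(x, demonstrations):
    # Demonstration conditioning + self-generation: sample y ~ pi_theta(. | x, demos)
    demo_text = "\n".join(f"{d_in} -> {d_out}" for d_in, d_out in demonstrations)
    demo_ids = tokenizer(f"{demo_text}\n{x} -> ", return_tensors="pt")["input_ids"]
    with torch.no_grad():
        gen = model.generate(demo_ids, do_sample=True, temperature=1.0, max_new_tokens=64)
    y_ids = gen[:, demo_ids.shape[1]:]               # keep only the generated answer

    # On-policy learning: maximize log pi_theta(y | x) WITHOUT the demonstrations
    x_ids = tokenizer(f"{x} -> ", return_tensors="pt")["input_ids"]
    input_ids = torch.cat([x_ids, y_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : x_ids.shape[1]] = -100               # score only the answer tokens
    loss = model(input_ids, labels=labels).loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()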
6. Limitations and Considerations
Current Limitations
Practical Considerations
- Demonstration selection: Need representative and diverse examples
- Generation diversity: Adjust the sampling temperature for varied outputs (see the sketch below)
- Learning rate: Too high a learning rate still causes forgetting
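As an illustration of the diversity knob (again a sketch, not the paper's recipe), one can sample several candidate self-distillation targets per input and vary the temperature; `prompt_with_demos` is assumed to be built as in the earlier sketches.

import torch

def sample_candidates(model, tokenizer, prompt_with_demos, temperature=1.0, k=4):
    # Higher temperature -> more varied self-distillation targets; lower -> near-greedy.
    ids = tokenizer(prompt_with_demos, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        outs = model.generate(
            ids,
            do_sample=True,
            temperature=temperature,   # diversity knob
            num_return_sequences=k,    # several candidates per input
            max_new_tokens=64,
        )
    return [tokenizer.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in outs]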
7. Practical Implications
When to Use SDFT?
✅ Good fit:
- Continuous model updates required
- Domain adaptation while preserving general capabilities
- Reward function design is difficult
❌ Not suitable:
- Learning completely new capabilities the model has no related knowledge to build on
- Fast one-shot adaptation needed
Future Outlook
SDFT could become a core technique for continuous LLM updates.
- Model updates: No full retraining needed when adding knowledge
- Domain adaptation: Maintain general abilities when specializing
- Safety: Preserve safety training results
Key Takeaways
- SFT trains off-policy on fixed targets, pulling the model away from its own output distribution and causing catastrophic forgetting
- SDFT shows the model demonstrations and trains it on outputs it generates itself, keeping learning on-policy
- Because the training distribution stays close to the model's own, new skills accumulate with little degradation of existing ones
- Unlike RL, SDFT needs no reward function, only demonstrations
References
- Paper: Self-Distillation Enables Continual Learning
- Project Page: https://self-distillation.github.io/SDFT.html
- Authors: Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal (MIT CSAIL)