
SDFT: Learning Without Forgetting via Self-Distillation

No complex RL needed. Models teach themselves to learn new skills while preserving existing capabilities.

TL;DR

  • Problem: Traditional SFT causes catastrophic forgetting when learning new tasks
  • Solution: SDFT (Self-Distillation Fine-Tuning)
  • Key insight: Model generates its own training signals conditioned on demonstrations (On-policy)
  • Result: Sequential skill accumulation without performance degradation

1. Why Does SFT Cause Forgetting?

The Limitation of Supervised Fine-Tuning

The fundamental issue with SFT:

python
Training data distribution ≠ Model's current output distribution

Aspect             | SFT        | Problem
-------------------|------------|--------------------------------------------
Learning type      | Off-policy | Learns only from ground truth
Distribution shift | Large      | Gap between model output and training data
Outcome            | Forgetting | New data overwrites existing distribution

Off-policy vs On-policy

Off-policy (SFT):

python
Model output: "The cat sat on the..."
Ground truth: "A feline rested upon the..."
→ Forces learning regardless of model's natural output

On-policy (SDFT):

python
Model output: "The cat sat on the mat."
Training signal: Generated from model's own output
→ Improves while maintaining current distribution

2. SDFT: Self-Distillation Fine-Tuning

Core Idea

Show the model demonstrations, then let it generate its own training data based on that knowledge.

How It Works

python
1. Provide Demonstrations
   [example input] → [example output]

2. Model generates conditionally
   π(y|x, demonstrations)

3. Self-learning from generated data
   Model output → Training signal
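
A minimal sketch of these three steps, assuming a plain-text prompt format. The demonstration strings and the lm_sample callable below are illustrative placeholders, not part of the paper:

python
# Conceptual walk-through of the three steps; lm_sample stands in for any
# sampling call (e.g. a wrapper around model.generate) and is an assumption.
def sdft_example(lm_sample):
    # Step 1: provide demonstrations of the target behaviour.
    demonstrations = (
        "Input: Rewrite formally: 'gonna grab food'\n"
        "Output: I am going to get something to eat.\n"
    )
    new_input = "Input: Rewrite formally: 'see ya later'\nOutput:"

    # Step 2: the model generates conditioned on the demonstrations.
    y_generated = lm_sample(demonstrations + "\n" + new_input, temperature=1.0)

    # Step 3: (new_input, y_generated) becomes the self-generated training pair.
    return new_input, y_generated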

Mathematical Formulation

Traditional SFT objective:

$$\mathcal{L}_\text{SFT} = -\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\log \pi_\theta(y \mid x)\big]$$

SDFT objective:

$$\mathcal{L}_\text{SDFT} = -\mathbb{E}_{x\sim\mathcal{D}}\big[\log \pi_\theta(y \mid x)\big] \quad \text{where } y \sim \pi_\theta(\cdot \mid x, \text{demos})$$

Key difference: $y$ is generated by the model itself, not a fixed ground truth.
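
The same difference, sketched as code. completion_nll assumes a causal LM in the Hugging Face style (forward pass returns .logits); the commented SFT/SDFT lines show which target each method plugs in:

python
import torch
import torch.nn.functional as F

def completion_nll(model, x_ids, y_ids):
    # -log pi_theta(y | x): score completion y_ids given prompt x_ids only.
    input_ids = torch.cat([x_ids, y_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[0]
    # logits[t] predicts token t+1, so align y_ids with the shifted logits.
    y_logits = logits[x_ids.shape[-1] - 1 : -1]
    return F.cross_entropy(y_logits, y_ids)

# SFT:  y is the fixed ground-truth completion from the dataset.
#       loss = completion_nll(model, x_ids, y_ground_truth_ids)
# SDFT: y is sampled by the model itself, conditioned on the demonstrations,
#       then scored WITHOUT the demonstrations in the prompt.
#       loss = completion_nll(model, x_ids, y_self_generated_ids)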

3. Why SDFT Works

The Importance of Distribution Preservation

python
SFT:  P_model → P_new_data (destroys existing distribution)
SDFT: P_model → P_model + Δ (preserves while improving)
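
One way to make "abrupt vs gradual" measurable is the KL divergence between the base and fine-tuned models' next-token distributions on held-out prompts. This metric is an illustration, not an experiment from the paper; both models are assumed to be Hugging Face-style causal LMs (forward returns .logits):

python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_kl(base_model, tuned_model, input_ids):
    # KL(base || tuned), averaged over token positions: a rough measure of how
    # far fine-tuning has moved the model away from its original distribution.
    p = F.log_softmax(base_model(input_ids).logits, dim=-1)
    q = F.log_softmax(tuned_model(input_ids).logits, dim=-1)
    return (p.exp() * (p - q)).sum(dim=-1).mean()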

Advantages of Self-Distillation

Aspect              | SFT           | SDFT
--------------------|---------------|---------------
Training signal     | External data | Self-generated
Distribution change | Abrupt        | Gradual
Prior capabilities  | Lost          | Preserved
Reward function     | Not needed    | Not needed

Comparison with RL

Reinforcement learning is also on-policy, but:

  • RL: Requires reward function (difficult to design)
  • SDFT: No reward function needed (just demonstrations)

4. Experimental Results

Continual Learning Performance

Key findings from the paper:

Metric                  | SFT               | SDFT
------------------------|-------------------|------------------------
New task accuracy       | Baseline          | Higher
Prior task retention    | Sharp decline     | Mostly preserved
Multi-task accumulation | Performance decay | Continuous improvement

Sequential Learning Scenario

python
Task sequence: Task A → Task B → Task C

SFT results:
- Task A: 90% → 60% → 40%  (continuous decline)
- Task B: — → 85% → 55%
- Task C: — → — → 80%

SDFT results:
- Task A: 90% → 88% → 86%  (nearly maintained)
- Task B: — → 85% → 83%
- Task C: — → — → 82%
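
A sketch of how such a table can be produced: train on each task in turn and re-evaluate every task seen so far after each stage. train_on_task and evaluate are hypothetical helpers standing in for your own training (SFT or SDFT) and evaluation code:

python
def sequential_learning_report(model, tasks, train_on_task, evaluate):
    # tasks: list of (name, train_data, eval_data) tuples.
    history = {}
    for i, (name, train_data, _) in enumerate(tasks):
        train_on_task(model, train_data)              # one SFT or SDFT stage
        for seen_name, _, eval_data in tasks[: i + 1]:
            history.setdefault(seen_name, []).append(evaluate(model, eval_data))
    return history  # e.g. {"Task A": [0.90, 0.88, 0.86], ...}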

5. Implementation Concept

Pseudo-code

python
import torch

# Note: `model.generate(context=...)` and `model.log_prob(...)` are a
# conceptual interface for this pseudo-code, not a real library API.
def sdft_training_step(model, input_x, demonstrations):
    # 1. Generate the model's own output, conditioned on the demonstrations
    #    (no gradients flow through generation)
    with torch.no_grad():
        # In-context generation: y ~ pi(. | x, demonstrations)
        y_generated = model.generate(
            input_x,
            context=demonstrations,
            temperature=1.0,
        )

    # 2. Train on the generated output (on-policy). The loss conditions on
    #    input_x alone, so the behaviour elicited by the demonstrations is
    #    distilled into the model's weights.
    loss = -model.log_prob(y_generated, given=input_x)

    return loss

Key Components

  1. Demonstration conditioning: Show examples to the model
  2. Self-generation: Model produces its own outputs
  3. On-policy learning: Train on self-generated outputs
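
Putting the three components together, here is a more concrete sketch using the Hugging Face transformers API. The model name, prompt format, and hyperparameters are illustrative assumptions, not the paper's setup:

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative choice; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sdft_step(demonstrations: str, x: str) -> float:
    # 1. Demonstration conditioning: generate with the demos in the prompt.
    gen_inputs = tokenizer(demonstrations + "\n" + x, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**gen_inputs, do_sample=True, temperature=1.0,
                             max_new_tokens=128,
                             pad_token_id=tokenizer.eos_token_id)
    y_ids = out[0, gen_inputs["input_ids"].shape[1]:]  # keep only new tokens

    # 2./3. On-policy learning: score the self-generated y given x alone,
    #       so the demonstrated behaviour is distilled into the weights.
    x_ids = tokenizer(x, return_tensors="pt")["input_ids"][0]
    input_ids = torch.cat([x_ids, y_ids]).unsqueeze(0)
    labels = input_ids.clone()
    labels[0, : x_ids.shape[0]] = -100  # only the generated completion is scored

    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()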

6. Limitations and Considerations

Current Limitations

Limitation            | Description
----------------------|----------------------------------------
Demonstration quality | Requires good examples
Generation cost       | Need to generate at each training step
Base model dependency | Model needs some initial capability

Practical Considerations

  • Demonstration selection: Need representative and diverse examples
  • Generation diversity: Adjust temperature for varied outputs
  • Learning rate: An overly high learning rate can still cause forgetting

7. Practical Implications

When to Use SDFT?

Good fit:

  • Continuous model updates required
  • Domain adaptation while preserving general capabilities
  • Reward function design is difficult

Not suitable:

  • Learning completely new capabilities (no related knowledge)
  • Fast one-shot adaptation needed

Future Outlook

SDFT could become a core technique for continuous LLM updates.

  • Model updates: No full retraining needed when adding knowledge
  • Domain adaptation: Maintain general abilities when specializing
  • Safety: Preserve safety training results

Key Takeaways

Concept                   | Description
--------------------------|------------------------------------------------------------------
Catastrophic Forgetting   | New learning overwrites existing knowledge
Off-policy (SFT)          | Learning from an external data distribution → causes forgetting
On-policy (SDFT)          | Learning from the model's own distribution → prevents forgetting
Self-Distillation         | The model teaches itself
Demonstration-conditioned | Examples guide the model's output
