Hybrid Mamba-Transformer MoE: Three Teams, One Architecture -- The 2026 LLM Convergence

NVIDIA Nemotron 3 Nano, Qwen 3.5, and Mamba-3 independently converge on 75% linear layers + 25% attention + MoE. 88% KV-cache reduction, O(n) complexity for long-context processing.

The Hybrid Mamba-Transformer-MoE Architecture: Three Teams, One Conclusion

In March 2026, something remarkable happened. Three independent teams -- NVIDIA, Alibaba (Qwen), and the Mamba research group -- arrived at the same architectural conclusion almost simultaneously.

"Neither pure Transformer nor pure SSM. Mix them at roughly 75% linear layers to 25% attention layers. Add MoE routing on top."

NVIDIA released Nemotron 3 Nano. Qwen shipped the 3.5 Small series. The Mamba team presented a theoretical framework (Mamba-3) at ICLR 2026. If one team had reached this conclusion, it could be coincidence. When three reach it at once, it signals a paradigm shift.

This post covers the background behind this convergence, the technical details of each architecture, and what it means for AI infrastructure going forward.

Background: The Cost Problem with Pure Transformers

Since the original GPT in 2018, the Transformer has been essentially the only LLM architecture. But as scale increases, two fundamental problems become harder to ignore.

1. Quadratic Complexity of Self-Attention

Self-attention -- the core of the Transformer -- requires O(n^2) computation with respect to sequence length n. Every token attends to every other token. The scaling looks like this:

Sequence Length | Attention Compute (relative)
1,024           | 1x
8,192           | 64x
32,768          | 1,024x
131,072         | 16,384x

8x longer means 64x more expensive. A 128K context requires over 16,000x more attention compute than 1K.
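The table's numbers fall directly out of the quadratic term, and are easy to check:

```python
# Relative self-attention cost: the score matrix is n x n, so compute grows as n^2.
base = 1024
for n in [1024, 8192, 32768, 131072]:
    print(f"{n:>7,} tokens -> {(n / base) ** 2:>9,.0f}x attention compute")
```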

2. KV-Cache Memory Explosion

During inference, Transformers must keep the Key-Value vectors for every previously generated token in memory (the KV-cache). This cache grows linearly with sequence length.

A 70B model processing a 128K context can consume tens of gigabytes of GPU memory on KV-cache alone. This limits batch sizes, caps concurrent users, and makes mobile or edge deployment essentially impossible.

These two problems motivated the State Space Model (SSM) line of research.

State Space Models Recap: The Road to Mamba

The core idea behind State Space Models is that "not every token needs to see every other token." Instead, information is compressed into a fixed-size state, which is updated sequentially as new tokens arrive.

An analogy: Transformers are like taking an exam with every textbook open on your desk (full reference). SSMs are like reading all the textbooks first, then answering from your summary notes (compressed state).

S4 (2022)

The Structured State Space Sequence Model discretized continuous-time state equations for sequence modeling. It achieved O(n) complexity and could handle long sequences, but it could not beat Transformers on language tasks. The state transition matrices were fixed (input-independent), making it weak at content-based reasoning.
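As a point of reference, the recurrence behind S4-style models can be sketched in a few lines of NumPy. The matrices below are random stand-ins for learned (and already discretized) parameters; the important point is that nothing in the update depends on the content of the input:

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Fixed (input-independent) SSM: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:              # one fixed-cost update per token -> O(n) overall
        h = A @ h + B * x_t    # A and B never look at the content of x_t
        ys.append(C @ h)       # readout from the compressed state
    return np.array(ys)

rng = np.random.default_rng(0)
d = 16
A = np.diag(rng.uniform(0.5, 0.99, size=d))   # stable diagonal transition
B, C = rng.normal(size=d), rng.normal(size=d)
y = ssm_scan(A, B, C, rng.normal(size=1000))  # 1,000-token scalar channel
```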

Mamba (2023)

Gu and Dao's Mamba tackled S4's biggest weakness head-on. It made the SSM parameters input-dependent (selective state spaces): the projections B and C and the discretization step size are computed from each token, which in turn makes the effective state transition content-dependent (A itself stays fixed). The model could now choose what to remember and what to forget based on content -- a minimal sketch follows the property list below.

Key properties:

  • O(n) complexity: Compute scales linearly with sequence length
  • Fixed-size state: No KV-cache. State size is independent of sequence length
  • Hardware-aware implementation: Optimized data movement between GPU SRAM and HBM
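Here is a minimal, illustrative sketch of the selectivity mechanism on a single scalar channel -- not Mamba's actual kernel. The parameters w_dt, W_B, and W_C are hypothetical stand-ins for learned projections, and real Mamba applies this per channel with a hardware-aware parallel scan rather than a Python loop:

```python
import numpy as np

def selective_scan(x, A, w_dt, W_B, W_C):
    """Selective SSM sketch (one scalar channel): dt, B, C depend on each input x_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                                # x_t: scalar token feature
        dt = np.log1p(np.exp(w_dt * x_t))        # softplus -> positive step size
        B_t, C_t = W_B * x_t, W_C * x_t          # input-dependent B and C
        h = np.exp(dt * A) * h + dt * B_t * x_t  # forget rate now depends on content
        ys.append(C_t @ h)
    return np.array(ys)

rng = np.random.default_rng(1)
d = 16
A = -rng.uniform(0.1, 1.0, size=d)               # negative diagonal -> stable decay
y = selective_scan(rng.normal(size=256), A,
                   w_dt=0.5, W_B=rng.normal(size=d), W_C=rng.normal(size=d))
```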

Mamba-2 (2024)

Mamba-2 reformulated the core computation in terms of semiseparable matrices and discovered the Structured State Space Duality (SSD). This proved a mathematical equivalence between Mamba's selective scan and linear attention. This theoretical discovery opened the door to hybrid architectures.

DeltaNet / Gated DeltaNet

DeltaNet is a linear attention variant that applies the delta rule (an error-correcting learning rule) to linear attention. Gated DeltaNet adds a gating mechanism for finer-grained control over information flow. Like Mamba-2, it has O(n) complexity while offering expressiveness closer to full attention.

The Hybrid Insight: Why Mixing Works Better

Pure SSMs (Mamba, DeltaNet, etc.) have the advantage of O(n) complexity, but a quality gap versus pure Transformers persisted. Why?

The Fixed-State Information Bottleneck

An SSM's state is a fixed-dimensional vector. No matter how long the sequence, the state size does not change. This is like having a limited number of pages in your summary notes. Some information is inevitably lost.

Tasks where this becomes a problem:

  • In-context learning: Tasks that require precise reference to examples given in the prompt
  • Exact retrieval/copying: Tasks that need to reproduce a specific part of the input verbatim
  • Complex dependency tracking: Tasks involving precise relationships between distant tokens

For these tasks, full attention -- where every token can directly reference every other token -- has a structural advantage.

The key hybrid insight: Most layers only need linear-complexity SSM/linear attention. Only a small number of layers need full attention.

Think of it this way. When reading a book, most of the time you follow the narrative sequentially (the SSM's job). But occasionally you need to flip back to an earlier page to check an exact detail (attention's job). You do not need to re-read from page one for every sentence.

Let us now look at the three implementations of this idea.

Case 1: NVIDIA Nemotron 3 Nano (30B-A3B)

NVIDIA's Nemotron 3 Nano is one of the most complete implementations of the hybrid architecture.

Architecture Breakdown

Spec                 | Value
Total Parameters     | 31.6B
Active Parameters    | 3.2B (~10%)
Total Layers         | 52
Mamba-2 Layers       | 23 (44%)
MoE FFN Layers       | 23 (44%)
GQA Attention Layers | 6 (12%)
Context Window       | 128K tokens

Out of 52 layers, only 6 are attention layers. The remaining 46 use Mamba-2 (linear complexity) and MoE (conditional computation).
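To make the layer accounting concrete, here is one hypothetical way such a stack could be declared. The attention positions in ATTN_AT are assumptions for illustration; the real interleaving is not specified here:

```python
from collections import Counter

# Hypothetical 52-layer stack in the spirit of Nemotron 3 Nano:
# 23 Mamba-2 + 23 MoE FFN + 6 GQA attention layers.
ATTN_AT = {5, 12, 21, 30, 39, 46}    # assumed positions, spread through the depth

def layer_type(i: int) -> str:
    if i in ATTN_AT:
        return "gqa_attention"       # the only layers that keep a KV-cache
    return "mamba2" if i % 2 == 0 else "moe_ffn"

stack = [layer_type(i) for i in range(52)]
print(Counter(stack))                # mamba2: 23, moe_ffn: 23, gqa_attention: 6
```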

Why This Ratio?

NVIDIA ran ablation experiments varying the attention ratio from 0% to 100%:

  • 0% (pure Mamba): Significant degradation on in-context retrieval tasks
  • 10-15% attention: Performance recovery to match pure Transformers on most benchmarks
  • 25%+: Diminishing returns from additional attention layers

At 6 of 52 layers (roughly 11.5% attention), Nemotron 3 Nano sits right at the sweet spot these experiments identified.

MoE Configuration

The 23 MoE layers each contain 16 experts, with top-4 activated per token. This maintains the knowledge capacity of a 31.6B model while keeping actual computation at the 3.2B level.
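A minimal sketch of top-4-of-16 routing for a single token follows. The gate matrix and toy ReLU experts are illustrative stand-ins, not Nemotron's actual design:

```python
import numpy as np

def moe_forward(x, W_gate, experts, top_k=4):
    """Route one token through its top-k experts (here: top-4 of 16)."""
    logits = W_gate @ x                       # (n_experts,) router scores
    top = np.argsort(logits)[-top_k:]         # the 4 highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # softmax over the selected experts only
    # Only 4 of 16 expert FFNs actually run, so per-token compute stays low
    # even though all 16 experts' parameters contribute to model capacity.
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(2)
d, n_experts = 64, 16
W_gate = rng.normal(size=(n_experts, d))
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda x, W=W: np.maximum(W @ x, 0.0) for W in Ws]  # toy ReLU "expert"
y = moe_forward(rng.normal(size=d), W_gate, experts)
```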

Performance Summary

Nemotron 3 Nano significantly outperforms dense models of similar active parameter count and competes with models 10x larger. Highlights:

  • Averages 5-8 percentage points higher benchmark scores than Phi-4-mini (3.8B dense)
  • Matches or exceeds Llama 3.1 8B performance with only 3.2B active parameters
  • Runs real-time inference on NVIDIA Jetson edge devices

The KV-cache savings are dramatic. Only 6 attention layers require KV-cache, so KV-cache memory usage drops by roughly 88% compared to a pure Transformer.

Case 2: Qwen 3.5 Small Series -- The Power of Gated DeltaNet

Alibaba's Qwen team took a different linear layer -- DeltaNet-family linear attention -- to build their hybrid.

Architecture Core

The key design in Qwen 3.5 Small is a 3:1 ratio of Gated DeltaNet to softmax attention. For every 4 layers, 3 use Gated DeltaNet (linear complexity) and 1 uses full softmax attention. This puts the attention ratio at 25%, matching the theoretical optimum from Mamba-3 exactly.
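The 3:1 interleave is simple enough to write down directly. This pattern generator is a hypothetical illustration of the ratio, not Qwen's actual layer schedule:

```python
# Qwen 3.5 Small-style interleave: in every block of 4 layers,
# 3 use Gated DeltaNet (linear) and 1 uses full softmax attention -> 25% attention.
def mixer_for(layer_idx: int) -> str:
    return "softmax_attention" if layer_idx % 4 == 3 else "gated_deltanet"

layers = [mixer_for(i) for i in range(48)]   # a 48-layer stack as an example
assert layers.count("softmax_attention") / len(layers) == 0.25
```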

What is Gated DeltaNet?

Recall standard softmax attention:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V

The softmax couples every query with every key, forcing the full n x n score matrix to be materialized -- this is what causes O(n^2). Linear attention removes the softmax and instead applies a feature map phi to the queries and keys:

LinearAttn(Q, K, V) = phi(Q) * (phi(K)^T * V)

The key trick is the order of operations. Computing phi(K)^T * V first yields a d x d matrix. Multiplying phi(Q) by that matrix costs O(n * d^2) -- linear in sequence length.
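The reassociation is easy to verify numerically. The feature map phi below (ELU + 1) is one common choice in the linear attention literature; this demo is non-causal and unnormalized, which is enough to show that the two orderings agree while costing very different amounts:

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1: a positive feature map

rng = np.random.default_rng(3)
n, d = 512, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

quadratic = (phi(Q) @ phi(K).T) @ V   # O(n^2 d): materializes the n x n matrix
linear    = phi(Q) @ (phi(K).T @ V)   # O(n d^2): only a d x d intermediate

print(np.allclose(quadratic, linear)) # True -- same result, different cost
```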

DeltaNet applies the delta rule on top of this. When updating the state S at each step, it "erases the old relevant information" before "writing new information":

S_t = S_{t-1} - beta_t * (S_{t-1} * k_t) * k_t^T + beta_t * v_t * k_t^T

This is like an associative memory that overwrites old associations with new ones. Gated DeltaNet adds a gating mechanism (alpha) for finer control over what to retain and what to discard.
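The update rule transcribes directly into code. The scalar alpha gate below is added in the spirit of Gated DeltaNet; its exact placement and parameterization here are assumptions for illustration:

```python
import numpy as np

def gated_delta_step(S, k, v, beta, alpha):
    """One Gated DeltaNet-style update (shapes: S (d, d); k and v (d,))."""
    S = alpha * S                        # gate: decay old associations
    pred = S @ k                         # what the memory currently returns for key k
    S = S - beta * np.outer(pred, k)     # delta rule: erase the old association...
    S = S + beta * np.outer(v, k)        # ...then write the new one
    return S

rng = np.random.default_rng(4)
d = 32
S = np.zeros((d, d))
for _ in range(100):                     # O(n) scan: one fixed-cost update per token
    k = rng.normal(size=d); k /= np.linalg.norm(k)
    S = gated_delta_step(S, k, rng.normal(size=d), beta=0.5, alpha=0.98)
y = S @ k                                # query the associative memory with a key
```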

9B Beats 120B

The most striking result from the Qwen 3.5 Small series is the 9B model's performance.

Benchmark  | Qwen 3.5 Small 9B | GPT-OSS-120B | Notes
MMLU-Pro   | Higher            | Lower        | 9B exceeds 120B
HumanEval+ | Higher            | Comparable   | Coding ability on par
Throughput | ~10x              | 1x           | Active parameter difference

A 9B hybrid model beating a 120B dense Transformer on benchmarks demonstrates that this architectural shift is not just an efficiency play -- it can deliver genuine performance gains.

Case 3: Mamba-3 -- The Theoretical Contribution at ICLR 2026

The team led by Albert Gu and Tri Dao -- the original Mamba authors -- presented Mamba-3 at ICLR 2026. It is more a theoretical framework than a production model.

Key Contributions

1. Theoretical Derivation of the Optimal Ratio

The Mamba-3 paper provides a mathematical answer to "why is ~75% linear layers + ~25% attention layers optimal?"

The core argument:

  • Linear layers (SSM/linear attention) are optimal for processing sequential "flow" at O(n) complexity
  • Full attention is O(n^2) but irreplaceable for tasks requiring precise retrieval
  • Analyzing the actual task distribution of language models shows that precise retrieval is needed in roughly 20-30% of computation
  • Therefore, 25% attention suffices, and replacing the remaining 75% with linear layers dramatically reduces overall inference cost (see the cost sketch below)
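Under simple assumptions (each attention layer's token mixing costs ~n^2 * d, each linear layer's ~n * d^2, FFN/MoE cost ignored), the claimed saving can be sanity-checked:

```python
# Rough token-mixing cost of a 25%-attention hybrid vs. a pure Transformer.
d = 4096                                      # assumed hidden dimension
for n in [8_192, 32_768, 131_072]:
    pure   = 1.00 * n * n * d                 # every layer pays quadratic attention
    hybrid = 0.25 * n * n * d + 0.75 * n * d * d
    print(f"n={n:>7,}: hybrid costs ~{hybrid / pure:.1%} of pure Transformer")
```

As n grows past d, the linear term becomes negligible and the hybrid's cost approaches 25% of the pure Transformer's.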

2. Attention Layer Placement

An interesting finding concerns the optimal placement of attention layers. Rather than distributing attention layers evenly throughout the network, concentrating them at the beginning and end is more effective. Early attention layers capture the global structure of the input, while late attention layers perform precise retrieval needed for output generation.

3. Unified Framework for SSM and Linear Attention

Extending the SSD (Structured State Space Duality) discovered in Mamba-2, Mamba-3 unifies the Mamba family and the DeltaNet family under a single mathematical framework. This proves that "which linear layer to use" is an implementation choice -- they are fundamentally the same computational class.

This explains why NVIDIA chose Mamba-2 and Qwen chose Gated DeltaNet, yet both achieve comparable results.

Performance Comparison: The Hybrid Advantage in Numbers

A side-by-side look at the key characteristics:

Property                   | Nemotron 3 Nano | Qwen 3.5 Small 9B | Pure Transformer (comparable)
Total Parameters           | 31.6B           | ~9B               | ~8B
Active Parameters          | 3.2B            | ~9B               | ~8B
Linear Layer Ratio         | 88%             | 75%               | 0%
KV-cache Size (128K)       | ~12% of full    | ~25% of full      | 100%
Inference Speed (relative) | ~3-4x           | ~2-3x             | 1x
Memory Efficiency          | Very high       | High              | Baseline

KV-Cache Savings in Practice

This is where the biggest real-world deployment difference lies.

Consider the KV-cache size for a pure Transformer at 128K context. With 40 layers, hidden dim 4096, GQA with 8 KV heads:

KV-cache = 2 (K + V) * 40 (layers) * 128,000 (seq_len) * 512 (head_dim * n_kv_heads) * 2 (bytes, FP16) = ~10.5 GB

Nemotron 3 Nano needs KV-cache for only 6 layers:

KV-cache = 2 * 6 * 128,000 * 512 * 2 = ~1.6 GB

That is roughly 85% memory savings at the same context length (the ~88% figure cited earlier compares against a 52-layer all-attention stack rather than this 40-layer baseline). This difference directly impacts how many concurrent users you can serve, the batch sizes you can run, and whether edge deployment is feasible.
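The two calculations above generalize to a small helper; the default head counts and dimensions follow the worked example:

```python
def kv_cache_gb(attn_layers, seq_len, n_kv_heads=8, head_dim=64, bytes_per=2):
    """FP16 KV-cache size in GB: K and V stored per attention layer, per token."""
    return 2 * attn_layers * seq_len * n_kv_heads * head_dim * bytes_per / 1e9

print(kv_cache_gb(40, 128_000))  # pure Transformer baseline -> ~10.5 GB
print(kv_cache_gb(6, 128_000))   # Nemotron 3 Nano's 6 attention layers -> ~1.6 GB
```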

Practical Implications: What Changes

1. Inference Cost Reduction

O(n) linear layers + conditional MoE computation + a handful of attention layers. This combination generates responses of the same quality while dramatically reducing FLOPs. For API service providers, this translates directly to margin improvement.

2. Making Long Context Practical

Processing 128K+ context at low cost becomes viable. Full codebase analysis, long document summarization, and multi-turn conversations enter the economically practical zone.

3. On-Device Deployment

KV-cache savings make LLM deployment realistic on memory-constrained mobile and edge devices. NVIDIA demonstrated real-time inference on Jetson with Nemotron 3 Nano as a proof point.

4. Training Infrastructure Changes

Training hybrid models is more complex than training pure Transformers. Mamba/DeltaNet layers and attention layers require different parallelization strategies. New training frameworks and kernel optimizations are needed, and NVIDIA's hardware-software integration capabilities become even more important in this space.

Conclusion: What the Convergence Tells Us

Since "Attention is All You Need" in 2017, the Transformer architecture has been the undisputed standard for nearly a decade. But in March 2026, the simultaneous convergence of three independent teams signals the emergence of a new standard.

The key takeaways:

  1. The pure Transformer era is ending. O(n^2) attention is not needed in every layer.
  2. 75/25 is the new default ratio. 75% linear layers + 25% attention layers is the efficiency-performance sweet spot.
  3. MoE is becoming essential. MoE decouples knowledge capacity from compute cost, forming the third pillar of the hybrid architecture.
  4. SSM and Linear Attention are the same family. Whether you use Mamba-2 or Gated DeltaNet, they are mathematically the same computational class.

From "Attention is All You Need" to "Attention is Sometimes What You Need." That is the one-line summary of the 2026 architecture trend.
