Fine-tuning Gemma 4 MoE — Customizing Arena #6 with 3.8B Active Parameters
Apply QLoRA to Gemma 4 26B MoE. Expert layer LoRA strategies, Dense vs MoE comparison, MoE-specific training tips, and Ollama deployment. LoRA Series Part 4.
Series: Part 1: LoRA Theory | Part 2: QLoRA + Custom Data | Part 3: Eval + Deploy | Part 4 (this post)
Parts 1-3 covered LoRA fundamentals through evaluation and deployment using Qwen 2.5 7B. Part 4 levels up: we apply QLoRA to a Gemma 4 Mixture-of-Experts (MoE) model.
Why Gemma 4? Three reasons:
- MoE architecture: 26B total parameters, but only 3.8B active per token. Inference cost is that of a 4B-class dense model, yet it ranks #6 on the Arena leaderboard
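The "26B total, 3.8B active" split is the key MoE trade-off: every expert's weights must sit in memory, but each token is routed through only the top-k experts. A minimal sketch of that accounting, using hypothetical numbers (the shared/expert split and top-k below are illustrative assumptions, not the published Gemma 4 config):

```python
def moe_params(shared: float, n_experts: int, expert: float, top_k: int):
    """Parameter accounting for a MoE model.

    total  counts every expert (what you must store and fine-tune),
    active counts only the top_k experts routed per token (what you
    pay for at inference time).
    """
    total = shared + n_experts * expert
    active = shared + top_k * expert
    return total, active

# Hypothetical split: ~2.3B shared params (attention, embeddings)
# plus 32 experts of ~0.74B each, routing top-2 experts per token.
total, active = moe_params(shared=2.3e9, n_experts=32, expert=0.74e9, top_k=2)
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.2f}B")
# → total ≈ 26.0B, active ≈ 3.78B
```

This is also why QLoRA fits MoE models well: quantization shrinks the full 26B footprint that must be resident, while LoRA adapters only need to track the layers you target.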