InternVL-U: Understanding + Generation + Editing in One 4B Model -- A New Standard for Unified Multimodal AI

Shanghai AI Lab's InternVL-U is a single 4B-parameter model that handles image understanding, generation, editing, and reasoning-based generation. Using decoupled visual representations, it outperforms the 14B BAGEL on GenEval and DPG-Bench.

There's been a long-standing goal in multimodal AI: a single model that can understand, generate, and edit images. Previously, each task required a separate model. Image understanding used InternVL, generation used Stable Diffusion, editing used InstructPix2Pix -- pipelines became complex, and knowledge sharing between models was impossible.

InternVL-U, released by Shanghai AI Lab in March 2026, tackles this problem head-on. With just 4B parameters in a single model, it handles multimodal understanding, text-to-image generation, image editing, and reasoning-based generation. It outperforms the 14B-parameter BAGEL on GenEval (0.85 vs 0.82) and DPG-Bench (85.18 vs 85.07).

The secret lies in an architectural design called Decoupled Visual Representation.

The Unified Multimodal Dilemma: Limits of a Single Representation

Previous unified multimodal models (Emu3, Show-o, Janus) tried to handle both understanding and generation with a single visual tokenizer. This creates a fundamental conflict.

What Understanding requires:

  • High-level semantic features (recognizing that something is a "cat")
  • Object relationships (the cat is sitting "on" the mat)
  • Overall scene context

What Generation requires:

  • Low-level pixel information (exact colors, textures)
  • Spatial precision (exact position and size of objects)
  • Visual details (shadows, reflections, textures)

A single representation cannot excel at both. Optimizing for understanding degrades generation quality; optimizing for generation weakens understanding capability. This is the "representation conflict" problem.

InternVL-U's Solution: Decoupled Visual Representation

InternVL-U's core idea is simple:

Use different visual representations for understanding and generation.

| Pipeline | Component | Purpose | Feature Type |
| --- | --- | --- | --- |
| Understanding | Pre-trained ViT (InternViT-300M) | Image recognition/reasoning | High-level semantic features |
| Generation | VAE (Qwen-Image) | Image generation/editing | Low-level continuous latent representation |

The ViT focuses exclusively on understanding, and the VAE focuses exclusively on generation. Because the two representations do not interfere with each other's learning, each can be optimized for its own role.

Architecture Details: Three Modules

InternVL-U is a 4B parameter model composed of three modules.

Module 1: Visual Understanding Encoder (InternViT-300M)

  • Parameters: ~300M
  • Structure: 24 Transformer layers, hidden size 1024, 16 attention heads
  • Role: Extract high-level semantic features from raw pixels
  • Token processing: Encode image patches into 1024 visual tokens → compress to 256 tokens via pixel shuffle
  • Resolution: Dynamic High Resolution strategy with 448x448 tile splitting
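The token counts in Module 1 follow from simple arithmetic. The sketch below assumes a 14-pixel patch size and a 2x2 pixel shuffle, neither of which is stated in this post; they are chosen because they reproduce the 1024 → 256 figures for one 448x448 tile. The function name `visual_token_count` is hypothetical.

```python
def visual_token_count(tile_px: int = 448, patch_px: int = 14, shuffle: int = 2) -> tuple[int, int]:
    """Return (tokens before pixel shuffle, tokens after) for one square tile.

    patch_px and shuffle are assumptions, not values taken from the post.
    """
    patches_per_side = tile_px // patch_px      # 448 // 14 = 32
    raw_tokens = patches_per_side ** 2          # 32 * 32 = 1024
    # Pixel shuffle folds each shuffle x shuffle block of neighbouring
    # tokens into a single token with more channels.
    compressed_tokens = raw_tokens // (shuffle ** 2)
    return raw_tokens, compressed_tokens


raw, compressed = visual_token_count()
print(raw, compressed)  # 1024 256
```

Under the same assumptions, an image split into several 448x448 tiles would cost 256 tokens per tile in the MLLM context.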

Module 2: Context Backbone / MLLM (InternVL3.5-2B)

  • Parameters: ~2B
  • Structure: 28 Transformer layers (Qwen-series LLM backbone)
  • Role: Text generation, semantic reasoning, bridge between understanding and generation
  • Pattern: ViT-MLP-LLM architecture (InternVL family standard)

This module is the central hub. It processes text tokens and visual tokens in a shared latent space, converting understanding results into conditioning signals for the generation module.

Module 3: Visual Generation Head (Custom MMDiT, 1.7B)

  • Parameters: ~1.7B
  • Structure: 20 Transformer layers, 12 attention heads
  • Key innovations:
      - Gating mechanism within attention blocks: first of its kind in an MMDiT architecture
      - Multimodal Scalable RoPE (MSRoPE): variable resolution handling
      - Flow Matching: velocity parameterization instead of standard diffusion noise prediction
  • VAE: Same VAE as Qwen-Image for conversion between continuous latent space and pixel space
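Flow Matching with velocity parameterization can be illustrated on a toy vector. This is a minimal sketch assuming the common linear (rectified-flow style) path between noise and data; the post does not say which path or time convention InternVL-U uses, and the function names are illustrative only.

```python
import random


def flow_matching_pair(data, noise, t):
    """Point on the straight path from noise (t=0) to data (t=1), plus its velocity."""
    x_t = [(1.0 - t) * n + t * d for d, n in zip(data, noise)]
    # On a straight path the velocity is constant: d(x_t)/dt = data - noise.
    velocity = [d - n for d, n in zip(data, noise)]
    return x_t, velocity


def velocity_loss(predicted, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


rng = random.Random(0)
T = 0.3
data = [0.5, -1.0, 2.0]                       # stand-in for a VAE latent
noise = [rng.gauss(0.0, 1.0) for _ in data]   # Gaussian starting point

x_t, target_velocity = flow_matching_pair(data, noise, T)

# A model that predicted the exact velocity would reach the data by
# integrating from x_t over the remaining time (1 - T).
recovered = [x + (1.0 - T) * v for x, v in zip(x_t, target_velocity)]
```

The contrast with standard diffusion is the regression target: the network learns the velocity along the path rather than the noise that was added.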

Inter-Module Connection

The unified hidden states produced by the MLLM backbone serve as conditioning signals for the MMDiT generation head. Dual projectors and variance normalization reconcile the difference in feature distribution between the VLM branch and the generation head.
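As a sanity check, the approximate module sizes quoted above add up to the headline 4B figure. The tally below uses only numbers from this post, in millions of parameters; the dictionary and variable names are illustrative, and the VAE is left out because its size is not given here.

```python
# Approximate sizes from the module descriptions, in millions of parameters.
MODULE_PARAMS_M = {
    "InternViT-300M (understanding encoder)": 300,
    "InternVL3.5-2B (MLLM backbone)": 2000,
    "Custom MMDiT (generation head)": 1700,
}

total_m = sum(MODULE_PARAMS_M.values())
shares = {name: count / total_m for name, count in MODULE_PARAMS_M.items()}

for name, count in MODULE_PARAMS_M.items():
    print(f"{name}: {count}M ({shares[name]:.1%})")
print(f"total: {total_m / 1000:.1f}B")
```

On these figures, the generation head accounts for a little over 40% of the model, and the understanding encoder for under 10%.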

Four Operating Modes

InternVL-U performs four tasks from a single checkpoint:

| Mode | Input | Output | Example |
| --- | --- | --- | --- |
| Text generation | Image + Text | Text | "What amino acids are visible in this image?" |
| Image generation | Text | Image | "A futuristic city at sunset" |
| Image editing | Image + Instruction | Edited image | "Change the sky to sunset colors" |
| Reasoning-based generation | Text | CoT text + Image | "Generate a physics diagram" |

The fourth mode is the most distinctive. It uses Chain-of-Thought to decompose an abstract prompt ("generate happiness") into specific visual elements, emotional intent, and typographic constraints before generating the image.

Training Pipeline: 3 Stages

Stage 1: Generation Head Pre-training

  • Steps: 250,000
  • Resolution: Fixed 512px
  • MLLM: Frozen (only MMDiT trains)
  • Data: T2I : Editing = 4:1
  • Goal: Train MMDiT to generate images conditioned on MLLM hidden states

Stage 2: Variable Resolution Pre-training

  • Steps: 60,000
  • Resolution: Variable, 512 to 1024px
  • MLLM: Frozen
  • Goal: Variable resolution adaptation + strict aesthetic filtering

Stage 3: Unified SFT (Full Model Training)

  • Steps: 20,000
  • MLLM: Unfrozen (full end-to-end training)
  • Data: Generation : Editing : Understanding = 1:1:2
  • Loss weights: NTP : VP = 1:20
  • Goal: Unified optimization including CoT reasoning data
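The three-stage schedule and the Stage 3 weighting can be written down as a small configuration sketch. The step counts, freeze flags, 1:1:2 data mix, and 1:20 loss ratio come from the lists above; everything else is an assumption. In particular, NTP is read here as next-token prediction and VP as the velocity prediction loss of the generation head, the absolute weights 1.0 and 20.0 are just one way to realise the ratio, and all class and function names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int
    mllm_frozen: bool


STAGES = [
    Stage("generation head pre-training", 250_000, True),
    Stage("variable resolution pre-training", 60_000, True),
    Stage("unified SFT", 20_000, False),
]
total_steps = sum(stage.steps for stage in STAGES)

# Stage 3 loss ratio NTP : VP = 1 : 20 (absolute scale is an assumption).
NTP_WEIGHT = 1.0
VP_WEIGHT = 20.0

# Stage 3 data ratio Generation : Editing : Understanding = 1 : 1 : 2.
TASK_MIX = {"generation": 1, "editing": 1, "understanding": 2}


def unified_loss(ntp_loss: float, vp_loss: float) -> float:
    """Weighted sum of the text loss and the image (velocity) loss."""
    return NTP_WEIGHT * ntp_loss + VP_WEIGHT * vp_loss


def mix_fractions() -> dict:
    """Fraction of Stage 3 samples drawn from each task."""
    total = sum(TASK_MIX.values())
    return {task: weight / total for task, weight in TASK_MIX.items()}


def sample_task(rng: random.Random) -> str:
    """Draw the task for the next training sample according to TASK_MIX."""
    tasks = list(TASK_MIX)
    return rng.choices(tasks, weights=[TASK_MIX[t] for t in tasks], k=1)[0]


print(total_steps)       # 330000
print(mix_fractions())   # understanding makes up half of the Stage 3 mix
```

Read this way, the MLLM is unfrozen for only 20,000 of the 330,000 total steps, so most of the training budget goes to the generation head alone.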

Data Synthesis Pipeline

One of InternVL-U's strengths is its synthetic data across 5 domains:

  1. Text-centric: Bilingual (Chinese/English) text rendering
  2. Science-centric: Physics diagrams, computer science visualizations
  3. Spatial-centric: Solid geometry, CAD multi-view, 3D rotation
  4. Humor-centric: Meme generation/editing
  5. Reasoning-centric (CoT): Chain-of-Thought augmentation for general, knowledge, meme, and scientific images

Benchmark Results

Image Generation (GenEval)

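| Model | Parameters | Single Object | Two Objects | Counting | Colors | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| InternVL-U | 4B | 0.99 | 0.94 | 0.74 | 0.91 | 0.85 |
| BAGEL | 14B | -- | -- | -- | -- | 0.82 |
| Janus-Pro | 7B | -- | -- | -- | -- | 0.80 |
| Qwen-Image | 20B | -- | -- | -- | -- | 0.87 |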

At 4B parameters, InternVL-U beats the 14B BAGEL (0.85 vs 0.82) and comes within 0.02 of Qwen-Image (0.87), a 20B model five times its size.

Multimodal Understanding

| Benchmark | InternVL-U (4B) | BAGEL (14B) | Janus-Pro (7B) |
| --- | --- | --- | --- |
| OCRBench | 83.9 | 73.3 | 48.7 |
| MMMU | 54.7 | 55.3 | 36.3 |
| MME-P | 1607.5 | 1687.0 | 1444.0 |

InternVL-U leads BAGEL by 10.6 points on OCRBench. On MMMU it trails by just 0.6 points, and BAGEL also stays ahead on MME-P (1687.0 vs 1607.5).

Image Generation (DPG-Bench)

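| Model | Global | Entity | Attribute | Relation | Overall |
| --- | --- | --- | --- | --- | --- |
| InternVL-U | 90.39 | 90.78 | 90.68 | 90.29 | 85.18 |
| BAGEL | -- | -- | -- | -- | 85.07 |
| Janus-Pro | -- | -- | -- | -- | 84.19 |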

On DPG-Bench as well, the 3.5x smaller model edges out BAGEL, though only by a narrow 0.11 points.
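The margins quoted in this section can be recomputed directly from the tables. The snippet below uses only the scores listed above; the dictionary layout and the helper name `margin` are illustrative.

```python
# (InternVL-U score, BAGEL score) per benchmark, copied from the tables above.
SCORES = {
    "GenEval": (0.85, 0.82),
    "DPG-Bench": (85.18, 85.07),
    "OCRBench": (83.9, 73.3),
    "MMMU": (54.7, 55.3),
    "MME-P": (1607.5, 1687.0),
}


def margin(benchmark: str, digits: int = 2) -> float:
    """InternVL-U minus BAGEL; positive means InternVL-U is ahead."""
    internvl_u, bagel = SCORES[benchmark]
    return round(internvl_u - bagel, digits)


for name in SCORES:
    print(f"{name}: {margin(name):+}")

size_ratio = 14 / 4  # BAGEL parameters / InternVL-U parameters
print(f"BAGEL is {size_ratio}x larger")
```

The signs make the overall picture easy to read: InternVL-U is ahead on both generation benchmarks and OCRBench, and behind on MMMU and MME-P.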

Practical Usage

```python
import torch
from PIL import Image
from internvlu import InternVLUPipeline

# Load the single checkpoint once; every mode below reuses it.
pipeline = InternVLUPipeline.from_pretrained(
    "InternVL-U/InternVL-U",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Image understanding: image + question in, text out
output = pipeline(
    prompt="What animals do you see in this photo?",
    image=Image.open("cat.jpg").convert("RGB"),
    generation_mode="text",
)

# Image generation: text in, image out (seeded for reproducibility)
output = pipeline(
    prompt="A futuristic city at sunset",
    height=576, width=1024,
    generation_mode="image",
    generator=torch.Generator(device="cuda").manual_seed(42),
)

# Image editing: same generation_mode as generation, plus a source image
output = pipeline(
    prompt="Change the sky to sunset colors",
    image=Image.open("photo.jpg").convert("RGB"),
    generation_mode="image",
)
```

Why Does 4B Beat 14B?

Two key factors:

1. Decoupled representations eliminate optimization conflicts

BAGEL has 14B parameters, but on this account its understanding and generation share representations and interfere with each other's learning. InternVL-U completely separates the ViT and the VAE, letting each focus on its own role, so fewer parameters are spent compensating for that conflict.

2. CoT data augmentation

Chain-of-Thought training that decomposes abstract user instructions into concrete visual elements makes a particularly large difference in text rendering and knowledge-intensive generation.

Conclusion

What InternVL-U demonstrates is that "size isn't everything."

  1. Decoupling is key: Forcing the same representation for understanding and generation hurts both
  2. 4B can beat 14B: Architecture design matters more than parameter count
  3. Unified models are now practical: Understanding + generation + editing from a single checkpoint, MIT licensed
  4. CoT is effective for generation too: Reasoning-based generation opens a new direction
