We Benchmarked MiniCPM-o 4.5 in Korean. Here's What Actually Happens.
We benchmarked MiniCPM-o 4.5's Korean performance side by side with English. Image descriptions, OCR, document extraction — what works, what breaks, and why the root cause is architecture, not prompts.

MiniCPM-o 4.5 is an omni model optimized for English and Chinese. How well does it handle Korean?
We tested with the same images, same questions — one in Korean, one in English, side by side. Image description, OCR, document extraction, and fine-tuning, all tested hands-on.
The short answer: Korean works. But there are fascinating failure modes, and the root cause isn't what you'd expect.
Test Setup
| Item | Spec |
|---|---|
| Model | MiniCPM-o 4.5 (BF16, 17.6GB VRAM) |
| Framework | transformers 4.51.0, PyTorch 2.x |
| Method | Same image + semantically identical Korean/English prompts |
| Decoding | sampling=True, temperature=0.7, repetition_penalty=1.2 |
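The setup above boils down to one harness: pair each image with semantically identical Korean/English questions, run both through the model with the same decoding parameters, and compare the answers. A minimal sketch, assuming the `model.chat(...)` interface shown in published MiniCPM examples (the exact keyword names, including `system_prompt`, are assumptions, not a verified API):

```python
# Hypothetical side-by-side harness for the setup described above.
# model.chat(...) and its keyword names follow published MiniCPM
# examples; treat the exact signature as an assumption.

# Decoding parameters from the table above.
GEN_KWARGS = dict(sampling=True, temperature=0.7, repetition_penalty=1.2)

def build_runs(image, questions):
    """Pair one image with semantically identical ko/en questions,
    producing one msgs payload per language."""
    return {
        lang: [{"role": "user", "content": [image, questions[lang]]}]
        for lang in ("ko", "en")
    }

def compare(model, tokenizer, system_prompts, image, questions):
    """Run the same image through both language prompts and return
    the two answers keyed by language, for side-by-side comparison."""
    return {
        lang: model.chat(
            msgs=msgs,
            tokenizer=tokenizer,
            system_prompt=system_prompts[lang],  # assumed kwarg name
            **GEN_KWARGS,
        )
        for lang, msgs in build_runs(image, questions).items()
    }
```

Keeping `GEN_KWARGS` and the image fixed across both runs is the point of the design: the only variable left is the language of the prompt.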
System prompts were set per language:
```python
system_prompts = {
    # Korean prompt translates to: "You are a Korean-language specialist
    # assistant. Answer only in Korean. Do not mix in other languages
    # such as Chinese, English, or Russian."
    "ko": "당신은 한국어 전문 어시스턴트입니다. 반드시 한국어로만 답변하세요. 중국어, 영어, 러시아어 등 다른 언어를 섞지 마세요.",
    "en": "You are a helpful assistant. Respond only in English.",
}
```
What Works Well
Image Description: Eiffel Tower