February 28, 2026 · 4 min read · Loxily Team

AI vs. Traditional Translation Blind Test: Real-World Game Localization Data

Is AI translation actually good enough to ship? Theory only goes so far—so we ran a rigorous head-to-head test on the same core SLG dialogue. Here's the data.

"Is AI translation actually good enough to ship?"

It's the question we field more than any other. Rather than argue theory, we'd rather show data. So we took a single passage of core SLG dialogue and ran a rigorous side-by-side test.


Test Design

The material: A core story-dialogue excerpt from an SLG (strategy game), packing in five distinct translation challenges:

  • Character voice consistency (an arrogant general vs. a humble strategist)
  • Game-specific terminology (Mandate Points, Awakening Stones, Expedition Orders)
  • Cultural wordplay (lines built on historical allusions)
  • UI constraints (3 button labels, each capped at 12 characters)
  • Emotional density (a character's farewell scene, 8 lines of continuous dialogue)

Group A: A traditional translation agency (industry Top 10, $0.12/word, 5-day turnaround)

Group B: AI engine + character profiles + termbase ($0.008/word, 4-hour turnaround). For a breakdown of how AI game localization works end to end, see our complete guide.

The reviewers: 3 native-English game localization reviewers (anonymized blind review)


Item-by-Item Results Across Five Dimensions

1. Accuracy

| Metric | Traditional | AI |
| --- | --- | --- |
| Term consistency rate | 87% | 99% |
| Factual errors | 2 | 0 |
| Missed translations | 1 | 0 |

The traditional translation's problems clustered around inconsistent terminology—"Awakening Stone" was rendered two different ways across the first and second halves. The AI worked from the termbase and stayed consistent throughout.
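As a rough illustration of how a term consistency rate like the one above could be measured, here is a minimal sketch. It assumes a termbase that maps each source term to one approved English rendering; the Chinese source terms, the sample lines, and the function name are hypothetical, and this is not necessarily how the reviewers scored the test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    source: str  # original-language line
    target: str  # translated line


def term_consistency_rate(segments, termbase):
    """Share of source-term occurrences whose translation uses the approved term.

    `termbase` maps a source term to its single approved target rendering.
    Returns 1.0 when no termbase term occurs in the source at all.
    """
    total = 0
    consistent = 0
    for seg in segments:
        for src_term, approved in termbase.items():
            if src_term in seg.source:
                total += 1
                # Case-insensitive match so sentence-initial capitals still count.
                if approved.lower() in seg.target.lower():
                    consistent += 1
    return consistent / total if total else 1.0


# Hypothetical source terms; only the English renderings come from the article.
termbase = {"觉醒石": "Awakening Stone", "天命点": "Mandate Points"}

segments = [
    Segment("使用觉醒石。", "Use the Awakening Stone."),
    Segment("觉醒石不足。", "Not enough Awakened Gems."),  # drifted rendering
    Segment("获得天命点。", "Mandate Points earned."),
]

rate = term_consistency_rate(segments, termbase)
print(f"{rate:.0%}")  # 2 of 3 term occurrences match, so this prints 67%
```

A substring check like this misses inflected or pluralized forms in some languages, so a production check would likely need lemmatization or per-term variants.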

2. Fluency

| Metric | Traditional | AI | AI + 15 min polish |
| --- | --- | --- | --- |
| Native fluency (1-10) | 8.3 | 7.8 | 8.5 |
| "Translationese" flags | 1 | 3 | 0 |

The AI's first draft had a faint "translationese" feel—overuse of the passive voice, clauses nested a bit too deep. But after 15 minutes of human polish, it edged ahead of the pure-human translation on fluency.

3. Voice Consistency

| Metric | Traditional | AI |
| --- | --- | --- |
| General's voice retention | 72% | 95% |
| Strategist's voice retention | 68% | 91% |

This was the most surprising dimension of all. Because the AI cross-referenced the character profiles on every single line, it pulled well ahead on holding character voice. Racing against the deadline, the agency's translator skipped the character bible, and the two characters' voices started to blur together in the second half.
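To make "cross-referencing the character profile on every line" concrete, here is one way the per-line context might be assembled before translation. This is a sketch under assumptions: the `CharacterProfile` type, the `build_line_context` helper, the profile wording, and the Chinese sample line are all hypothetical, and the engine used in the test may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterProfile:
    name: str
    voice: str  # short style note for the translator or model


def build_line_context(speaker, line, profiles, termbase):
    """Bundle one dialogue line with its speaker's profile and any matching terms."""
    profile = profiles[speaker]  # KeyError on purpose if a speaker lacks a profile
    terms = {src: tgt for src, tgt in termbase.items() if src in line}
    return {
        "speaker": profile.name,
        "voice": profile.voice,
        "terms": terms,
        "text": line,
    }


# Hypothetical profiles based on the two voices described in the test design.
profiles = {
    "general": CharacterProfile("General", "arrogant, clipped, gives orders"),
    "strategist": CharacterProfile("Strategist", "humble, measured, deferential"),
}
termbase = {"远征令": "Expedition Order"}  # hypothetical source term

context = build_line_context("general", "拿远征令来！", profiles, termbase)
print(context["voice"])
print(context["terms"])
```

The point of the design is that the profile travels with every line, so voice guidance cannot be skipped under deadline pressure the way a separate character bible can.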

4. Creative Adaptation

Test line: "此去经年,风烟俱净。"

  • Traditional: 9.1/10 — "Years will pass, and all that remains is the wind and the silence."
  • AI: 7.9/10 → 8.8/10 after polish — "In years to come, even the wind and smoke will find their peace."

Highly literary content (<5% of game text) is still where human translators hold the edge—but a quick polish pass narrows the gap dramatically.

5. Constraint Compliance

| Metric | Traditional | AI |
| --- | --- | --- |
| UI character-limit compliance | 1/3 passed | 3/3 passed |
| Format tag retention | 95% | 100% |

It is par for the course for agency translators to skip the UI spec sheet. The AI treats character limits as a hard constraint, so it never forgets them.
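Both constraint checks are easy to automate. The sketch below assumes a 12-character cap counted in Unicode code points and a tag syntax of `{placeholders}` plus `<markup>` tags; the sample labels and strings are hypothetical, and real projects often use different tag formats or measure rendered pixel width instead.

```python
import re

MAX_LABEL_CHARS = 12  # the cap stated in the test design

# Assumed tag syntax: {placeholders} and <markup> tags. Real projects vary.
TAG_PATTERN = re.compile(r"\{[^{}]*\}|<[^<>]*>")


def label_fits(label, limit=MAX_LABEL_CHARS):
    """True if the label is within the character cap (counted as code points)."""
    return len(label) <= limit


def tags_retained(source, target):
    """True if the target carries exactly the same tags as the source, in any order."""
    return sorted(TAG_PATTERN.findall(source)) == sorted(TAG_PATTERN.findall(target))


# Hypothetical button labels, not the ones from the test.
labels = ["Set Out", "Awaken Hero", "Claim All Rewards"]
for label in labels:
    status = "ok" if label_fits(label) else "too long"
    print(f"{label!r}: {len(label)} chars, {status}")

# Hypothetical tagged string pair.
src = "获得 {count} 个<b>觉醒石</b>"
print(tags_retained(src, "Gained {count} <b>Awakening Stones</b>"))
print(tags_retained(src, "Gained <b>Awakening Stones</b>"))  # {count} was dropped
```

Running checks like these on every delivery, human or AI, turns the spec sheet into a gate that cannot be overlooked.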


Overall Comparison

| Option | Cost (100K characters) | Turnaround | Composite quality score |
| --- | --- | --- | --- |
| Traditional | $12,000 | 5 days | 3.9/5 |
| AI | $800 | 4 hours | 4.4/5 |
| AI + human polish | $2,000 | 6 hours | 4.8/5 |

Compared with the traditional agency, AI + human polish delivered 83% lower cost, 23% higher quality, and 95% faster turnaround.
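The headline percentages line up with the AI + human polish row measured against the traditional row. The short check below reproduces them from the figures in the comparison table, assuming "5 days" means 120 calendar hours; the variable and function names are illustrative.

```python
def pct_change(baseline, value):
    """Signed percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100


# Figures from the Overall Comparison table.
traditional = {"cost": 12_000, "hours": 5 * 24, "quality": 3.9}
ai_polished = {"cost": 2_000, "hours": 6, "quality": 4.8}

cost_saving = -pct_change(traditional["cost"], ai_polished["cost"])
time_saving = -pct_change(traditional["hours"], ai_polished["hours"])
quality_gain = pct_change(traditional["quality"], ai_polished["quality"])

print(f"{cost_saving:.0f}% lower cost")         # 83% lower cost
print(f"{quality_gain:.0f}% higher quality")    # 23% higher quality
print(f"{time_saving:.0f}% faster turnaround")  # 95% faster turnaround
```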


When to Use AI, and When to Use Humans

| Content type | Share | Recommended approach |
| --- | --- | --- |
| System prompts, UI copy | ~30% | AI only |
| Routine NPC dialogue | ~40% | AI + spot checks |
| Main-quest story dialogue | ~20% | AI + full review |
| Cutscenes / promo copy | ~5% | AI first draft + human transcreation |
| Marketing assets | ~5% | Human transcreation |

The core principle: let AI handle 80% of the correctness work, and let humans focus on the 20% that's truly creative. Getting this split right matters beyond cost—localization quality directly shapes player retention, so the dimensions we tested above are the same ones that keep players engaged.
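In a pipeline, the split above could be encoded as a simple routing table. This is a minimal sketch: the shares and approaches come from the table, while the dictionary keys, the function names, and the choice to fall back to human transcreation for unknown content types are assumptions made for illustration.

```python
# Shares (approximate, in percent) and approaches as given in the table;
# the keys are hypothetical identifiers, not part of any real pipeline.
ROUTING = {
    "system_ui": (30, "AI only"),
    "npc_dialogue": (40, "AI + spot checks"),
    "main_story": (20, "AI + full review"),
    "cutscene_promo": (5, "AI first draft + human transcreation"),
    "marketing": (5, "Human transcreation"),
}


def approach_for(content_type):
    """Return the recommended approach, defaulting to the most cautious one."""
    _, approach = ROUTING.get(content_type, (0, "Human transcreation"))
    return approach


def ai_involved_share():
    """Percentage of text where AI produces at least the first draft."""
    return sum(share for share, approach in ROUTING.values() if "AI" in approach)


print(approach_for("npc_dialogue"))
print(approach_for("unlabeled_text"))  # unknown types fall back to humans
print(ai_involved_share())
```

Defaulting unknown content to the human path is a deliberately conservative choice: mislabeled text costs more in review time, but it never ships unreviewed.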
