KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

arXiv cs.CL Papers

Summary

KMMMU is a native Korean benchmark for evaluating multimodal understanding, with 3,466 questions spanning nine disciplines and nine visual modality categories. It addresses the gap left by English-centric benchmarks by testing performance in Korean-specific cultural and institutional contexts.

arXiv:2604.13058v2

# KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

Source: https://arxiv.org/html/2604.13058

Nahyun Lee¹,⁵ Guijin Son²,⁴,⁵ Hyunwoo Ko⁴,⁵ Chanyoung Kim³,⁵ Junyoung An² Kyubeen Han⁵ Il-Youp Kwak¹

¹Chung-Ang University ²Seoul National University ³SK A.X ⁴OnelineAI ⁵HAE-RAE

## Abstract

We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some disciplines emerging as bottlenecks, and Korean-specific questions showing gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert real-world tasks.

Dataset is available at https://huggingface.co/datasets/HAERAE-HUB/KMMMU

## 1 Introduction

Multimodal Large Language Models (MLLMs) have shown strong performance on a range of vision–language tasks, including visual recognition, document understanding, and multimodal question answering. However, existing benchmarks do not fully reflect the settings in which these models are increasingly deployed. Past evaluations are either English-centric or derived from translated datasets, making them less suitable for assessing performance on tasks shaped by local institutional conventions, discipline-specific formats, and information-dense visual materials in non-English contexts.

![Figure 1: Comparison of English (MMMU, MMMU-Pro), Japanese (JMMMU, JMMMU-Pro), and Korean (others) multimodal benchmarks. Each point is positioned by benchmark size (x-axis, log scale) and difficulty proxy (100−peak public score), with lighter colors indicating more recent releases. Shaded regions mark two common limitations: small size (left) and low headroom (bottom).](image1)

To address this gap, we introduce KMMMU, a native Korean benchmark for expert-level multimodal understanding. KMMMU contains 3,466 questions drawn from Korean assessment sources, spanning nine disciplines, nine visual modality categories, and both multiple-choice and open-form question formats. Beyond broad evaluation, the benchmark is designed to diagnose localized knowledge, expert reasoning, and discipline- and modality-specific weaknesses. To support this analysis, we construct a **hard subset** of questions jointly missed by three baseline models, as well as a **Korean-specific subset** targeting domestic legal, administrative, and institutional knowledge.

![Figure 2: Examples of KMMMU questions. Examples include the original questions, associated images, English translations, and metadata such as visual modality, question format, and Korean-specific labels.](image2)

Experiments on KMMMU reveal several consistent findings. Current models remain far from robust, with the strongest open-source model reaching 42.05% on the full set and the best proprietary model reaching 52.42% on the hard subset. Performance varies substantially across disciplines, and gains from model scale and explicit reasoning are uneven. Korean-specific questions remain particularly challenging, with accuracy gaps of up to 13.43% relative to non-Korean-specific items. These results show that strong general multimodal ability does not automatically transfer to Korean institutional and cultural contexts.

## 2 Related Work

In recent years, a diverse range of Korean multimodal benchmarks has been introduced, including KRETA for text-rich VQA, KoNET for exam-based educational assessment, and KorMedMCQA-V for medical reasoning, alongside resources targeting free-form VQA (KOFFVQA), cultural understanding (K-Viscuit), under-specified user queries (HAERAE-Vision), translated benchmark variants (K-MMBench, K-SEED), and document-centric reasoning (K-DTCBench). Despite this diversity, most existing benchmarks remain limited in coverage, and many are already saturated for current models. This calls for a larger and more challenging benchmark.

Harvesting questions from existing examinations is a common strategy for benchmark construction. Benchmarks such as MMLU, MMMU, and M3Exam all draw on exam-style questions to evaluate broad knowledge and reasoning, and related efforts have extended this paradigm to local languages and cultural contexts, as in JMMMU for Japanese and CMMMU for Chinese. This approach is valuable because exam questions offer scale, disciplinary breadth, and an interpretable link to human expertise, making them useful proxies for general capability even when the evaluation format is limited to multiple-choice or short-form responses.

**So why another X-MMMU benchmark?** The Korean case further highlights why localized benchmarks remain necessary. KMMLU, for instance, is constructed from original Korean exams rather than translations, thereby capturing linguistic and cultural factors that translated benchmarks often miss. Similarly, KMMLU-Pro shows that the gap between translated MMMLU and locally authored Korean professional exams is relatively small in medicine but substantially larger in law-related domains, where country-specific knowledge is indispensable. Together, these findings underscore the need for localized MMMU-style benchmarks tailored to each linguistic and cultural context.

As suggested by Figure 1, the current landscape still reflects a trade-off between breadth, realism, and headroom. Translation-based benchmarks improve comparability with established English suites, but they largely inherit the structure and limitations of their source tasks. More realistic or culturally grounded benchmarks capture important failure modes, including cultural reasoning, text-rich understanding, and under-specified real-world queries, yet they are often narrower in scope or smaller in scale. Moreover, most existing Korean benchmarks already lie in the low-headroom region, while HAERAE-Vision, although comparatively difficult, derives much of its challenge from deliberate under-specification rather than broad coverage of general capabilities. Accordingly, there remains a clear need for a large-scale Korean multimodal benchmark that is broad in coverage, grounded in local context, and sufficiently unsaturated to differentiate frontier models.

## 3 The KMMMU Benchmark

### 3.1 Data Collection and Annotation

KMMMU is constructed from Korean-native official examinations and competitions. These sources include the civil service recruitment exam (PSAT), the National Technical Qualifications (NTQ), the National Competency Standards exam (NCS), and academic Olympiads (see Appendix A for details). We initially collect approximately 68k raw instances and process the collected exam materials into structured multimodal instances using automated extraction, followed by manual verification. Technical qualification data are collected through web crawling, while other sources are digitized using the MinerU-2.5 OCR system. To correct OCR artifacts and validate image cropping, we built a custom verification interface. Five Korean annotators use this interface to review the dataset, refine LaTeX formulas, verify image references, and discard illegible questions (see Appendix B for details). We also expect this sourcing to reduce contamination risk: because a large portion of the dataset is acquired from PDF documents, it is less likely to appear in large-scale web-crawled datasets. We provide additional ablation studies in Appendix I.

### 3.2 KMMMU Dataset Construction

To ensure benchmark difficulty, we apply a multi-stage adversarial filtering pipeline that removes instances solvable by one or more of the following models: Phi-3.5-Vision-Instruct, InternVL-3.5-38B, Gemini-2.5-Flash-Lite, and Gemini-2.5-Flash. Starting from the manually verified pool of 68k questions, we filter the dataset sequentially. Each model is evaluated in a zero-shot setting, and questions answered correctly by any of the models are removed from the candidate pool. These adversarial filters also help reduce contamination by removing questions likely memorized from the training data. Although this approach is post hoc, it is presently unavoidable, given the lack of reliable methods for identifying training-set inclusion, especially amid declining transparency around training data. The final KMMMU benchmark consists of 3,466 questions. Figure 2 shows representative KMMMU instances from multiple disciplines, illustrating the diversity of visual modalities, question formats, and Korean-specific content covered by the benchmark. KMMMU is named in reference to MMMU, reflecting its intended role as a Korean counterpart for expert-level multimodal evaluation in linguistically and culturally grounded settings.
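The sequential filter described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Solver` type, the record fields (`id`, `answer`, `hint`), the toy solvers, and the exact-match grading rule are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical solver type: takes a question record, returns a predicted answer.
Solver = Callable[[Dict[str, str]], str]


def adversarial_filter(pool: Iterable[Dict[str, str]],
                       solvers: List[Solver]) -> List[Dict[str, str]]:
    """Apply the solvers one after another, dropping every question
    that the current solver answers correctly (exact match, an assumption)."""
    remaining = list(pool)
    for solver in solvers:
        remaining = [q for q in remaining if solver(q) != q["answer"]]
    return remaining


# Toy stand-ins for real models, for illustration only.
def always_a(question: Dict[str, str]) -> str:
    return "A"


def echo_hint(question: Dict[str, str]) -> str:
    return question.get("hint", "")


pool = [
    {"id": "q1", "answer": "A", "hint": "B"},  # solved by always_a
    {"id": "q2", "answer": "B", "hint": "B"},  # solved by echo_hint
    {"id": "q3", "answer": "C", "hint": "D"},  # solved by neither
]
survivors = adversarial_filter(pool, [always_a, echo_hint])
survivor_ids = [q["id"] for q in survivors]
```

Filtering sequentially means later, presumably more expensive, models only need to be run on questions that earlier models failed. The surviving set is the same as if every model had been run on the full pool.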

![Figure 3: Discipline-wise visual modality composition of KMMMU. Stacked bars show the number of questions for each visual modality in each discipline, with total counts shown beneath the labels. Scatter points indicate Korean-specific items, overlaid with random jitter on the corresponding discipline–modality segments.](image3)

### 3.3 Taxonomy and Dataset Composition

KMMMU is designed to evaluate expert-level multimodal understanding across diverse domains. Each instance is annotated along four axes: discipline, visual modality, question format, and a Korean-specific flag. The Korean-specific flag identifies cases where the problem requires Korean-specific institutional or cultural knowledge beyond general world knowledge. All taxonomy labels are assigned using Gemini-2.5-Flash. To assess label quality, we manually audit 300 randomly sampled instances and verify all Korean-specific items. Figure 3 presents the discipline-wise distribution of visual modalities in KMMMU by absolute count. The stacked bars show the number of questions for each visual modality within each discipline, with the numbers beneath each label indicating the total number of instances. The overlaid scatter points denote Korean-specific items (randomly jittered) within their corresponding visual modality segments. These items are particularly concentrated in institutionally grounded domains such as Business & Public (76) and Law & Ethics (82). Across disciplines, Engineering (Egnr) accounts for the largest share of the dataset, and diagrams are the most common visual modality. Text/Code & Documents also appears frequently, especially in Business, Law, and Social Science domains.
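To make the four annotation axes concrete, the sketch below models an annotated item and slices accuracy along any axis. The `Item` class, its field names, the `correct` field, and the toy records are hypothetical and are not taken from the released dataset schema. The gap is computed here as non-Korean-specific accuracy minus Korean-specific accuracy, which is an assumption about the direction the paper reports.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Hashable, List


@dataclass(frozen=True)
class Item:
    discipline: str
    visual_modality: str
    question_format: str
    korean_specific: bool
    correct: bool  # hypothetical grading outcome for one model, not a dataset field


def accuracy_by(items: List[Item], axis: str) -> Dict[Hashable, float]:
    """Group items by the given annotation axis and return accuracy per group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for item in items:
        key = getattr(item, axis)
        totals[key][0] += int(item.correct)
        totals[key][1] += 1
    return {key: c / n for key, (c, n) in totals.items()}


# Toy records for illustration only; they do not reflect real results.
items = [
    Item("Engineering", "Diagram", "multiple-choice", False, True),
    Item("Engineering", "Diagram", "open-form", False, False),
    Item("Law & Ethics", "Text/Code & Documents", "multiple-choice", True, False),
    Item("Law & Ethics", "Text/Code & Documents", "multiple-choice", True, True),
    Item("Business & Public", "Text/Code & Documents", "multiple-choice", True, False),
]

by_flag = accuracy_by(items, "korean_specific")
gap = by_flag[False] - by_flag[True]  # assumed direction of the accuracy gap
by_discipline = accuracy_by(items, "discipline")
```

The same helper works for `visual_modality` and `question_format`, so one function covers every per-axis breakdown.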

### 3.4 Construction of the Hard Subset

To further analyze model limitations, we construct a **Hard subset** consisting of challenging instances. Specifically, this subset includes questions that are answered incorrectly by all three baseline models: Gemma-3-27B, Qwen3-VL-235B-Thinking, and GPT-5-nano. The Hard subset contains 627 questions, corresponding to 18% of the full KMMMU dataset (see Figure 11 for details).
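The selection rule can be written as a set operation over per-model grading results. The following is a rough sketch under assumed inputs: the `results` layout (model name mapped to per-question correctness), the placeholder model names, and the toy values are illustrative and not the authors' data. The last lines check the reported share from the counts stated in the text.

```python
from typing import Dict, List


def hard_subset(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return ids of questions that every model answered incorrectly.

    `results` maps a model name to {question id: answered correctly?}.
    Only questions graded for all models are considered.
    """
    models = list(results)
    common = set.intersection(*(set(results[m]) for m in models))
    return sorted(q for q in common if not any(results[m][q] for m in models))


# Toy grading outcomes with placeholder model names, for illustration only.
results = {
    "model_a": {"q1": True, "q2": False, "q3": False},
    "model_b": {"q1": False, "q2": False, "q3": True},
    "model_c": {"q1": False, "q2": False, "q3": False},
}
hard_ids = hard_subset(results)

# Share of the full benchmark, from the counts stated in the text.
hard_share_percent = 100 * 627 / 3466
```

With the stated counts, `hard_share_percent` rounds to 18.1, consistent with the 18% figure in the text.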

### 3.5 Does Adversarial Filtering Distort the Original Data Distribution?

To assess whether adversarial filtering affects benchmark representativeness, we compare the distributional alignment of the original dataset and filtered subsets. For this analysis, each item is represented using a text embedding obtained from multilingual-e5-large. The resulting embeddings are projected into a lower-dimensional manifold using PCA (n=50), followed by 3D UMAP. As shown in Figure 4, both the **Full KMMMU set** and the **Hard subset** largely preserve the broad geometric structure of the **original 68k-sample** distribution. To quantify these differences, we compute the Kullback–Leibler (KL) divergence
