KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
Summary
KoALa-Bench introduces a Korean-focused benchmark suite for evaluating large audio language models on six tasks, including novel measures of speech faithfulness and Korea-specific cultural content.
# KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
Source: [https://arxiv.org/html/2604.19782](https://arxiv.org/html/2604.19782)
Jinyoung Kim¹∗, Hyeongsoo Lim¹∗, Eunseo Seo¹∗, Minho Jang¹∗, Keunwoo Choi², Seungyoun Shin², Ji Won Yoon¹†
¹Department of Artificial Intelligence, Chung-Ang University, ²Upstage AI
{wlsdud338, andrew1001, jeo0534, sunbi8534, jiwonyoon}@cau.ac.kr, {keunwoo, logan}@upstage.ai
###### Abstract
Recent advances in large audio language models (LALMs) have enabled multilingual speech understanding. However, benchmarks for evaluating LALMs remain scarce for non-English languages, with Korean being one such underexplored case. In this paper, we introduce KoALa-Bench, a comprehensive benchmark for evaluating Korean speech understanding and speech faithfulness of LALMs. In particular, KoALa-Bench comprises six tasks. Four tasks evaluate fundamental speech understanding capabilities, including automatic speech recognition, speech translation, speech question answering, and speech instruction following, while the remaining two tasks evaluate speech faithfulness, motivated by our observation that several LALMs often fail to fully leverage the speech modality. Furthermore, to reflect Korea-specific knowledge, our benchmark incorporates listening questions from the Korean college scholastic ability test as well as content covering Korean cultural domains. We conduct extensive experiments across six models, including both white-box and black-box ones. Our benchmark, evaluation code, and leaderboard are publicly available at [https://github.com/scai-research/KoALa-Bench.git](https://github.com/scai-research/KoALa-Bench.git).
∗ Equal contribution. † Corresponding author.
## 1 Introduction
Recent advances in multimodal large language models (MLLMs) (qwen2-audio; tang2024salmonn; audiopalm; team2023gemini) have enabled models to process diverse modalities, including images, videos, and audio. Among these, speech is particularly important as it represents one of the most natural forms of human interaction. Accordingly, large audio language models (LALMs) (navercloudhyperclovaxteam2026hyperclovax8bomni; xu2025qwen3omnitechnicalreport; fang-etal-2025-llamaomni2; qwen2-audio), which integrate speech encoders with pretrained language models, have emerged as a promising paradigm for speech understanding and spoken interaction, increasingly supporting multilingual speech inputs.
As LALMs expand their multilingual capabilities, comprehensive evaluation frameworks for speech understanding across diverse languages become increasingly important. However, existing benchmarks remain predominantly English-centric, leaving languages such as Korean largely underexplored (wang-etal-2025-audiobench; adu-bench; air-bench). Meanwhile, existing Korean speech benchmarks are primarily designed for traditional speech processing tasks rather than for evaluating the speech understanding capabilities of LALMs (ksponspeech; clovacall; zeroth_korean). As a result, there is currently no standardized benchmark specifically designed to evaluate LALMs using Korean speech inputs. To address this limitation, we introduce the Korean Audio Language benchmark, called KoALa-Bench, a comprehensive benchmark for evaluating the Korean speech understanding capabilities of LALMs.
KoALa-Bench consists of six tasks: four for standard speech understanding and two newly proposed for evaluating speech faithfulness. The four standard tasks cover automatic speech recognition (ASR), speech translation (ST), speech question answering (SQA), and speech instruction following (SIF). For these tasks, we leverage existing Korean datasets as well as English datasets. We align the English datasets with Korean using translation and TTS synthesis, filtering out low-quality samples based on character error rate (CER) or human verification. In addition, we incorporate listening questions from the Korean College Scholastic Ability Test (KCSAT) as authentic long-form Korean speech samples, which are included in the SQA task.
The two proposed speech faithfulness tasks, speech context-aware question answering (SCA-QA) and position-aware question answering (PA-QA), are motivated by our observation that LALMs often fail to fully leverage speech input when generating responses. Specifically, we evaluate speech faithfulness from two perspectives. First, SCA-QA measures modality faithfulness by testing whether LALMs fully leverage the speech input or rely solely on the text input. By comparing model responses to a text question with and without a counterfactual speech context, we assess how faithfully LALMs utilize the given speech input. To further reflect Korea-specific knowledge, SCA-QA incorporates crawled Korean cultural domain data covering history, sports, and K-pop. Second, PA-QA assesses positional faithfulness with respect to the position of evidence within long-form speech inputs. To construct the dataset, we identify the location of each answer span in the speech and measure accuracy at each position. This enables a fine-grained analysis of how faithfully LALMs comprehend information across extended speech inputs.
We conduct extensive experiments across both white-box and black-box models, including Qwen3-Omni (xu2025qwen3omnitechnicalreport), Gemma-3n (gemma_3n_2025), GPT-audio (hurst2024gpt-4o), and Gemini-flash (comanici2025gemini-2.5-flash), with varying parameter scales, providing a comprehensive analysis of current LALMs on Korean speech understanding.
- We introduce KoALa-Bench, a comprehensive benchmark for evaluating the Korean speech understanding capabilities of LALMs. To the best of our knowledge, this is the first universal benchmark dedicated to evaluating Korean speech understanding and faithfulness of LALMs.
- We propose two novel tasks, SCA-QA and PA-QA, to evaluate the speech faithfulness of LALMs in terms of modality and position, examining whether models fully leverage the given speech input.
- To evaluate Korea-specific knowledge, we construct evaluation datasets from KCSAT and crawled cultural domain content covering K-history, K-sports, and K-pop.
(a) High-level overview of standard speech understanding tasks in KoALa-Bench.
(b) Construction process of the SCA-QA dataset.
(c) The evidence-annotated dataset generation pipeline for PA-QA.
Figure 1: Overview of the KoALa-Bench pipeline. (a) The fundamental pipeline, which synthesizes audio for text-only datasets after translating English text into Korean via Gemma. (b) The SCA-QA construction process, which generates paired questions (factually aligned vs. intentionally conflicting) to test whether models rely on acoustic context rather than parametric knowledge. (c) The PA-QA dataset generation pipeline, which maps supporting evidence to one of four temporal segments to enable position-aware performance analysis.
## 2 Related Works
##### Audio Benchmarks.
Recent advances in LALMs have introduced architectures that integrate speech encoders with pretrained LLMs, enabling LLMs to process speech inputs (navercloudhyperclovaxteam2026hyperclovax8bomni; fang-etal-2025-llamaomni2; xu2025qwen3omnitechnicalreport; qwen2-audio). As LALMs expand their audio understanding capabilities, various benchmarks have been proposed to evaluate their performance on audio-conditioned tasks. AIR-Bench (air-bench) evaluates audio understanding and instruction following across diverse audio inputs, while AudioBench (wang-etal-2025-audiobench) provides a comprehensive evaluation suite covering multiple audio tasks. More recent benchmarks explore specialized abilities, including audio dialogue understanding and expert-level reasoning (adu-bench; MMAU).
Despite these efforts, existing benchmarks remain largely English-centric, limiting their ability to evaluate the speech understanding capabilities of multilingual LALMs, particularly for languages such as Korean. To address this gap, we propose KoALa-Bench, a unified benchmark designed to evaluate Korean speech understanding and faithfulness of LALMs.
##### Faithfulness of Multimodal Language Models.
Recent multimodal language models (MLLMs) (longcatflashomnitechnicalreport; liu2025voxtral; tang2024salmonn; audiopalm) have demonstrated strong capabilities in integrating information from multiple modalities such as text, images, and audio. However, prior studies report that these models often generate responses that are inconsistent with the provided inputs, relying instead on parametric knowledge stored in the language model (bai2024hallucination; faith1_1; faith1_2). This phenomenon, commonly referred to as multimodal hallucination, reflects a lack of faithfulness to the given modality. Faithfulness also becomes challenging when models process long speech inputs, as prior studies report that multimodal models often struggle with long-context audio understanding (faith2_1; faith2_2). Despite these observations, the evaluation of faithfulness to speech inputs remains largely underexplored. To facilitate a systematic analysis of this capability, we introduce two complementary evaluation tasks, SCA-QA and PA-QA, designed to assess how faithfully LALMs utilize the given speech input.
## 3 KoALa-Bench
##### Overview of KoALa-Bench.
We introduce KoALa-Bench, a benchmark designed to assess the speech understanding and speech faithfulness of Korean LALMs. Existing evaluation frameworks primarily focus on generic speech understanding and speech-based QA, with limited consideration of Korean cultural and linguistic characteristics. In addition, prior benchmarks lack evaluation protocols for long-duration speech robustness and speech context faithfulness, where speech context faithfulness evaluates whether model predictions are grounded in the acoustic context of the input speech or derived from parametric knowledge. To address these limitations, KoALa-Bench evaluates Korean LALMs across six task dimensions: ASR; ST (English-to-Korean); SQA; SIF; SCA-QA, which measures whether models utilize both speech and text information; and PA-QA, which assesses long-duration speech robustness. Furthermore, to evaluate model robustness under noisy conditions, we additionally construct noise-augmented versions of several datasets by injecting noise into the speech inputs.
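The noise augmentation mentioned above can be illustrated with a minimal sketch. The overview does not specify the noise type or level, so the Gaussian noise source, the 10 dB target, and the function names `mix_at_snr` and `measured_snr_db` are all assumptions made only for illustration.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        return list(speech)
    # Solve speech_power / (gain^2 * noise_power) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """Recover the SNR of a mixture given the clean signal (sanity check only)."""
    residual = [y - s for s, y in zip(speech, noisy)]
    speech_energy = sum(s * s for s in speech)
    noise_energy = sum(r * r for r in residual)
    return 10 * math.log10(speech_energy / noise_energy)


# Hypothetical one-second example at 16 kHz: a sine tone stands in for speech.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Fixing the signal-to-noise ratio rather than the raw noise amplitude keeps the perturbation comparable across utterances of different loudness.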
### 3.1 Automatic Speech Recognition (ASR)
##### Methods.
The ASR task evaluates a model's ability to convert spoken utterances into textual transcriptions, serving as a fundamental measure of speech recognition performance. For evaluation, we provide speech inputs to each LALM and prompt it to generate the corresponding Korean transcriptions, which are then compared with ground-truth references.
##### Datasets.
KsponSpeech (ksponspeech) provides spontaneous Korean conversational speech covering diverse everyday topics. Common Voice Korean (commonvoice) and Zeroth-Korean (https://openslr.org/40/) contain speech recorded under diverse acoustic conditions as well as relatively clean read speech, enabling evaluation across different speaking styles and recording environments.
##### Metrics.
ASR performance is evaluated using character error rate (CER), which measures the edit distance between predicted transcriptions and ground-truth references at the character level. Before computing CER, we apply a normalization pipeline that removes punctuation and collapses whitespace.
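As a rough illustration of the metric described above, the following sketch computes CER as the Levenshtein distance divided by the reference length, after removing punctuation and collapsing whitespace. The exact normalization rules of the benchmark are not given here, so treating every Unicode punctuation category as removable and counting the remaining spaces as characters are assumptions, as are the function names.

```python
import re
import unicodedata


def normalize(text):
    """Remove Unicode punctuation and collapse runs of whitespace (assumed rules)."""
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", kept).strip()


def edit_distance(ref, hyp):
    """Character-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by normalized reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


print(cer("안녕하세요, 여러분!", "안녕하세요  여러분"))  # differs only in punctuation and spacing
print(cer("가나다라", "가나라"))                        # one deleted character out of four
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many insertions.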
### 3.2 Speech Translation (ST)
##### Methods.
Korean LALMs should process both English and Korean. Accordingly, KoALa-Bench includes a cross-lingual speech translation task that translates English speech into Korean text. The task evaluates whether the generated output preserves the semantic content of the source speech while producing fluent Korean text. Evaluation is conducted using automatic similarity measures that assess both lexical overlap and semantic consistency between the generated output and the reference translation.
##### Datasets.
We use the ETRI English–Korean Speech Translation Corpus (https://epretx.etri.re.kr), built from TED Talks following the MuST-C (https://ict.fbk.eu/must-c/) data construction methodology. It provides English speech segments paired with Korean translations for evaluation.
##### Metrics.
We evaluate translation quality using BLEU (bleu), METEOR (meteor), and BERTScore (Zhang*2020BERTScore:). BLEU and METEOR measure n-gram overlap between the generated translation and the reference, while BERTScore evaluates semantic similarity using contextual embeddings.
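To make the notion of n-gram overlap concrete, the sketch below computes clipped (modified) n-gram precision, the building block of BLEU. It is not a full BLEU implementation: the brevity penalty and the geometric mean over n-gram orders are omitted, and whitespace tokenization is an assumption that a real Korean evaluation would likely replace with a dedicated tokenizer.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(reference, hypothesis, n):
    """Fraction of hypothesis n-grams found in the reference, with counts clipped."""
    ref_counts = ngram_counts(reference.split(), n)
    hyp_counts = ngram_counts(hypothesis.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "the cat sat on the mat"
hypothesis = "the the the cat"
print(clipped_precision(reference, hypothesis, 1))  # 0.75
print(clipped_precision(reference, hypothesis, 2))  # one of three bigrams matches
```

Clipping is what prevents a hypothesis from scoring well by repeating a single frequent word.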
### 3.3 Speech Question Answering (SQA) & Speech Instruction Following (SIF)
##### Methods.
The speech understanding task evaluates the model's ability to answer questions based on acoustic input. In this setting, speech segments are provided as context, which may consist of either natural recordings or speech synthesized using the Qwen3-TTS model (hu2026qwen3ttstechnicalreport). The model is required to generate responses that correctly follow the textual instruction based on the given speech content. This task evaluates the model's ability to execute text-based instructions conditioned on the provided acoustic context.
#### 3.3.1 SQA
##### Short-Form Datasets.
For the SQA task, we evaluate models on two Korean QA benchmarks: CLIcK and KoBEST-BoolQ. CLIcK is a Korean-native multiple-choice QA benchmark sourced from official exams and textbooks, covering 11 subcategories across language and culture domains. We use the provided evaluation split. KoBEST-BoolQ is the Boolean QA subset of KoBEST, a Korean NLU benchmark built by professional linguists. Each instance consists of a Korean Wikipedia passage and a yes/no question. We use the test subset with binary answers.
##### Long-Form Datasets.
We construct LSQA from the KCSAT listening section collected from EBSi (https://www.ebsi.co.kr/ebs/pot/poti/main.ebs) to evaluate long-form speech understanding. LSQA consists of speech passages paired with multiple-choice questions. As the original recordings contain multiple questions in a single long audio file, we segment the audio into question-level samples to construct individual instances.
##### Metrics.
For both benchmarks, we evaluate model performance under two scoring approaches. In the logit-based approach, the log-probability of each candidate token is computed, and the choice with the highest log-probability is selected as the model's prediction. In the generation-based approach, the predicted choice is extracted from the generated response via string matching. Accuracy is the final evaluation metric under both approaches.
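The two scoring approaches can be contrasted with a small sketch. The choice labels, the log-probability values, and the regular expression used for string matching are hypothetical; the actual evaluation code may extract answers differently.

```python
import re


def predict_from_logprobs(choice_logprobs):
    """Logit-based scoring: pick the choice label with the highest log-probability."""
    return max(choice_logprobs, key=choice_logprobs.get)


def predict_from_generation(response, labels):
    """Generation-based scoring: return the first standalone choice label, if any."""
    alternatives = "|".join(re.escape(label) for label in labels)
    # The label must not be glued to other ASCII letters or digits on either side.
    pattern = r"(?<![A-Za-z0-9])(" + alternatives + r")(?![A-Za-z0-9])"
    match = re.search(pattern, response)
    return match.group(1) if match else None


def accuracy(predictions, answers):
    """Share of predictions that equal the gold answer; unparsed outputs count as wrong."""
    correct = sum(pred == gold for pred, gold in zip(predictions, answers))
    return correct / len(answers)


labels = ["A", "B", "C", "D"]
print(predict_from_logprobs({"A": -2.3, "B": -0.4, "C": -3.1, "D": -1.9}))  # B
print(predict_from_generation("정답은 C입니다.", labels))                    # C
print(predict_from_generation("잘 모르겠습니다.", labels))                   # None
```

The logit-based route always yields a prediction, whereas the generation-based route can fail to parse a free-form answer, so the two accuracies are not guaranteed to agree.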
#### 3.3.2 SIF
##### Datasets.
Instructions are provided as audio and the model generates text responses. To construct Korean instruction-following data, we translate existing English datasets, namely Alpaca (alpaca), Vicuna (vicuna2023), and OpenHermes (https://huggingface.co/datasets/teknium/openhermes), and synthesize them into audio. We additionally synthesize instructions from KUDGE (kudge).
##### Metrics.
Instruction-following performance is evaluated using GPT-4o as a judge model, which assesses whether the generated response correctly follows the instruction.
### 3.4 Speech Context-Aware Question Answering (SCA-QA)
##### Methods.
Motivated by our observation, illustrated by the case study in the Appendix, that models often ignore speech context, we introduce SCA-QA. The task constructs paired questions inspired by the methodology of DisentQA (disentqa) to assess whether responses rely on acoustic context or parametric knowledge. To construct the dataset, we crawled documents using keywords related to K-pop, K-history, and K-sports and generated question–answer pairs using the GPT API. Key answer entities were then replaced to create intentionally incorrect answers conflicting with world knowledge, yielding paired questions for the same speech context. All samples were verified by human annotators. SCA-QA thus consists of paired questions per speech context, one aligned with world knowledge and the other intentionally conflicting, enabling analysis of whether responses rely on acoustic context or parametric knowledge. The full construction pipeline and prompts used at each stage are provided in the Appendix.
##### Datasets.
SCA-QA is constructed from speech contexts derived from documents crawled using keywords related to K-pop, K-history, and K-sports. Each sample contains paired questions generated via the GPT API and verified by human annotators, where one aligns with common world knowledge and the other intentionally contradicts it through entity replacement.
##### Metrics.
We evaluate model responses using Speech Context Faithfulness (SCF), which measures whether predictions rely on the provided speech context rather than parametric knowledge. For each sample, we first identify cases where the model answers the textual question correctly, and then evaluate the same question with the corresponding speech context. If the model changes its answer according to the speech content when the contextual information conflicts, the response is considered context-faithful. The final score is computed as the proportion of such context-faithful samples. We provide the formal definition in the Appendix.
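One possible reading of the SCF procedure above is sketched below. The formal definition lives in the Appendix and is not reproduced here, so the choice of denominator (only samples the model answers correctly from text alone), exact string matching, and the `Sample` fields are assumptions; the entity names are placeholders, not real data.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    world_answer: str      # answer consistent with world knowledge
    context_answer: str    # counterfactual answer stated in the speech context
    text_only_pred: str    # model prediction for the text question alone
    with_speech_pred: str  # model prediction when the conflicting speech is provided


def scf_score(samples):
    """Proportion of text-correct samples whose answer follows the speech context."""
    # Step 1: keep samples the model already answers correctly without speech.
    eligible = [s for s in samples if s.text_only_pred == s.world_answer]
    if not eligible:
        return 0.0
    # Step 2: a sample is context-faithful if the prediction switches to the
    # counterfactual answer carried by the speech.
    faithful = sum(s.with_speech_pred == s.context_answer for s in eligible)
    return faithful / len(eligible)


samples = [
    Sample("entity_A", "entity_B", "entity_A", "entity_B"),  # follows the speech
    Sample("entity_C", "entity_D", "entity_C", "entity_C"),  # sticks to parametric knowledge
    Sample("entity_E", "entity_F", "entity_X", "entity_F"),  # excluded: wrong from text alone
]
print(scf_score(samples))  # 0.5
```

Restricting the denominator to text-correct samples separates "the model ignored the speech" from "the model never knew the answer", which is the distinction the task is meant to expose.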
| Task | Dataset | Metric | Noise | # Samples | Hours (h) | Avg length (s) | Min length (s) | Max length (s) |
|---|---|---|---|---|---|---|---|---|
| ASR | Zeroth | CER ↓ | ✓ | 457 | 1.191 | 9.381 | 4.558 | 20.453 |
| ASR | Common Voice | CER ↓ | ✓ | 523 | 0.726 | 4.998 | 1.728 | 12.564 |
| ASR | KsponSpeech-Clean | CER ↓ | ✗ | 3000 | 2.644 | 3.173 | 0.286 | 19.357 |
| ASR | KsponSpeech-Other | CER ↓ | ✗ | 3000 | 3.803 | 4.563 | 0.399 | 20.093 |
| ST | ETRI-TST-Common | BLEU / METEOR / BERTScore ↑ | ✗ | 2532 | 4.877 | 650.251 | 210.233 | 1417.741 |
| ST | ETRI-TST-HE | BLEU / METEOR / BERTScore ↑ | ✗ | 544 | 2.282 | 746.788 | 551.358 | 1117.391 |
| SQA (Short-Form) | CLIcK | Accuracy ↑ | ✓ | 1315 | 4.180 | 11.444 | 1.600 | 275.402 |
| SQA (Short-Form) | KoBest BoolQ | Accuracy ↑ | ✓ | 1404 | 11.950 | 30.642 | 4.640 | 88.308 |
| SQA (Long-Form) | KCSAT | Accuracy ↑ | ✓ | 82 | 2.919 | 128.182 | 62.976 | 208.020 |
| SIF | KUDGE | Score (GPT as Judge) ↑ | ✓ | 557 | 1.863 | 10.365 | 3.440 | 23.497 |
| SIF | Vicuna | Score (GPT as Judge) ↑ | ✓ | 70 | 0.192 | 8.623 | 2.640 | 16.297 |
| SIF | OpenHermes | Score (GPT as Judge) ↑ | ✓ | 78 | 0.174 | 6.450 | 1.680 | 18.617 |
| SIF | Alpaca | Score (GPT as Judge) ↑ | ✓ | 69 | 0.111 | 4.375 | 1.760 | 9.017 |
| SCA-QA | K-history (Before Chosun) | SCF Score ↑ | ✓ | 101 | 0.815 | 29.053 | 11.280 | 61.280 |
| SCA-QA | K-history (After Chosun) | SCF Score ↑ | ✓ | 82 | 0.997 | 43.780 | 3.520 | 100.000 |
| SCA-QA | K-sports | SCF Score ↑ | ✓ | 88 | 1.698 | 69.488 | 20.400 | 92.640 |
| SCA-QA | K-pop | SCF Score ↑ | ✓ | 103 | 1.457 | 50.926 | 8.000 | 82.400 |
| PA-QA | MCTest | Accuracy ↑ | ✓ | 325 | 8.732 | 96.142 | 38.400 | 205.120 |

Table 1: Datasets used in the proposed KoALa-Bench.
### 3.5 Position-Aware Question Answering (PA-QA)
##### Methods.
Existing speech benchmarks largely rely on short utterances and lack supporting evidence annotations for each QA pair, making it difficult to analyze which part of the speech context a question depends on. To address this limitation, KoALa-Bench introduces a position-aware evaluation setup by identifying supporting evidence sentences for each question. We construct an evidence annotation pipeline using GPT-4.1-nano to detect supporting sentences and map them to their relative positions within the speech context. Each context is normalized to the range [0,1] and divided into four segments: front (0–0.25), front-middle (0.25–0.5), middle-late (0.5–0.75), and late (0.75–1.0). This dataset enables position-aware analysis of model performance across long speech inputs. The full evidence grounding procedure is described in the Appendix.
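The mapping from an evidence location to one of the four segments can be sketched as follows. How the benchmark derives the normalized position is described only in the Appendix, so taking the midpoint of the evidence span and assigning boundary values (0.25, 0.5, 0.75) to the later segment are assumptions, as are the function names.

```python
SEGMENTS = ("front", "front-middle", "middle-late", "late")


def normalized_position(evidence_start, evidence_end, total_length):
    """Midpoint of the evidence span as a fraction of the context length (assumed rule)."""
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    return ((evidence_start + evidence_end) / 2) / total_length


def segment_of(position):
    """Map a position in [0, 1] to one of four equal-width segments."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie in [0, 1]")
    # Half-open bins [0, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, 1.0]; 1.0 falls in "late".
    index = min(int(position * 4), len(SEGMENTS) - 1)
    return SEGMENTS[index]


# Hypothetical evidence sentence spanning seconds 30 to 50 of a 100-second context.
position = normalized_position(30, 50, 100)
print(position, segment_of(position))  # 0.4 front-middle
```

The same function works whether positions are measured in seconds, characters, or sentence indices, as long as the span and the total use the same unit.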
##### Datasets.
We construct the dataset from MCTest (richardson-etal-2013-mctest), a machine comprehension dataset of short stories and multiple-choice questions. Speech contexts are synthesized from textual passages using Qwen3-TTS (hu2026qwen3ttstechnicalreport), and supporting evidence sentences are identified with GPT-4.1-nano. After filtering invalid samples, the final dataset contains 327 instances.
##### Metrics.
We follow the same evaluation protocol as in SQA and report accuracy. Each question is associated with a normalized position in the speech context, enabling position-aware analysis.
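Position-aware analysis then amounts to grouping per-question correctness by segment. A minimal sketch, assuming each evaluated question is available as a hypothetical `(segment, is_correct)` pair:

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """Compute accuracy separately for each evidence segment."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for segment, is_correct in records:
        totals[segment] += 1
        hits[segment] += int(is_correct)
    return {segment: hits[segment] / totals[segment] for segment in totals}


# Hypothetical evaluation records, not results from the paper.
records = [
    ("front", True),
    ("front", False),
    ("middle-late", True),
    ("late", True),
    ("late", True),
]
print(accuracy_by_segment(records))
# {'front': 0.5, 'middle-late': 1.0, 'late': 1.0}
```

Segments with no questions are simply absent from the result, which avoids reporting an accuracy for an empty bucket.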
## 4 Experiments