MUSCAT: MUltilingual, SCientific ConversATion Benchmark

arXiv cs.CL Papers

Summary

MUSCAT is a new multilingual, scientific conversation benchmark dataset for evaluating ASR systems on challenging multilingual scenarios including code-switching, domain-specific vocabulary, and mixed language input. The dataset consists of bilingual discussions on scientific papers between speakers using different languages, with results showing current state-of-the-art systems struggle with these multilingual challenges.

arXiv:2604.15929v1 Announce Type: new

Abstract: The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: handling mixed multilingual input, domain-specific vocabulary, and code-switching. However, there is currently no dataset benchmarking this situation. We propose a new benchmark to evaluate current Automatic Speech Recognition (ASR) systems and their ability to handle these challenges. The benchmark consists of bilingual discussions on scientific papers between multiple speakers, each conversing in a different language. We provide a standard evaluation framework beyond Word Error Rate (WER) enabling consistent comparison of ASR performance across languages. Experimental results demonstrate that the proposed dataset remains an open challenge for state-of-the-art ASR systems. The dataset is available at https://huggingface.co/datasets/goodpiku/muscat-eval

Keywords: multilingual, speech recognition, audio segmentation, speaker diarization


# MUSCAT: MUltilingual, SCientific ConversATion Benchmark

Source: https://arxiv.org/html/2604.15929

###### Abstract

The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: handling mixed multilingual input, specific vocabulary, and code-switching. However, there is currently no dataset benchmarking this situation. We propose a new benchmark to evaluate current Automatic Speech Recognition (ASR) systems and their ability to handle these challenges. The benchmark consists of bilingual discussions on scientific papers between multiple speakers, each conversing in a different language. We provide a standard evaluation framework beyond Word Error Rate (WER) enabling consistent comparison of ASR performance across languages. Experimental results demonstrate that the proposed dataset is still an open challenge for state-of-the-art ASR systems. The dataset is available at https://huggingface.co/datasets/goodpiku/muscat-eval

Keywords: multilingual, speech recognition, audio segmentation, speaker diarization


Supriti Sinhamahapatra¹, Thai-Binh Nguyen¹, Yiğit Oğuz¹, Enes Ugan¹, Jan Niehues¹, Alexander Waibel¹,²

¹Karlsruhe Institute of Technology, ²Carnegie Mellon University

{supriti.sinhamahapatra, thai-binh.nguyen, enes.ugan, jan.niehues}@kit.edu

## 1. Introduction

Seamless communication across language boundaries is a long-standing dream of humankind. The ultimate goal is a natural, multilingual conversation in which each participant speaks in their preferred language and understands all the others. While significant progress has been made in multilingual speech recognition in both high-resource settings (Barrault et al. 2023; Liu et al. 2023) and low-resource settings (Robinson et al. 2025; Li et al. 2025), to the best of our knowledge there is currently no realistic benchmark for evaluating systems in multilingual dialogue scenarios.

To address the growing need for realistic and high-quality multilingual datasets, we present a unique collection of audio recordings designed as a benchmark for automatic speech recognition. To build strong systems for this benchmark, several challenges need to be addressed: multilingual speech input, speaker segmentation, audio conditions, domain-specific vocabulary, and code-switching.

To collect the data, we set up a discussion between two bilingual speakers about scientific publications. Each speaker speaks only one language but understands both languages. While English is the dominant language for scientific communication globally, we aim for AI-based solutions that can facilitate scenarios where researchers can communicate scientific content in their native language without compromise. To study this, we created a controlled simulation using bilingual speakers.

Our evaluation shows limitations in current AI systems. The technology is not yet robust enough to support multilingual communication in scientific domains. In Figure 1, the upper part illustrates an example scenario of the MUSCAT dataset creation process, in which spontaneous conversations are recorded between two speakers, one speaking English and the other German. The lower part highlights a key challenge faced by state-of-the-art (SOTA) ASR models: their difficulty in accurately detecting language switches in spontaneous multilingual speech. In such cases, the model often either translates the utterance into the language of the preceding context or fails to transcribe certain segments altogether. This leads to an oracle setup for speech translation technology.

**Figure 1:** An example illustrating the creation of MUSCAT (upper part of the figure) and the challenges its multilingual diversity poses for state-of-the-art ASR systems (lower part of the figure). The ASR is unable to accurately detect the language switches in a spontaneous conversation denoted by red in the transcript. The blue dashed lines (----) represent the part of the conversation that ASR fails to transcribe.

The main contributions of this paper are:

- A detailed analysis of the challenges posed by the proposed benchmark
- State-of-the-art baseline results that highlight the difficulties of the task

## 2. Data Collection

We aim to build a high-quality multilingual dataset. To achieve this, we first create a conversation setup where the challenges of multilingual, scientific conversations are highlighted. Next, we design a recording setup that allows us to investigate the different challenges individually.

### 2.1. Conversation Setup

Each instance in the dataset contains a conversation between a pair of speakers discussing a scientific paper. The speakers possess prior knowledge of the paper being discussed. We create the oracle situation for speech translation in this scenario by having the speakers converse in two different languages. To carry out the conversation, the speakers need to be fluent in both languages. In our case, the speakers engaging in natural conversations are each fluent in English at C1 level and are native speakers of the other language.

This unique setup allows for meaningful exchanges where speakers fully comprehend one another but respond solely in one of the two languages. These conversations offer paired speech² and transcripts for language pairs such as English-German, English-Vietnamese, English-Chinese, and English-Turkish.

² All participants gave their consent for their voices to be recorded and used for research purposes

### 2.2. Recording Setup

The challenges associated with ASR vary depending on the recording environment. To evaluate different conditions jointly, we synchronously record the conversations with three different devices. The first device is the Meeting Owl 3, a popular video conferencing system that captures 360° video and audio. The second device, the ReSpeaker USB Microphone, is a compact array microphone designed for high-fidelity, multi-directional audio capture. The third device, the Aria smart glasses by Meta, is a wearable that records first-person audio along with audio from the speaker's environment.

The Meeting Owl 3, referred to simply as OWL in the rest of the paper, was connected to a laptop via USB, and recordings were made using OBS Studio 3. The ReSpeaker USB Microphone Array was paired with a Raspberry Pi 3, also connected via USB, to provide an additional audio source. For brevity, we refer to this setup as Pi for the remainder of the paper. Finally, the Aria glasses by Meta were worn by one of the speakers and used to record the entire conversation from their perspective. Since Aria can be worn by only one person during a recording, we randomly selected one speaker per recording. Across the six recordings, this results in three German, one Chinese, one Vietnamese, and one English speaker wearing Aria. The other recording devices were placed approximately midway between the speakers, equidistant from each.

The recordings are made at a sampling rate of 44.1 kHz using the microphones described above. After data collection, the recordings from all devices were manually aligned using Audacity so that they are synchronized. Combining these devices with careful alignment lets the dataset capture the same conversation from several audio perspectives. All software used was the latest version available at the time of recording. To minimize interference from external sound, a secluded room was used for the recordings.

## 3. Human Annotation

We annotate the collected data to be used as a benchmark for state-of-the-art ASR systems. In a first step, we perform manual segmentation of the audio recordings which serves as the oracle to evaluate and compare two automatic segmentation approaches. Next, we create the multilingual transcripts of the audio.

### 3.1. Manual Segmentation

We perform the manual segmentation guided by two constraints. First, since each of our recordings consists of conversations in two different languages, we prioritize language-specific segmentation. This ensures that each segment comprises recordings in a single language. Second, to support optimal model performance, we limit each segment to a maximum duration of 30 seconds. The manual segmentation process is conducted using Label Studio (2023), an open-source data annotation tool.
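The two constraints above (one language per segment, at most 30 seconds each) can be expressed as a small check. The following is a minimal sketch, not the authors' tooling: the segmentation in the paper was done by hand in Label Studio, and the `Segment` type, the language tags, and the equal-length splitting rule are all assumptions made for illustration. A human annotator would normally cut at pauses rather than at equal intervals.

```python
import math
from dataclasses import dataclass
from typing import List

MAX_SEGMENT_SECONDS = 30.0  # duration cap stated in the text


@dataclass(frozen=True)
class Segment:
    """Hypothetical single-language span of a recording."""
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    lang: str     # illustrative language tag, e.g. "en" or "de"


def split_long_segment(seg: Segment, max_len: float = MAX_SEGMENT_SECONDS) -> List[Segment]:
    """Split one segment into equal pieces that each respect the duration cap."""
    duration = seg.end - seg.start
    if duration <= max_len:
        return [seg]
    # Smallest number of equal pieces whose length does not exceed the cap.
    n_pieces = math.ceil(duration / max_len)
    step = duration / n_pieces
    pieces = []
    for i in range(n_pieces):
        piece_start = seg.start + i * step
        # Pin the final boundary to the original end to avoid float drift.
        piece_end = seg.end if i == n_pieces - 1 else seg.start + (i + 1) * step
        pieces.append(Segment(piece_start, piece_end, seg.lang))
    return pieces


def too_long(segments: List[Segment], max_len: float = MAX_SEGMENT_SECONDS) -> List[Segment]:
    """Return the segments that violate the duration cap."""
    return [s for s in segments if s.end - s.start > max_len]
```

Because every piece inherits the language tag of its parent, splitting never mixes languages within a segment; only the duration constraint needs to be re-checked afterwards.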

### 3.2. Human Post-editing

Starting from the derived manual segments, we follow a two-step process to obtain the human transcript of our dataset. Firstly, to ease manual transcription of the recordings, we use a state-of-the-art ASR model, Whisper (Radford et al. 2023), for automatic transcription of the language-specific segments. As a second step, the respective speaker manually corrects any mistakes made by the ASR model, ensuring high-quality transcription. This strategy was adopted to overcome the challenge of finding external annotators who possessed both fluency in the specific languages and familiarity with the complex scientific discourse discussed in the papers. Using the speakers for such tasks ensured that technical terms and domain-specific context were annotated accurately.

During the annotation process, we frequently observe instances of code-switching within the recorded speech. Since mixed-language utterances are known to pose particular challenges for current speech recognition systems (Klejch et al. 2021; Hamed et al. 2022; Ugan et al. 2025b), annotators were additionally instructed to mark all words belonging to the embedded language whenever code-switching occurred.
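Once embedded-language words are marked, simple code-switching statistics follow directly from the transcript. The sketch below assumes a hypothetical `<cs>...</cs>` inline markup; the text does not specify the annotation format actually used in MUSCAT, and the example sentence is invented for illustration.

```python
import re
from typing import Tuple

# Hypothetical markup: embedded-language words are wrapped in <cs>...</cs>.
CS_SPAN = re.compile(r"<cs>(.*?)</cs>")


def code_switch_stats(transcript: str) -> Tuple[int, int]:
    """Return (embedded_word_count, total_word_count) for one transcript."""
    embedded = sum(len(span.split()) for span in CS_SPAN.findall(transcript))
    # Strip the tags but keep the words so they count towards the total.
    plain = CS_SPAN.sub(r"\1", transcript)
    return embedded, len(plain.split())


# Invented German utterance with English technical terms marked as embedded.
example = "Wir haben das <cs>fine-tuning</cs> mit einer <cs>learning rate</cs> von 0.001 gemacht"
embedded_words, total_words = code_switch_stats(example)
print(f"{embedded_words} of {total_words} words are embedded-language words")
```

Word counts here rely on whitespace splitting, which is not meaningful for languages written without spaces such as Chinese; a character-level count would be the usual substitute there.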

## 4. MUSCAT Dataset

The MUSCAT dataset consists of multilingual conversations from six recordings involving eleven speakers. Each recording is a conversation between a pair of speakers, and one speaker appears in two recordings. All six recordings have at least one English speaker, while the other speaker uses German, Turkish, Chinese, or Vietnamese. Of the six recordings, three are between English and German speakers, and the other three pair English with each of the remaining languages. We maintain gender diversity among the speakers: of the eleven speakers, six are male and five are female.

To evaluate the different challenges of this benchmark, we provide six variations of the dataset. First, each conversation is available as three recordings made with different devices. Second, for each conversation, both segmented and unsegmented versions of the audio recording are available.
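The six variations are the cross product of recording device and segmentation mode. A minimal sketch of that enumeration follows; the device labels come from the recording setup described earlier, while the string labels for the two segmentation modes are illustrative names rather than identifiers from the released dataset.

```python
from itertools import product

DEVICES = ("OWL", "Pi", "Aria")              # the three recording devices
SEGMENTATION = ("segmented", "unsegmented")  # illustrative mode labels

# Every device is paired with every segmentation mode: 3 x 2 variants.
variants = list(product(DEVICES, SEGMENTATION))

for device, mode in variants:
    print(f"{device:>4} / {mode}")
```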

Table 1 summarizes the main aspects of the dataset. The total duration of our dataset is approximately 65 minutes, of which the English-Vietnamese conversation comprises 17 minutes, whereas English-Chinese, English-Turkish, and English-German conversations comprise 15, 12, and 21 minutes, respectively. Words spoken by each speaker are attributed to the word count of their respective language, which totals 9,066 words.

**Table 1: Dataset Statistics**

| Recording | Languages | Total Duration | Word Counts |
|-----------|-----------|-----------------|------------|
| Recording 1 | English | 4.69 mins | 463 |
| | German | 1.92 mins | 288 |
| Recording 2 | English | 1.39 mins | 162 |
| | German | 2.74 mins | 427 |
| Recording 3 | English | 7.51 mins | 1,344 |
| | Turkish | 3.94 mins | 447 |
| Recording 4 | English | 11.90 mins | 1,362 |
| | Chinese | 2.79 mins | 623 |
| Recording 5 | English | 7.47 mins | 972 |
| | German | 3.00 mins | 426 |
| Recording 6 | English | 10.04 mins | 1,489 |
| | Vietnamese | 6.83 mins | 1,063 |
| **Total** | | **64.22 mins** | **9,066** |
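The totals in Table 1 can be re-derived from its rows. The short sketch below copies the per-speaker durations and word counts from the table and aggregates them overall and per language pair; the variable names are illustrative.

```python
from collections import defaultdict

# (recording, language, duration in minutes, word count), copied from Table 1.
TABLE_1 = [
    ("Recording 1", "English", 4.69, 463),
    ("Recording 1", "German", 1.92, 288),
    ("Recording 2", "English", 1.39, 162),
    ("Recording 2", "German", 2.74, 427),
    ("Recording 3", "English", 7.51, 1344),
    ("Recording 3", "Turkish", 3.94, 447),
    ("Recording 4", "English", 11.90, 1362),
    ("Recording 4", "Chinese", 2.79, 623),
    ("Recording 5", "English", 7.47, 972),
    ("Recording 5", "German", 3.00, 426),
    ("Recording 6", "English", 10.04, 1489),
    ("Recording 6", "Vietnamese", 6.83, 1063),
]

total_minutes = sum(minutes for _, _, minutes, _ in TABLE_1)
total_words = sum(words for _, _, _, words in TABLE_1)

# Sum both speakers of each recording, then group recordings by language pair.
minutes_by_recording = defaultdict(float)
partner_language = {}
for recording, language, minutes, _ in TABLE_1:
    minutes_by_recording[recording] += minutes
    if language != "English":
        partner_language[recording] = language

minutes_by_pair = defaultdict(float)
for recording, minutes in minutes_by_recording.items():
    minutes_by_pair[f"English-{partner_language[recording]}"] += minutes

print(f"total: {total_minutes:.2f} mins, {total_words} words")
for pair, minutes in sorted(minutes_by_pair.items()):
    print(f"{pair}: {minutes:.2f} mins")
```

The rows sum to 64.22 minutes and 9,066 words, matching the table's total row. Per pair, they give 21.21 minutes for English-German, 16.87 for English-Vietnamese, 14.69 for English-Chinese, and 11.45 for English-Turkish.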

## 5. Baseline

This section outlines the baseline configuration adopted in our experiments, detailing the ASR models used and the segmentation strategies applied during pre-processing.

### 5.1. ASR Models

Our goal is to evaluate the performance of SOTA ASR models on the MUSCAT dataset. To this end, we employ four SOTA models (Whisper, SALMONN, Phi-4-Multimodal, and wav2vec2) and assess the quality of their generated transcriptions. These models represent diverse ASR paradigms, including encoder-decoder architectures (Whisper), multimodal large language models (SALMONN and Phi-4-Multimodal), and CTC-based systems (wav2vec2).
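Transcription quality is conventionally scored with Word Error Rate (WER), the metric the abstract names as the starting point for the evaluation framework. The sketch below is a plain textbook WER (word-level edit distance divided by reference length), not the paper's evaluation code; the whitespace tokenization and the example sentences are assumptions for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# Invented example: one substitution ("was" -> "is") and one deletion ("fine").
print(word_error_rate("the model was fine tuned", "the model is tuned"))
```

Whitespace tokenization does not apply to languages written without spaces, such as Chinese, where character-level error rates are commonly reported instead. Differences of this kind make raw WER hard to compare across languages.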

#### Whisper

Whisper is a transformer-based encoder-decoder model developed by OpenAI, primarily designed for automatic speech recognition (ASR) and speech translation tasks (Radford et al. 2023). It has been trained on approximately 680k hours of speech data collected from the internet. The model's encoder processes the input speech to generate audio features, which are then passed to the decoder. The decoder, using these audio features along with positional encodings, produces the corresponding transcription. Whisper also incorporates a set of context tokens that guide the model by specifying the language, the task to be performed, and the start and end points of the transcription.
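The context tokens described above form a short prefix that the decoder is conditioned on. The sketch below builds that prefix as a list of Whisper's special-token strings without calling any library; the helper name `whisper_decoder_prefix` is hypothetical and only illustrates the token order.

```python
from typing import List


def whisper_decoder_prefix(language: str, task: str = "transcribe",
                           timestamps: bool = False) -> List[str]:
    """Build the special-token prefix that conditions Whisper's decoder."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prefix = [
        "<|startoftranscript|>",  # start of the transcription
        f"<|{language}|>",        # language tag, e.g. <|de|> or <|en|>
        f"<|{task}|>",            # recognition or translation into English
    ]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix


print(whisper_decoder_prefix("de"))
# The end point of the transcription is marked by the <|endoftext|> token.
```

The prefix carries a single language tag per decoded window, so a window that contains speech in two languages is still decoded under one tag.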

#### SALMONN

The SALMONN model, developed by Tsinghua University and ByteDance (Tang et al. 2023), extends the capabilities of Large Language Models (LLMs), such as Vicuna (Chiang et al. 2023), to directly perceive and interpret general audio inputs. This enhancement allows LLMs to perform competitively across a range of speech and audio processing tasks. SALMONN integrates information from two specialized encoders: Whisper (Radford et al. 2023) for speech and BEATs (Chen et al. 2022) for general audio, using a window-level Q-Former module (Zhang et al. 2024). The resulting augmented audio tokens are aligned with the LLM's internal representations, enabling seamless multi-modal understanding.

#### Phi-4-Multimodal

Phi-4-multimodal (referred to as Phi) is a 5.6B-parameter instruction-tuned multimodal transformer developed by Microsoft. It is designed for unified processing of text, image, and audio inputs, enabling it to handle tasks across vision-language, vision-speech, and speech-language domains. The model supports a context length of up to 128K tokens and utilizes 32 transformer layers equipped with Grouped Query Attention (GQA) (Ainslie et al. 2023) for efficient long-context processing. Vision and audio modalities are mapped into the text embedding space via two-layer multilayer perceptrons (MLPs). Phi demonstrates strong performance on a wide range of multilingual and multimodal benchmarks.

#### wav2vec2

wav2vec2 (Baevski et al. 2020) is a self-supervised learning framework designed to learn speech representations directly from raw audio. The model includes a convolutional feature encoder that converts audio into latent representations, followed by a transformer network that captures context over time.
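Section 5.1 classes wav2vec2 as a CTC-based system: the network emits one label per audio frame, and a transcript is obtained by collapsing that frame sequence. The following is a minimal sketch of greedy CTC decoding under stated assumptions. The toy vocabulary and the choice of index 0 as the blank label are illustrative, and the text does not say which decoding strategy the authors used.

```python
from typing import Dict, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], id_to_char: Dict[int, str],
                      blank_id: int = 0) -> str:
    """Collapse repeated frame labels, then drop blanks (greedy CTC decoding)."""
    decoded = []
    previous = None
    for label in frame_ids:
        # Emit a label only when it differs from the previous frame and is
        # not the blank; a blank between two equal labels keeps both.
        if label != previous and label != blank_id:
            decoded.append(id_to_char[label])
        previous = label
    return "".join(decoded)


# Toy vocabulary (assumption): index 0 is the CTC blank.
VOCAB = {0: "<blank>", 1: "H", 2: "E", 3: "L", 4: "O"}

# Per-frame argmax labels; the blank between the two 3s preserves the double "L".
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
print(ctc_greedy_decode(frames, VOCAB))
```

In practice the per-frame labels come from an argmax over the model's output logits, and a language model or beam search can replace the greedy step.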

We use the wav2vec2-large-960h-lv60-self model...
