Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

Hugging Face Daily Papers

Summary

Researchers introduce CSR-L and CS-MTEB benchmarks showing that code-switching queries degrade IR system performance by up to 27%, revealing embedding-space divergence that current multilingual techniques cannot fix.

Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval systems remain predominantly designed for, and evaluated within, monolingual contexts. To bridge this critical disconnect, we present a holistic study dedicated to code-switching IR. We introduce CSR-L (Code-Switching Retrieval benchmark-Lite), constructing a dataset via human annotation to capture the authentic naturalness of mixed-language queries. Our evaluation across statistical, dense, and late-interaction paradigms reveals that code-switching acts as a fundamental performance bottleneck, degrading the effectiveness of even robust multilingual models. We demonstrate that this failure stems from substantial divergence in the embedding space between pure and code-switched text. Scaling this investigation, we propose CS-MTEB, a comprehensive benchmark covering 11 diverse tasks, where we observe performance declines of up to 27%. Finally, we show that standard multilingual techniques like vocabulary expansion are insufficient to resolve these deficits completely. These findings underscore the fragility of current systems and establish code-switching as a crucial frontier for future IR optimization.
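One way to make the "divergence in the embedding space" claim concrete is to compare the embedding of each pure query with the embedding of its code-switched counterpart. The sketch below is illustrative only and is not the authors' analysis: the function names `cosine_similarity` and `mean_paired_divergence` are hypothetical, the vectors are made-up toy values, and cosine distance is just one plausible divergence measure assumed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_paired_divergence(pure_vecs, cs_vecs):
    """Average cosine distance (1 - similarity) over aligned query pairs.

    pure_vecs[i] and cs_vecs[i] are assumed to embed the same query,
    once in a single language and once in code-switched form.
    """
    if len(pure_vecs) != len(cs_vecs) or not pure_vecs:
        raise ValueError("need a non-empty, equal number of paired vectors")
    distances = [1.0 - cosine_similarity(p, c) for p, c in zip(pure_vecs, cs_vecs)]
    return sum(distances) / len(distances)


# Hypothetical 3-dimensional embeddings, made up for illustration only.
# In practice these would come from whichever retriever is being studied.
pure = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
code_switched = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

divergence = mean_paired_divergence(pure, code_switched)
print(f"mean cosine distance: {divergence:.3f}")
```

A value near 0 would suggest the model places a query and its code-switched form close together, while larger values would indicate the kind of separation the abstract describes.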

Cached at: 04/22/26, 06:17 AM

Source: https://huggingface.co/papers/2604.17632

Abstract

Code-switching poses significant challenges for information retrieval systems, revealing performance bottlenecks and embedding space divergences that current multilingual approaches cannot fully address.

Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval systems remain predominantly designed for, and evaluated within, monolingual contexts. To bridge this critical disconnect, we present a holistic study dedicated to code-switching IR. We introduce CSR-L (Code-Switching Retrieval benchmark-Lite), constructing a dataset via human annotation to capture the authentic naturalness of mixed-language queries. Our evaluation across statistical, dense, and late-interaction paradigms reveals that code-switching acts as a fundamental performance bottleneck, degrading the effectiveness of even robust multilingual models. We demonstrate that this failure stems from substantial divergence in the embedding space between pure and code-switched text. Scaling this investigation, we propose CS-MTEB, a comprehensive benchmark covering 11 diverse tasks, where we observe performance declines of up to 27%. Finally, we show that standard multilingual techniques like vocabulary expansion are insufficient to resolve these deficits completely. These findings underscore the fragility of current systems and establish code-switching as a crucial frontier for future IR optimization.
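The abstract does not say whether the "up to 27%" decline is relative to the monolingual score or an absolute difference in points. The sketch below shows both readings side by side so the distinction is clear. The scores are invented toy numbers, not results from the paper, and the helper names `relative_decline` and `absolute_decline_points` are hypothetical.

```python
def relative_decline(baseline, degraded):
    """Percentage drop relative to the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - degraded) / baseline * 100.0


def absolute_decline_points(baseline, degraded):
    """Drop in percentage points, for scores expressed on a 0-1 scale."""
    return (baseline - degraded) * 100.0


# Made-up scores for illustration only (e.g. a retrieval metric on a 0-1 scale).
monolingual_score = 0.60
code_switched_score = 0.45

rel = relative_decline(monolingual_score, code_switched_score)
pts = absolute_decline_points(monolingual_score, code_switched_score)
print(f"relative decline: {rel:.1f}%")
print(f"absolute decline: {pts:.1f} points")
```

With these toy numbers the same gap reads as roughly a 25% relative decline or roughly 15 points absolute, which is why the convention matters when comparing against the reported figure.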

Get this paper in your agent:

hf papers read 2604.17632

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

arXiv cs.CL

This paper presents a benchmark evaluating five commercial ASR systems on code-switching speech across Arabic-English, Persian-English, and German-English pairs, using a two-stage pipeline to select 300 samples per pair and assessing performance with WER and BERTScore. ElevenLabs Scribe v2 achieves the lowest overall WER (13.2%) and highest BERTScore (0.936), with public dataset available.

Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch

arXiv cs.CL

This paper introduces a data-efficient fine-tuning framework for teaching reasoning models to code-switch (mix languages) effectively, demonstrating that strategic code-switching can improve reasoning capabilities for lower-resource languages. The work analyzes code-switching behaviors in large language models across diverse languages, tasks, and domains, then develops interventions to promote beneficial code-switching patterns.