Tag
This paper proposes Judge-LS, a protocol for evaluating whether LLM-as-a-judge models are invariant to language switching between English and Chinese. It finds that switching languages flips the judge's preference in 10.7% to 14.4% of cases and that judges achieve their highest accuracy in English.
The author observes that LLMs exhibit denominational bias depending on language (Protestant-leaning in English, Catholic-leaning in Spanish/French/Portuguese) and introduces a free Bible study app called Biblians.
An experiment that runs the same research prompt about LENR and superconductivity through six AI systems in five languages reveals significant linguistic bias: non-English queries surface information about real industrial commitments that English-only searches miss.
This paper demonstrates that LLMs are heavily biased toward English, and shows that continual pre-training does not offer cost advantages over training from scratch for adapting models to other languages, especially for cultural understanding.
Researchers identify systematic English and query-language bias in multilingual RAG rerankers and introduce LAURA, a utility-driven alignment method that boosts performance by retrieving answer-critical documents across languages.