LLM performance in Estonian

Reddit r/ArtificialInteligence Tools

Summary

The Institute of the Estonian Language has released an open benchmark that evaluates LLM performance in Estonian across language proficiency, reasoning, factual accuracy, and resistance to propaganda. The results indicate that models that score well on English benchmarks may perform differently in smaller-language environments.

The Institute of the Estonian Language (EKI) has released an open benchmark for evaluating LLM performance in Estonian. The benchmark goes beyond simple language understanding and evaluates multiple dimensions, including:

• Estonian language proficiency
• Reasoning and problem-solving
• Factual accuracy
• Resistance to propaganda and manipulative prompts
• Reliability across different tasks

One interesting result is that leading models show significant differences in their susceptibility to narrative steering and propaganda-style prompting. Models that perform well on general benchmarks do not necessarily perform equally well when tested in a smaller-language information environment.

The benchmark and results are publicly available: https://moodupuu.eki.ee/

This is a useful example of why evaluating LLMs only on English-centric benchmarks can miss important weaknesses that become visible in smaller languages and local information ecosystems.

I’d be interested to hear how people here approach evaluation for non-English languages and whether propaganda/manipulation resistance should become a standard benchmark category.

Similar Articles

These LLMs are the best at resisting Russian propaganda

Ars Technica

A benchmark study by the Estonian Language Institute evaluates LLMs on their ability to resist Russian propaganda, finding that Nvidia's Nemotron, Alibaba's Qwen, and OpenAI's GPT-5.4 perform well, while Google's Gemini models show notable weaknesses, especially when prompted in Russian.

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

arXiv cs.CL

This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.