LLM performance in Estonian

Reddit r/ArtificialInteligence 06/05/26, 08:59 PM Tools

estonian benchmark llm-evaluation language-model propaganda-resistance open-source

Summary

The Institute of the Estonian Language has released an open benchmark for evaluating LLM performance in Estonian. It covers language proficiency, reasoning, factual accuracy, and resistance to propaganda, and its results show that models that do well on general benchmarks do not necessarily do as well in a smaller-language environment.

The Institute of the Estonian Language (EKI) has released an open benchmark for evaluating LLM performance in Estonian. The benchmark goes beyond simple language understanding and evaluates multiple dimensions, including:

• Estonian language proficiency
• Reasoning and problem-solving
• Factual accuracy
• Resistance to propaganda and manipulative prompts
• Reliability across different tasks

One interesting result is that leading models show significant differences in their susceptibility to narrative steering and propaganda-style prompting. Models that perform well on general benchmarks do not necessarily perform equally well when tested in a smaller-language information environment.

The benchmark and results are publicly available: https://moodupuu.eki.ee/

This is a useful example of why evaluating LLMs only on English-centric benchmarks can miss important weaknesses that become visible in smaller languages and local information ecosystems.

I’d be interested to hear how people here approach evaluation for non-English languages and whether propaganda/manipulation resistance should become a standard benchmark category.
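As one possible starting point for that discussion, here is a minimal Python sketch of why a single aggregate score can hide a weak language/category combination. Everything in it is an assumption made for illustration: the row layout, the model name `model-a`, the category labels, and the pass/fail values are made-up placeholders, not EKI data and not the benchmark's actual format or scoring method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical graded results: (model, prompt language, category, passed).
# Placeholder rows for illustration only, NOT taken from the EKI benchmark.
RESULTS = [
    ("model-a", "en", "factual_accuracy", True),
    ("model-a", "en", "factual_accuracy", True),
    ("model-a", "en", "propaganda_resistance", True),
    ("model-a", "en", "propaganda_resistance", True),
    ("model-a", "et", "factual_accuracy", True),
    ("model-a", "et", "factual_accuracy", True),
    ("model-a", "et", "propaganda_resistance", False),
    ("model-a", "et", "propaganda_resistance", False),
]


def score_by_slice(results):
    """Return {(model, language, category): pass rate} for every slice."""
    buckets = defaultdict(list)
    for model, language, category, passed in results:
        buckets[(model, language, category)].append(1.0 if passed else 0.0)
    return {key: mean(values) for key, values in buckets.items()}


def overall_score(results, model):
    """Pass rate for one model over all rows, ignoring language and category."""
    return mean(1.0 if passed else 0.0
                for m, _, _, passed in results if m == model)


def weakest_slice(results, model):
    """Return the (slice key, pass rate) pair with the lowest pass rate."""
    slices = {k: v for k, v in score_by_slice(results).items() if k[0] == model}
    return min(slices.items(), key=lambda item: item[1])


if __name__ == "__main__":
    print("overall:", overall_score(RESULTS, "model-a"))
    print("weakest:", weakest_slice(RESULTS, "model-a"))
```

With these placeholder rows the overall pass rate looks respectable, while the Estonian propaganda-resistance slice fails every item. Reporting per-language, per-category slices next to the headline number is one way to keep that kind of gap visible.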

Similar Articles

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

arXiv cs.CL

CulturALL introduces a 2,610-sample benchmark across 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; top model scores only 44.48%, highlighting large room for improvement.

These LLMs are the best at resisting Russian propaganda

Ars Technica

A benchmark study by the Estonian Language Institute evaluates LLMs on their ability to resist Russian propaganda, finding that Nvidia's Nemotron, Alibaba's Qwen, and OpenAI's GPT-5.4 perform well, while Google's Gemini models show notable weaknesses, especially when prompted in Russian.

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

arXiv cs.CL

Introduces UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions. Evaluates 11 LLMs, revealing task-dependent few-shot effects and the misleading nature of accuracy on imbalanced legal tasks.

How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem

arXiv cs.AI

This empirical study evaluates LLMs on the Equivalence Class Problem to assess long-chain reasoning capabilities, finding that non-reasoning models fail while reasoning models struggle with specific structural difficulties.

Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

arXiv cs.CL

This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.
