LLM performance in Estonian
Summary
The Institute of the Estonian Language has released an open benchmark for evaluating LLM performance in Estonian. It covers language proficiency, reasoning, factual accuracy, and resistance to propaganda, and its results indicate that models that score well on English benchmarks may perform differently in smaller-language settings.
Similar Articles
CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
CulturALL introduces a 2,610-sample benchmark across 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; the top model scores only 44.48%, highlighting substantial room for improvement.
These LLMs are the best at resisting Russian propaganda
A benchmark study by the Institute of the Estonian Language evaluates LLMs on their ability to resist Russian propaganda, finding that Nvidia's Nemotron, Alibaba's Qwen, and OpenAI's GPT-5.4 perform well, while Google's Gemini models show notable weaknesses, especially when prompted in Russian.
UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning
Introduces UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions. Evaluates 11 LLMs, revealing task-dependent few-shot effects and the misleading nature of accuracy on imbalanced legal tasks.
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
This empirical study evaluates LLMs on the Equivalence Class Problem to assess long-chain reasoning capabilities, finding that non-reasoning models fail while reasoning models struggle with specific structural difficulties.
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.