Tag
The paper presents CalBrief, a pilot diagnostic benchmark of 16 evidence packages and 96 human-verified takeaways for evaluating whether large language models can generate evidence-calibrated scientific briefings. The study finds that structured organization improves reasoning but explicit strength-calibration policies are overly conservative, with most conservatism arising from expanded label spaces rather than signal injection.
An analysis of political leanings in six major AI models, showing that four of the six lean left of center on the economic axis and that some models are unaware of their own bias.
Introduces Age of LLM, a turn-based 1v1 benchmark where LLMs compete on a grid with fog of war and diplomacy, measuring reasoning, reliability, and strategic planning. Findings show a dominance of nuclear rush tactics and a weak link between reliability and winning.
An open-source LLM benchmark that runs 147 coding tasks every 4 hours, reporting the median of 5 trials with 95% confidence intervals and using CUSUM for change-point detection. Its methodology has sparked discussion.
Multi-LCB extends the LiveCodeBench benchmark to evaluate LLMs across twelve programming languages while preserving contamination controls, revealing Python overfitting and language-specific contamination issues.
Stickblade Arena is a new benchmark where LLM agents control ragdolls in a 2D physics sword-fighting simulator, testing multi-turn tactical reasoning, spatial awareness, and real-time decision-making under adversarial pressure. Early results reveal capability gaps: DeepSeek R1 dominates melee but fails with the bow because of time limits, while small models excel at close-range fighting.
MÖVE is a holistic benchmark for evaluating large language models in the German public sector, covering performance and governance criteria across 39 models using ten German-language datasets.
MTG Bench evaluates how well LLMs can play Magic: The Gathering using an MCP server for library operations, showing both successes and failures in complex game actions.
This paper introduces UPBench, a benchmark to evaluate large language models on urban planning knowledge across four knowledge pillars and five cognitive levels, finding that models perform better on higher-order analysis than factual recall, and identifying epistemic limitations such as regulatory hallucination and phronetic deficit.
This paper introduces OfficeEval, a benchmark based on China's National Computer Rank Examination (NCRE) to evaluate LLM agents on complex Office automation tasks. Frontier models achieve at best 36.6% in the single-turn setting and 68.8% with agentic systems, far below human-level performance.
A static Linux command builder for llama.cpp that helps construct CLI commands, run benchmarks, and log results.
A new open-source benchmark called Age of LLM tests LLM reasoning through a turn-based nuclear strategy game with fog of war, diplomacy, and bluffing, offering a more dynamic evaluation than traditional multiple-choice benchmarks.
KINA (Knowledge Index of Noah's Ark) is an 899-item LLM benchmark spanning 261 fine-grained disciplines, introducing formal guarantees for disciplinary representativeness, incentive-aligned annotation via bonus-on-bar tournaments, and bootstrap ranking-stability reporting. Evaluating 42 models, top performers include Gemini-3.1-Pro-Preview (53.17%), Claude-Opus-4.6 (49.92%), and GPT-5.4 (48.55%), revealing a tiered rather than smooth leaderboard structure.
EHRBench is an automated and reliable benchmark for evaluating LLMs on clinical decision-making tasks using real-world electronic health records, covering nearly 1M QA items across diagnosis, treatment, and prognosis tasks.
This paper introduces the AllFaith Religious Representation Benchmark to measure how often LLMs omit religious perspectives when answering everyday ethical questions, finding that models underrepresent religion compared to human expectations, especially in practical personal situations.
ModelRift benchmarks LLMs on generating OpenSCAD code for the Pantheon, with Antigravity 2.0 achieving the top result.
This paper introduces MHGraphBench, a knowledge-graph-grounded benchmark for evaluating large language models on mental health knowledge, including entity recognition, relation judgment, and multi-hop reasoning. Experiments across 15 LLMs reveal a gap between recognition and judgment capabilities.
SCICONVBENCH is a benchmark that evaluates LLMs on multi-turn clarification for ill-posed scientific queries across computational science domains, finding that even frontier models struggle with disambiguation and frequently make silent assumptions.
An independent analysis tested more than 100 LLMs on 117 political questions to map their ideological alignment, finding that DeepSeek and Grok lean left while most other models cluster near the center or right.
Physics-intern is an agentic framework for theoretical physics that improves Gemini 3.1 Pro's performance on the CritPt benchmark from 17.7% to 31.4%, achieving a new state-of-the-art.