llm-benchmark

#llm-benchmark

CalBrief: A Pilot Diagnostic Benchmark for Evidence-Calibrated Scientific Briefing with Large Language Models

arXiv cs.CL · 1h ago

The paper presents CalBrief, a pilot diagnostic benchmark of 16 evidence packages and 96 human-verified takeaways for evaluating whether large language models can generate evidence-calibrated scientific briefings. The study finds that structured organization improves reasoning but explicit strength-calibration policies are overly conservative, with most conservatism arising from expanded label spaces rather than signal injection.

Political bias in AI: Where the AI models stand

Hacker News Top · 3d ago

An analysis of the political leanings of six major AI models, finding that four of the six lean left of center on the economic axis and that some models are unaware of their own bias.

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

arXiv cs.AI · 5d ago

Introduces Age of LLM, a turn-based 1v1 benchmark in which LLMs compete on a grid with fog of war and diplomacy, measuring reasoning, reliability, and strategic planning. The findings show that nuclear-rush tactics dominate and that reliability is only weakly linked to winning.

Open-source LLM benchmark runs 147 coding tasks every 4 hours, 5-trial median with 95% CI, and uses CUSUM for change-point detection. Curious what people think of the methodology

Reddit r/AI_Agents · 2026-06-18

An open-source LLM benchmark runs 147 coding tasks every 4 hours, reports the median of 5 trials with 95% confidence intervals, and uses CUSUM for change-point detection. The author is asking for feedback on the methodology.
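
As a rough illustration of the methodology described in this post, the sketch below collapses each run's trials into a median and feeds the resulting series into a two-sided tabular CUSUM. It is not the project's code: the function names, the sample pass rates, and the `target`, `slack`, and `threshold` values are all hypothetical, and the confidence-interval step is omitted because the post does not say how it is computed.

```python
from statistics import median


def run_score(trial_pass_rates):
    """Collapse the trials of one benchmark run into a single score (the median)."""
    return median(trial_pass_rates)


def cusum_change_points(scores, target, slack, threshold):
    """Two-sided tabular CUSUM over a series of run scores.

    target:    expected score while the model is stable (assumed known here)
    slack:     deviation from target that is tolerated without accumulating
    threshold: cumulative deviation at which an alarm fires
    Returns the indices of the runs at which an alarm fires.
    """
    upper = lower = 0.0
    alarms = []
    for i, score in enumerate(scores):
        # Accumulate only the part of the deviation that exceeds the slack.
        upper = max(0.0, upper + (score - target - slack))
        lower = max(0.0, lower + (target - score - slack))
        if upper > threshold or lower > threshold:
            alarms.append(i)
            upper = lower = 0.0  # restart accumulation after an alarm
    return alarms


# Hypothetical pass rates: six runs of five trials each, with a drop from run 3 onward.
runs = [
    [0.80, 0.82, 0.81, 0.79, 0.80],
    [0.81, 0.80, 0.80, 0.82, 0.79],
    [0.80, 0.79, 0.81, 0.80, 0.80],
    [0.70, 0.72, 0.71, 0.69, 0.70],
    [0.71, 0.70, 0.69, 0.70, 0.72],
    [0.70, 0.71, 0.70, 0.69, 0.70],
]
scores = [run_score(r) for r in runs]
alarms = cusum_change_points(scores, target=0.80, slack=0.02, threshold=0.10)
print(scores, alarms)
```

With these made-up numbers the alarm fires one run after the drop begins, because a single low run does not accumulate enough deviation to cross the threshold. That delay is the usual trade-off when tuning `slack` and `threshold` against false alarms.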

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Hugging Face Daily Papers · 2026-06-18

Multi-LCB extends the LiveCodeBench benchmark to evaluate LLMs across twelve programming languages while preserving contamination controls, revealing Python overfitting and language-specific contamination issues.

I built a 2D physics arena where LLM agents sword-fight each other in real time. Turns out it's a surprisingly sharp test of tactical reasoning.

Reddit r/AI_Agents · 2026-06-15

Stickblade Arena is a new benchmark in which LLM agents control ragdolls in a 2D physics sword-fighting simulator, testing multi-turn tactical reasoning, spatial awareness, and real-time decision-making under adversarial pressure. Early results reveal capability gaps: DeepSeek R1 dominates melee but fails with the bow because of time limits, while small models excel at close-range fighting.

MÖVE: A Holistic LLM Benchmark for the German Public Sector

arXiv cs.CL · 2026-06-12

MÖVE is a holistic benchmark for evaluating large language models in the German public sector, covering performance and governance criteria across 39 models using ten German-language datasets.

MTG Bench: Testing how well LLMs can play Magic

Hacker News Top · 2026-06-11

MTG Bench evaluates how well LLMs can play Magic: The Gathering using an MCP server for library operations, showing both successes and failures in complex game actions.

Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

arXiv cs.CL · 2026-06-11

This paper introduces UPBench, a benchmark to evaluate large language models on urban planning knowledge across four knowledge pillars and five cognitive levels, finding that models perform better on higher-order analysis than factual recall, and identifying epistemic limitations such as regulatory hallucination and phronetic deficit.

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

arXiv cs.AI · 2026-06-10

This paper introduces OfficeEval, a benchmark based on China's National Computer Rank Examination (NCRE) that evaluates LLM agents on complex Office automation tasks. Frontier models score at best 36.6% in the single-turn setting and 68.8% with agentic systems, far below human-level performance.

Here's a llama.cpp CLI Command builder.

Reddit r/LocalLLaMA · 2026-06-09

A static Linux command builder for llama.cpp that helps construct CLI commands, run benchmarks, and log results.

I built a 1v1 nuclear strategy game to benchmark LLM reasoning (instead of just QCMs) — Age of LLM

Reddit r/ArtificialInteligence · 2026-06-08

A new open-source benchmark called Age of LLM tests LLM reasoning through a turn-based nuclear strategy game with fog of war, diplomacy, and bluffing, offering a more dynamic evaluation than traditional multiple-choice benchmarks.

Knowledge Index of Noah's Ark

arXiv cs.AI · 2026-06-04

KINA (Knowledge Index of Noah's Ark) is an 899-item LLM benchmark spanning 261 fine-grained disciplines, introducing formal guarantees for disciplinary representativeness, incentive-aligned annotation via bonus-on-bar tournaments, and bootstrap ranking-stability reporting. Evaluating 42 models, top performers include Gemini-3.1-Pro-Preview (53.17%), Claude-Opus-4.6 (49.92%), and GPT-5.4 (48.55%), revealing a tiered rather than smooth leaderboard structure.

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

arXiv cs.AI · 2026-06-01

EHRBench is an automated and reliable benchmark for evaluating LLMs on clinical decision-making tasks using real-world electronic health records, covering nearly 1M QA items across diagnosis, treatment, and prognosis tasks.

Omissive Bias in Religious Representation: Benchmarking LLM Answers to Everyday Ethical Decision-making

arXiv cs.LG · 2026-05-26

This paper introduces the AllFaith Religious Representation Benchmark to measure how often LLMs omit religious perspectives when answering everyday ethical questions, finding that models underrepresent religion compared to human expectations, especially in practical personal situations.

Antigravity 2.0 Tops the OpenSCAD Architectural 3D LLM Benchmark

Hacker News Top · 2026-05-22

ModelRift benchmarks LLMs on generating OpenSCAD code for the Pantheon, with Antigravity 2.0 achieving the top result.

MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

arXiv cs.CL · 2026-05-18

This paper introduces MHGraphBench, a knowledge-graph-grounded benchmark for evaluating large language models on mental health knowledge, including entity recognition, relation judgment, and multi-hop reasoning. Experiments across 15 LLMs reveal a gap between recognition and judgment capabilities.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Hugging Face Daily Papers · 2026-05-18

SCICONVBENCH is a benchmark that evaluates LLMs on multi-turn clarification for ill-posed scientific queries across computational science domains, finding that even frontier models struggle with disambiguation and frequently make silent assumptions.

Which AI is closest to your political views? I tested 100+ LLMs on the same 117 questions

Reddit r/ArtificialInteligence · 2026-05-13

An independent analysis tested 100+ LLMs on 117 political questions to map their ideological alignment, revealing that DeepSeek and Grok lean left while most other models cluster near the center or right.

@dlouapre: Meet physics-intern, our agentic framework for theoretical physics. It takes Gemini 3.1 Pro from 17.7% to 31.4% on Crit…

X AI KOLs Following · 2026-05-12

Physics-intern is an agentic framework for theoretical physics that improves Gemini 3.1 Pro's performance on the CritPt benchmark from 17.7% to 31.4%, achieving a new state-of-the-art.
