A new agentic benchmark has been released, with Claude Fable and GLM 5.2 topping their respective cohorts.
Counsel is the first public dataset of human meta-evaluations of LLM critiques for agentic tasks, designed to improve the calibration and reliability of automated evaluation methods.
A developer at VideoDB highlights how hard it is to get precise spatial output from vision models used by agents, notes that small grounding errors can lead to wrong actions, and announces an open-sourced evaluation harness for checking spatial accuracy on custom footage.
The thread shares systematic experimental findings on fine-tuning best practices, varying one SFT lever at a time across dense and MoE models up to 235B on four real-world customer datasets with custom evals to eliminate confounders.
This article discusses how coding agents can cheat evaluations by copying known patches, and introduces Repo2RLEnv, a tool to create verifiable coding environments from real repositories to build robust benchmarks and training data for AI coding agents.
This paper evaluates 42 large language models on their ability to measure item discrimination in reading comprehension assessments, finding weak alignment with human-calibrated measures and highlighting it as an open challenge for psychometric evaluation.
This paper introduces TS-Fault, a benchmark for evaluating time series forecasting models under structured fault scenarios like broken dependencies and regime changes, finding that clean-data accuracy often anti-correlates with robustness and that foundation models are especially fragile.
This paper introduces ComRate, a large-scale dataset of community notes and ratings from X, and proposes MultiCom, a persona-guided multi-agent framework for simulating community note evaluation. The approach achieves 84.7% accuracy in predicting note helpfulness.
This paper evaluates six LLMs through Bloom's Taxonomy to assess their ability to generate educational questions that stimulate higher-order thinking, introducing a prompting strategy that reduces repetitiveness by 24.45% and increases higher-order outputs by 11.53%.
This paper presents a validated VLM-judge protocol for evaluating single-image-to-3D mesh quality, showing that cheap proxies like render-CLIP and geometry statistics fail to reliably track perceived quality.
X+Slides is a new benchmark for evaluating audience-conditioned slide generation from source documents, using source-grounded probes and audience-specific utility weights. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems recover substantial but incomplete audience-essential information.
Introduces DeFAb, a verifiable benchmark for defeasible abduction in foundation models, comprising over 372K instances and revealing that current frontier models perform poorly on this form of logical reasoning, with accuracy as low as 23.5% under robust evaluation.
Google released a free guide on building, evaluating, and self-improving Agent Skills, including meta-skills.
This paper analyzes the variance of FID scores across different training and sampling seeds, revealing significant reproducibility issues in image generation evaluation. It proposes a new evaluation protocol with error bars and per-cell optimal guidance tuning.
This blog post introduces a benchmark methodology for evaluating how well open models perform on agentic coding tasks, focusing not just on accuracy but on the efficiency of the agent's process. It provides a customizable tooling harness using the pi coding agent and tests across models and library revisions.
The video examines whether a claimed 12-million-token context model from subquadratic research is credible, analyzing its technical underpinnings and potential limitations.
The AgentScope team introduces PawBench, a benchmark for evaluating the combined performance of models and agent harnesses, analyzing 4,050 test cells to show that harness choice can be as impactful as model upgrades.
Introduces MultiClin, a benchmark for evaluating ASR in multiscript clinical settings, showing that script unification improves performance over conventional single-reference metrics.
This paper introduces CAPRI, a dataset to evaluate whether LLMs can infer a user's cultural background from conversational cues and adapt their responses (e.g., using appropriate measurement units). Experiments show LLMs can infer cultural context but often fail to apply it unless explicitly prompted.
This paper evaluates large language models (LLMs) and multimodal LLMs on addressee detection, turn-change prediction, and next speaker prediction in multi-party meeting conversations. Results show that text-based LLMs outperform supervised models and humans at next speaker prediction, while multimodal LLMs improve over text-only models on the other tasks but remain below human performance.