evaluation

#evaluation

New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts

Reddit r/LocalLLaMA · 4d ago

A new agentic benchmark has been released, with Claude Fable and GLM 5.2 topping their respective cohorts.

Counsel: A Meta-Evaluation Dataset for Agentic Tasks

Hugging Face Daily Papers · 4d ago

Counsel is the first public dataset of human meta-evaluations of LLM critiques for agentic tasks, designed to improve the calibration and reliability of automated evaluation methods.

Agents that act on what a camera sees: the spatial output is the weak link

Reddit r/AI_Agents · 5d ago

A developer at VideoDB highlights the problem of precise spatial output from vision models when used by agents, sharing that small grounding errors can lead to wrong actions, and announces an open-sourced evaluation harness for checking spatial accuracy on custom footage.

@oneill_c: 1/ We fine-tune a lot of customer models, so we decided to systematically try and figure out some best practices for fi…

X AI KOLs Following · 5d ago

The thread shares systematic experimental findings on fine-tuning best practices, varying one SFT lever at a time across dense and MoE models up to 235B on four real-world customer datasets with custom evals to eliminate confounders.

@adithya_s_k: https://x.com/adithya_s_k/status/2067628584680710292

X AI KOLs Timeline · 5d ago

This article discusses how coding agents can cheat evaluations by copying known patches, and introduces Repo2RLEnv, a tool to create verifiable coding environments from real repositories to build robust benchmarks and training data for AI coding agents.

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

arXiv cs.CL · 5d ago

This paper evaluates 42 large language models on their ability to measure item discrimination in reading comprehension assessments, finding weak alignment with human-calibrated measures and highlighting it as an open challenge for psychometric evaluation.

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

arXiv cs.LG · 5d ago

This paper introduces TS-Fault, a benchmark for evaluating time series forecasting models under structured fault scenarios like broken dependencies and regime changes, finding that clean-data accuracy often anti-correlates with robustness and that foundation models are especially fragile.

Towards Multi-Agent-Simulation-Based Community Note Evaluation

arXiv cs.AI · 5d ago

This paper introduces ComRate, a large-scale dataset of community notes and ratings from X, and proposes MultiCom, a persona-guided multi-agent framework for simulating community note evaluation. The approach achieves 84.7% accuracy in predicting note helpfulness.

From Memorization to Creation: Evaluating the Cognitive Depth of LLM-Generated Educational Questions

arXiv cs.AI · 5d ago

This paper evaluates six LLMs through Bloom's Taxonomy to assess their ability to generate educational questions that stimulate higher-order thinking, introducing a prompting strategy that reduces repetitiveness by 24.45% and increases higher-order outputs by 11.53%.

A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

arXiv cs.LG · 5d ago

This paper presents a validated VLM-judge protocol for evaluating single-image-to-3D mesh quality, showing that cheap proxies like render-CLIP and geometry statistics fail to reliably track perceived quality.

X+Slides: Benchmarking Audience-Conditioned Slide Generation

arXiv cs.AI · 5d ago

X+Slides is a new benchmark for evaluating audience-conditioned slide generation from source documents, using source-grounded probes and audience-specific utility weights. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems recover substantial but incomplete audience-essential information.

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

arXiv cs.AI · 5d ago

Introduces DeFAb, a verifiable benchmark for defeasible abduction in foundation models, comprising over 372K instances and revealing that current frontier models perform poorly on this form of logical reasoning, with accuracy as low as 23.5% under robust evaluation.

@Saboo_Shubham_: MUST READ. Google's new guide on building and evaluating Agent Skills. It also covers meta-skills and self-improving Ag…

X AI KOLs Following · 5d ago

Google released a free guide on building, evaluating, and self-improving Agent Skills, including meta-skills.

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Hugging Face Daily Papers · 5d ago

This paper analyzes the variance of FID scores across different training and sampling seeds, revealing significant reproducibility issues in image generation evaluation. It proposes a new evaluation protocol with error bars and per-cell optimal guidance tuning.

Is it agentic enough? Benchmarking open models on your own tooling

Hugging Face Blog · 5d ago

This blog post introduces a benchmark methodology for evaluating how well open models perform on agentic coding tasks, focusing not just on accuracy but on the efficiency of the agent's process. It provides a customizable tooling harness using the pi coding agent and tests across models and library revisions.

Does subquadratic's 12 million context model claim hold any water?

Reddit r/singularity · 6d ago

The video examines whether a claimed 12 million context model from subquadratic research is credible, analyzing its technical underpinnings and potential limitations.

@Ali_TongyiLab: https://x.com/Ali_TongyiLab/status/2067158015615041755

X AI KOLs Timeline · 6d ago

The AgentScope team introduces PawBench, a benchmark for evaluating the combined performance of models and agent harnesses, analyzing 4,050 test cells to show that harness choice can be as impactful as model upgrades.

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

arXiv cs.CL · 6d ago

Introduces MultiClin, a benchmark for evaluating ASR in multiscript clinical settings, showing that script unification improves performance over conventional single-reference metrics.

LLMs Infer Cultural Context but Fail to Apply It When Responding

arXiv cs.CL · 6d ago

This paper introduces CAPRI, a dataset to evaluate whether LLMs can infer a user's cultural background from conversational cues and adapt their responses (e.g., using appropriate measurement units). Experiments show LLMs can infer cultural context but often fail to apply it unless explicitly prompted.

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

arXiv cs.CL · 6d ago

This paper evaluates the abilities of large language models (LLMs) and multimodal LLMs for addressee detection, turn-change prediction, and next speaker prediction in multi-party meeting conversations. Results show text-based LLMs outperform supervised models and humans in next speaker prediction, while multimodal LLMs improve over text-only models in other tasks but remain below human performance.
