Tag
OpenMythos introduces a new open-source benchmark for evaluating AI models on mythological knowledge.
A discussion of the methodologies and challenges involved in evaluating AI features once they are deployed in production environments.
A benchmark of 8 LLMs for medical scribing found hallucinations rare but omissions a concern.
LangChain's LangSmith enables developers to use tracing as compliance evidence for the EU AI Act, with customizable evaluators for bias, hallucination, toxicity, accuracy, and adversarial inputs.
A developer shares surprising lessons from fine-tuning a small open model: base models often already max out on the intended improvements, the real weakness is behavioral (caving), and fine-tuning requires careful measurement and balancing.
The Loop Template Library (loop-library) has been released, covering 50 specific scenarios across engineering, operations, evaluation, design, and other areas. Each Loop combines a feedback, judgment, and iteration cycle with four Skill capabilities, and supports template matching and adaptive modification.
Calvin Zhang joins OpenAI as Research Program Manager, overseeing evaluation work, bringing extensive experience from Scale AI. This personnel move reflects the flow of evaluation talent in the AI arms race.
This paper introduces AgentCIBench, a benchmark to evaluate privacy risks in computer-use agents, finding that 11 of 15 frontier agents leak information in over 50% of scenarios.
HAKARI-Bench is a lightweight benchmark for comparing retrieval methods across multiple configurations and languages, enabling efficient model selection and performance analysis. It reproduces full benchmarks like MTEB at high correlation while being faster to run.
EnterpriseClawBench presents a benchmark for enterprise agents based on real-world workplace sessions, offering 852 reproducible tasks and comprehensive evaluation metrics beyond single performance scores.
A user presents a comprehensive comparison of local text-to-image models using 192 prompts, evaluating capabilities such as text rendering, faces, anatomy, and spatial composition, with results and prompts publicly available at imagebench.ai.
Argues that the key skill for product managers in the AI era is loop engineering, not prompt engineering. Describes how to create reusable, self-improving loops for AI agents to maintain quality and avoid drift.
AgentX is an AI agent evaluation framework that helps pinpoint issues and fix them with one click.
PlanBench-XL is a new benchmark that evaluates LLM agents' ability to plan and adapt in large tool ecosystems with limited visibility and dynamic disruptions. Experiments show GPT-5.4 achieves only 51.9% accuracy in block-free settings and collapses to 11.36% under severe blocking, highlighting significant challenges in long-horizon planning.
GLM 5.2 ranks second on the Vending Bench business simulation benchmark while costing less than half of Opus, demonstrating strong performance at lower cost.
The author argues that the true measure of an AI agent's utility is how many open loops it closes autonomously, rather than demo performance or integration count, and cites Runner as a desktop tool that effectively closes such loops by pulling cross-app context.
This paper presents a systematic review and benchmark of 24 black-box uncertainty estimation methods for large language models across 4 models and 4 dataset settings, finding that no single method dominates but hybrid methods that combine multiple uncertainty signals perform well.
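This paper introduces ORAgentBench, an execution-based benchmark for evaluating LLM agents on end-to-end operations research tasks, comprising 107 human-reviewed tasks. Experiments show that the best current agent passes only 35.51% of the tasks, revealing major shortcomings in reliable decision-making.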
This paper benchmarks agentic review systems for peer review, evaluating open-source and proprietary systems on research papers. The best configuration achieves 83.0% pairwise accuracy and catches 71.6% of injected errors, but user feedback highlights issues with false positives and nitpicks.
A Microsoft and York University paper argues that ascribing human-like attributes to LLMs is problematic because of flawed experimental designs, using Age of Empires II as an analogy to highlight measurement issues.