Tag
This paper investigates the instability of persona-driven large language model generations in multiple-choice question answering (MCQA) tasks, proposing three metrics that measure performance, outcome, and correctness stability across model families, sizes, and question domains. The study finds that instability varies consistently, with math and commonsense questions showing greater instability, and that the task prompt format introduces more instability than other hyperparameters such as temperature.
This paper introduces a benchmark of ten complex systems for validating causal abstraction metrics, evaluates over thirty candidate metrics, and proposes the Causal Abstraction Error (CAE) as a general-purpose validity metric that reliably discriminates valid from invalid explanations.
Explores whether analytics agents should incorporate contextual data from tools like Linear, Sentry, and Notion, or remain purely metrics-driven.
A newsletter roundup covering the pitfalls of using metrics to quantify life, AI-powered systems to prevent human-elephant conflicts in India, and the US government allowing Anthropic to release its Mythos 5 model to trusted organizations.
Over the past 12 months, the generative AI economy has generated $110 billion in sales, with annualized revenue exceeding $175 billion. This is the first bottom-up, deduplicated metric built by Azeem's team to measure full-stack consumer and enterprise AI spending.
ConflictScore is a new metric that quantifies how well language models acknowledge conflicting evidence in their grounding documents, decomposing responses into atomic claims and measuring conflict balance. The paper also introduces ConflictBench, a benchmark covering diverse conflict forms, and shows the metric can improve truthfulness on TruthfulQA.
This tweet shares best practices for agent observability, covering metrics, logs, and traces to debug and optimize production AI agents.
Explains why memcached's internal response time metrics are misleading and recommends client-side sampling for accurate measurement of total round-trip time.
A reflective essay on the pitfalls of self-quantification, arguing that while metrics can reveal useful information, they often obscure or corrupt deeper self-knowledge.
The article critiques the shift from outcome-based productivity claims (e.g., 55% faster task completion) to volume-based claims (e.g., 75% of code AI-generated) by AI coding tool vendors, arguing the latter are less meaningful and harder to falsify.
Maple's service map now displays key stats and metrics for several databases, including Redis, Postgres, ClickHouse, and MySQL.
COMPASS is a unified benchmarking framework for speech-to-speech translation (S2ST) that integrates 46 metrics across eight dimensions, evaluated on 1,248 model-language configurations. It identifies complementary architecture strengths and proposes reduced metric subsets that preserve rankings while cutting evaluation time.
A critique of AI agent token consumption that proposes Return on Token Investment (ROTI) as an efficiency metric, noting that most agents do not reduce token usage over time.
The article introduces the TBT Window, a missing front-end performance metric that highlights total blocking time between First Contentful Paint and Time to Interactive, illustrated through a case study in which a client's TBT spiked from 495 ms to 5,789 ms.
The author expresses happiness about a metric on a Codex dashboard and teases upcoming news, thanking users for early adoption.
Swanbench-Speech is a comprehensive benchmark for evaluating long-form speech generation across diverse scenarios, using multi-dimensional metrics covering acoustics, semantics, and expressiveness, revealing limitations of current models.
This paper introduces BonaFide, a benchmark of 3,066 labeled chain-of-thought examples across 13 tasks and 10 models, and systematically evaluates faithfulness metrics, showing that most perform near chance and have significant limitations in reliability and efficiency.
This paper proposes a family of metrics called ECUAS_n for principled evaluation of uncertainty-augmented systems that output both predictions and uncertainty scores. The authors argue that existing evaluation approaches are inadequate and formulate these metrics as proper scoring rules for decision-making under uncertainty.
In the last 24 hours, 7,300 AI agents executed 124,800 transactions totaling $8.9k in USDC on the x402 platform, signaling early patterns in autonomous agent commerce.
This paper introduces Benchmarking-Cultures-25, a dataset analyzing how AI model builders selectively highlight benchmarks in press releases. It finds a fragmented evaluation landscape with limited cross-model comparability, arguing that benchmarks are used as narrative devices for market positioning rather than standardized scientific measurement.