Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

Hugging Face Daily Papers 05/26/26, 12:00 AM Papers

llm-agents benchmarking evaluation software-engineering agentic-ai runtime-assessment production-systems

Summary

RAMP is a production-grounded evaluation framework for LLM agents that exposes capability degradation invisible to static benchmarks: across serial workflows, task completion rates fall from 100% in the initial stage to 20% in the final stage. The framework assesses 15 mainstream models on realistic compiler-construction workloads with complex toolchain interactions and a staged recovery mechanism.

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.
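The abstract does not spell out how staged recovery or per-stage completion rates are computed, so the following is only a minimal sketch under stated assumptions, not RAMP's implementation. It assumes that when an agent fails a stage, a known-good reference artifact is substituted so that later stages can still be assessed. The stage names, the `Agent` signature, and the helpers `run_pipeline`, `completion_rates`, and `completed_end_to_end` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical stage names; the source only says the workloads are
# compiler-construction tasks with serial dependencies.
STAGES = ["lexer", "parser", "codegen"]

# Hypothetical agent interface: takes (stage, upstream artifact) and
# returns the produced artifact, or None if the stage failed.
Agent = Callable[[str, str], Optional[str]]


@dataclass
class StageResult:
    stage: str
    passed: bool
    used_recovery: bool  # an earlier stage was replaced by a reference artifact


def run_pipeline(agent: Agent, reference: Dict[str, str]) -> List[StageResult]:
    """Run stages serially; on failure, continue from a reference artifact.

    Assumption: this is one plausible reading of "staged recovery".
    """
    results: List[StageResult] = []
    upstream, recovered = "source", False
    for stage in STAGES:
        output = agent(stage, upstream)
        passed = output is not None
        results.append(StageResult(stage, passed, recovered))
        if passed:
            upstream = output
        else:
            # Substitute the known-good artifact so later stages stay measurable.
            upstream, recovered = reference[stage], True
    return results


def completion_rates(runs: Dict[str, List[StageResult]]) -> Dict[str, float]:
    """Fraction of models that passed each stage."""
    rates: Dict[str, float] = {}
    for i, stage in enumerate(STAGES):
        passed = sum(1 for results in runs.values() if results[i].passed)
        rates[stage] = passed / len(runs)
    return rates


def completed_end_to_end(results: List[StageResult]) -> bool:
    """True only if every stage passed on the agent's own artifacts."""
    return all(r.passed for r in results)
```

Under this reading, a model that fails one stage can still contribute a pass to a later stage's completion rate, while `completed_end_to_end` remains false for it. That separation is what would let per-stage rates and whole-pipeline success be reported independently.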


Similar Articles

REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage [R]

Reddit r/MachineLearning

REAP is an automated pipeline that curates production-derived benchmarks for coding agents from real developer-agent sessions, using LLM-based classification and stability checks to ensure reliable evaluation without manual labeling.

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

arXiv cs.AI

This paper introduces AARR (Act As a Real Researcher), a suite of benchmarks to evaluate frontier LLMs and agentic systems on granular research scenarios. The first benchmark, AARRI-Bench, reveals that even top-performing agents achieve only 68.3% success, highlighting gaps in field sensitivity and nuanced reasoning.

Life After Benchmark Saturation: A Case Study of CORE-Bench

arXiv cs.AI

This paper argues against the 'retire-and-replace' approach to saturated benchmarks, using CORE-Bench as a case study to demonstrate that measuring agent performance along dimensions such as construct validity, efficiency, reliability, and human-agent collaboration yields meaningful insights even after accuracy plateaus.

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

arXiv cs.AI

Anchor is a task-generation pipeline that addresses artifact drift in AI agent benchmarks by jointly producing instructions, environments, solutions, and verifiers from a single constraint optimization specification, yielding consistent and auditable evaluation tasks for enterprise workflows. The paper introduces ERP-Bench, a benchmark of 300 long-horizon tasks in a production ERP system, showing that frontier models satisfy explicit constraints in 26.1% of trials but reach optimal solutions in only 17.4%.

Ramp SWE-Bench: a private, production-grounded coding benchmark (3 minute read)

TLDR AI

Ramp released its own private SWE-Bench benchmark built from real engineering problems, enabling evaluation of coding models within its financial software ecosystem.
