@ms_aifrontiers: SentinelBench tests agents in time-evolving web environments where success requires waiting. How you wait matters: on 4…
Summary
SentinelBench is a new benchmark for testing AI agents in time-evolving web environments, where success requires waiting. It finds that agents using a specialized change-detection tool outperform those using sleep-and-poll loops and reduce cost by 9.7x.
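The difference between the two waiting strategies can be sketched in Python. This is an illustration only, not SentinelBench's tool or harness: `Page`, `sleep_and_poll`, and `wait_on_change` are hypothetical names, and the number of checks is used as a rough stand-in for agent cost, on the assumption that each check costs one agent step.

```python
import threading
import time


class Page:
    """Hypothetical stand-in for a web page whose state changes over time."""

    def __init__(self):
        self.status = "pending"
        self._changed = threading.Event()

    def update(self, status):
        # Simulates an external event, e.g. a ticket being resolved.
        self.status = status
        self._changed.set()

    def wait_for_change(self, timeout):
        # Blocks until update() has been called or the timeout elapses.
        # The event is never cleared, so this models a single change only.
        return self._changed.wait(timeout)


def sleep_and_poll(page, interval, max_checks):
    """Inspect the page repeatedly; each inspection counts as one check."""
    checks = 0
    while checks < max_checks:
        checks += 1
        if page.status == "done":
            return True, checks
        time.sleep(interval)
    return False, checks


def wait_on_change(page, timeout):
    """Block on the change signal, then inspect the page a single time."""
    page.wait_for_change(timeout)
    return page.status == "done", 1


if __name__ == "__main__":
    polled = Page()
    threading.Timer(0.2, polled.update, args=("done",)).start()
    print("sleep-and-poll:", sleep_and_poll(polled, interval=0.02, max_checks=100))

    watched = Page()
    threading.Timer(0.2, watched.update, args=("done",)).start()
    print("change detection:", wait_on_change(watched, timeout=2.0))
```

In this toy setup the polling loop inspects the page many times before the change lands, while the change-detection path inspects it once. The exact polling count depends on timing.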
Similar Articles
@ms_aifrontiers: A lot of agent benchmarks assume the world changes only when the agent acts. Many real tasks are different: tickets go …
Discusses a limitation of current agent benchmarks that assume the world changes only when the agent acts, whereas many real-world tasks require the agent to wait for external events before acting.
Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
This paper introduces AgingBench, a benchmark for measuring how deployed AI agents degrade over time due to memory state changes, interaction history, and lifecycle events. It categorizes aging into four mechanisms and provides diagnostic tools for targeted repairs.
@rohanpaul_ai: Today’s frontier agents are far less ready for real-world automation than their benchmark scores suggest. This paper pr…
This paper introduces Agents' Last Exam, a benchmark that tests AI agents on real expert work across 55 digital work areas. The best current agents fail most tasks, averaging only a 2.6% pass rate on the hardest tier, which reveals a large gap between benchmark scores and real-world automation readiness.
I built a benchmark for AI “memory” in coding agents. Looking for others to beat it.
A developer created a new benchmark, continuity-benchmarks, to test whether AI coding agents stay consistent with project rules during active development. It addresses a gap in existing memory benchmarks, which focus on semantic recall rather than real-time architectural consistency and multi-session behavior.
SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
SaaSBench is a new benchmark for evaluating AI agents in enterprise SaaS development, involving multi-component system integration across 30 tasks, 6 domains, and 5,370 validation nodes. Experiments reveal that the main bottleneck for agents is system configuration and integration rather than isolated code generation.