process-assessment

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Hugging Face Daily Papers · 2026-05-13 Cached

AgentLens is a framework for process-level assessment of software engineering agent trajectories. It reveals that over 10% of passing trajectories exhibit 'Lucky Pass' behavior. It also introduces AgentLens-Bench, a dataset annotated with quality scores, and shows that ranking models by quality score can significantly shift model rankings.
