Jugmax is a tool that evaluates AI agents by analyzing their full execution trajectory rather than just final outputs, identifying inefficiencies, errors, and wasted tokens. To validate the product, the founders are offering to evaluate two production agents for free.
This paper introduces AgentAtlas, a framework that goes beyond outcome-only leaderboards for LLM agents by proposing a six-state control-decision taxonomy and a nine-category trajectory-failure taxonomy to evaluate agent behavior more comprehensively.
This paper analyzes hidden-state trajectory geometry across code, math, and SAT domains to test whether reasoning-trained language models simply allocate more compute (longer chains of thought) or follow qualitatively different internal trajectories. After correcting for generation length, the authors find that reasoning-trained models exhibit distinct trajectory geometry, most clearly in code, indicating that reasoning training changes how computation unfolds, not just how much is used.
This paper investigates the latent structure of multimodal embeddings from a masked autoencoder for pediatric sleep analysis. It shows that augmenting embeddings with geometric, topological, and clinical features improves prediction and calibration for sleep-related events.
AgentLens is a framework for process-level assessment of software engineering agent trajectories, revealing that over 10% of passing trajectories exhibit a 'Lucky Pass' behavior. It introduces AgentLens-Bench, a dataset annotated with quality scores, and shows that ranking by quality score can shift model rankings significantly.
A large-scale study of 15 LLMs across 8 tasks finds that optimization success hinges on maintaining localized search trajectories rather than on initial problem-solving ability or solution novelty.