trajectory-analysis

I want to show you where your agent messes up so I can validate my product

Reddit r/AI_Agents · 2026-06-16

Jugmax is a tool that evaluates AI agents by analyzing their full execution trajectory rather than just final outputs, identifying inefficiencies, errors, and wasted tokens. The founders offer free evaluation for two production agents to validate their product.

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

arXiv cs.AI · 2026-05-22

This paper introduces AgentAtlas, a framework that goes beyond outcome-only leaderboards for LLM agents by proposing a six-state control-decision taxonomy and a nine-category trajectory-failure taxonomy to evaluate agent behavior more comprehensively.

Reasoning Models Don't Just Think Longer, They Move Differently

arXiv cs.CL · 2026-05-18

This paper investigates whether reasoning-trained language models simply allocate more compute (longer chains of thought) or follow qualitatively different internal trajectories, by analyzing hidden-state trajectory geometry across code, math, and SAT domains. After correcting for generation length, the authors find that reasoning-trained models exhibit distinct trajectory geometry, most clearly in code, which indicates that reasoning training changes how computation unfolds, not just how much is used.

Uncovering Trajectory and Topological Signatures in Multimodal Pediatric Sleep Embeddings

arXiv cs.LG · 2026-05-15

This paper investigates the latent structure of multimodal embeddings from a masked autoencoder for pediatric sleep analysis. It shows that augmenting embeddings with geometric, topological, and clinical features improves prediction and calibration for sleep-related events.

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Hugging Face Daily Papers · 2026-05-13

AgentLens is a framework for process-level assessment of software engineering agent trajectories. It reveals that over 10% of passing trajectories exhibit 'Lucky Pass' behavior. The paper also introduces AgentLens-Bench, a dataset annotated with quality scores, and shows that ranking by quality score can shift model rankings significantly.

What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search

Hugging Face Daily Papers · 2026-04-21

A large-scale study of 15 LLMs across 8 tasks finds that optimization success hinges on maintaining localized search trajectories rather than on initial problem-solving ability or solution novelty.
