agent-evals

#agent-evals

@BraceSproul: I've been thinking a lot about the two different groups of evals you need in general agents/agents which handle broad t…

X AI KOLs Following · 2026-05-19

A Twitter thread on the two distinct evaluation suites that general AI agents need: a lightweight benchmark eval for quick iteration, and a comprehensive test-coverage eval for thorough validation across diverse user paths.

#agent-evals

@ArizePhoenix: Who judges the evaluators? When you use LLM-as-a-judge, you’re trusting a model to decide whether your agent, workflow,…

X AI KOLs Following · 2026-05-07

The article discusses the challenges of debugging and evaluating LLM judges, and how Arize Phoenix traces evaluator runs via OpenTelemetry so that their decision logic, costs, and potential biases can be inspected.
