This paper demonstrates that allowing attackers to strategically choose when to attack (attack selection) in agentic AI control evaluations significantly reduces measured safety, suggesting that current evaluations may overestimate safety against selective attackers.
Discusses the need to keep AI evaluation benchmarks evolving by refining their difficulty, quality, and diversity, citing examples such as MMLU-Pro, MMLU-Redux, BIG-Bench Extra Hard, RealMath, MathArena, and DatBench.
A developer asks for recommendations for open-source alternatives to LangSmith for tracing, evaluations, and debugging agent workflows, citing restrictive paywalls.
Arize Phoenix announces a free 2-hour evaluations workshop from the AI Engineer: Europe conference, led by head of DevRel Laurie Voss, covering manual data examination and built-in/custom evals.