Tag
RealMath-Eval is a benchmark of 224 real-world high school math exam responses that reveals a significant 'Evaluation Gap': state-of-the-art LLM judges grade authentic human reasoning far less accurately (MSE ~2.96) than synthetic LLM-generated solutions (MSE ~1.17), a gap attributed to the higher diversity and surprisal of human error patterns.
Aigon is an open-source tool that runs multiple AI coding agents in parallel on the same feature, as described in a Markdown spec, then uses an LLM judge to select the best implementation; it includes a visual dashboard and optional scheduling.
A detailed evaluation of a RAG customer support chatbot reveals that retrieval issues often masquerade as LLM problems, heuristic evaluators are misleading, deduplication improves quality, stricter grounding trades helpfulness for accuracy, and model sweeping can dramatically reduce cost while improving performance.
Brex open-sources CrabTrap, an LLM-as-a-judge HTTP proxy that filters and secures AI agent traffic before it reaches production services.
A curated list of 11 links shared daily to help people learn AI evaluation techniques, covering evals, observability, LLM-as-judge, and agent evaluation.