@andykonwinski: 3 take-aways from chatting w/ top AI researchers last month: - Evals are the “source code” of AI agents (48:35) - BigAI…
Summary
Summary of three key takeaways from conversations with leading AI researchers at CAISconf, covering the importance of evaluations for AI agents, the trade-offs between industry and academia, and a novel pedagogical RL approach.
Cached at: 06/26/26, 04:12 PM
3 take-aways from chatting w/ top AI researchers last month:
- Evals are the “source code” of AI agents (48:35)
- BigAI labs: $ + anonymous impact. Academia -> impact w/ your own individual voice (44:00)
- Pedagogical RL: agent solves problem it knows the answer to (unintuitive!); reward solutions that don’t take shortcuts, then distill them into a student model. (53:40)
Laude Institute (@LaudeInstitute): At @CAISconf last month, @andykonwinski sat down with researchers on the conference floor – @matei_zaharia @istoica05 @lateinteraction @dawnsongtweets @gneubig @pgasawa @JonSaadFalcon @heathercmiller @ryanmart3n @alexgshaw @profjoeyg @swyx and Ioannis Ioannidis – to talk