@xdotli: sharing my personal library on evals 1/n i put together the highest quality blogs, podcasts, papers, and projects on ev…
Summary
A Twitter thread sharing a curated personal library of high-quality blogs, podcasts, papers, and projects on AI evaluations (evals), inviting additions.
Cached at: 06/24/26, 10:22 AM
sharing my personal library on evals 1/n
i put together the highest quality blogs, podcasts, papers, and projects on evals. additions are welcome!
https://github.com/benchflow-ai/awesome-evals…
Similar Articles
@MaxForAI: You'd be hard-pressed to find a better eval resource library. If you're interested in eval, these are what you should read. Thanks to @xdotli for sharing.
Shares a curated library of AI evaluation (evals) resources, including high-quality blogs, podcasts, papers, and projects, compiled by Xiangyi Li.
@pauliusztin_: Every day, 100+ people ask me, "How can I learn AI evals?" I copy-paste these 11 links (every time): 1. AI evals & obse…
A curated list of 11 links shared daily to help people learn AI evaluation techniques, covering evals, observability, LLM-as-judge, and agent evaluation.
@adxtyahq: Good list. I'd add: - Dataset Engineering - https://huyenchip.com/machine-learning-systems-design/toc.html… - Product E…
A tweet thread compiling essential resources for AI engineering, covering dataset engineering, evaluations, context engineering, agent memory, MCP, observability, inference optimization, and security.
@systemdesignone: If you want to elevate your AI engineering career (in 2026), save these 20 GitHub repositories: 1 OpenClaw ↳ Runs a per…
A Twitter thread listing 20 essential GitHub repositories for AI engineering, covering tools, frameworks, and models for local AI agents, LLMs, image generation, and workflow automation.
Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results
Introduces Every Eval Ever, a shared schema and community-crowdsourced repository for standardizing AI evaluation results, with automatic converters and a hosted database spanning over 22k models and 2.2k benchmarks.