Tag
This paper geometrically analyzes why LLMs acting as judges agree strongly with each other but weakly with humans, finding that inter-LLM consensus reflects a collapsed subspace rather than true human alignment on subjective rubrics. Post-hoc calibration on human data improves alignment, but even calibrated LLMs fall short of human reliability.
This paper argues that traditional benchmarks both overestimate and underestimate frontier AI capabilities, and proposes 'open-world evaluations' (long-horizon, real-world tasks assessed qualitatively) as a complementary approach. It introduces the CRUX project and demonstrates an AI agent successfully publishing an iOS app to the App Store with minimal intervention.
This paper introduces Self-Supervised Prompt Optimization (SPO), a framework that optimizes LLM prompts without external references by comparing outputs, significantly reducing cost and data requirements.