evaluation-methods

#evaluation-methods

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

arXiv cs.CL · 2026-06-03

This paper uses a geometric analysis to explain why LLMs acting as judges agree strongly with one another but only weakly with humans. It finds that, on subjective rubrics, inter-LLM consensus reflects a collapsed subspace rather than true human alignment. Post-hoc calibration on human data improves alignment, but even calibrated LLMs fall short of human reliability.

Open-World Evaluations for Measuring Frontier AI Capabilities

arXiv cs.AI · 2026-05-22

This paper argues that traditional benchmarks both overestimate and underestimate frontier AI capabilities, and proposes 'open-world evaluations' (long-horizon, real-world tasks assessed qualitatively) as a complementary approach. It introduces the CRUX project and includes a demonstration in which an AI agent successfully published an iOS app to the App Store with minimal intervention.

Self-Supervised Prompt Optimization

Papers with Code Trending · 2025-02-07

This paper introduces Self-Supervised Prompt Optimization (SPO), a framework that optimizes LLM prompts without external references by comparing outputs, significantly reducing cost and data requirements.
