llm-judging

#llm-judging

SenseJudge: Human-Centric Preference-Driven Judgment Framework

arXiv cs.CL · 2026-06-03

SenseJudge is a human-centric framework for customizable LLM judging that adapts to diverse user preferences, outperforming existing methods. It also introduces SenseBench, a benchmark derived from real-world multi-turn interactions.

#llm-judging

PaperBench: Evaluating AI’s Ability to Replicate AI Research

OpenAI Blog · 2025-04-02

OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research. Agents are asked to replicate 20 ICML 2024 papers, and their work is scored against 8,316 individually gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves a replication score of only 21%, below human PhD-level performance, highlighting current limitations in autonomous research capabilities.
