llm-judging

SenseJudge: Human-Centric Preference-Driven Judgment Framework

arXiv cs.CL · 2026-06-03

SenseJudge is a human-centric framework for customizable LLM judging that adapts to diverse user preferences, outperforming existing methods. It also introduces SenseBench, a benchmark derived from real-world multi-turn interactions.

PaperBench: Evaluating AI’s Ability to Replicate AI Research

OpenAI Blog · 2025-04-02

OpenAI introduces PaperBench, a benchmark that evaluates AI agents' ability to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 papers, with the work broken down into 8,316 gradable tasks. The best-performing model (Claude 3.5 Sonnet) achieves a replication score of only 21%, below human PhD-level performance, which highlights current limitations in autonomous research capabilities.
