benchmark-scaling

How Inference Compute Shapes Frontier LLM Evaluation

arXiv cs.AI · 2d ago

This paper systematically studies how inference-time compute (token budgets, context compaction, repeated submissions) affects frontier LLM performance on challenging benchmarks, demonstrating that scores are protocol-dependent and advocating for evaluations that report capability as a function of inference compute.
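As an illustration of what "capability as a function of inference compute" can look like for the repeated-submissions axis, here is a minimal sketch. It is not the paper's protocol: it swaps in the standard unbiased pass@k estimator from the code-generation literature, and the function names (`pass_at_k`, `score_curve`) and the per-problem counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts is correct,
    given that c of n recorded attempts on a problem were correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 whenever k exceeds the number of failed attempts.
    return 1.0 - comb(n - c, k) / comb(n, k)


def score_curve(correct_counts: list[int], n: int, ks: list[int]) -> dict[int, float]:
    """Average pass@k over problems for each submission budget k."""
    return {
        k: sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
        for k in ks
    }


# Hypothetical data: 4 problems, 8 attempts each, number of correct attempts per problem.
correct_counts = [0, 1, 4, 8]
curve = score_curve(correct_counts, n=8, ks=[1, 2, 4, 8])

for k, score in curve.items():
    print(f"k={k}: {score:.3f}")
```

On this made-up data the same model scores about 0.41 with one submission and 0.75 with eight, which is the kind of protocol dependence the summary describes. Reporting the whole curve rather than a single point makes the submission budget explicit.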
