benchmark-scaling


How Inference Compute Shapes Frontier LLM Evaluation

arXiv cs.AI · 3d ago

This paper systematically studies how inference-time compute (token budgets, context compaction, and repeated submissions) affects frontier LLM performance on challenging benchmarks. It shows that scores depend on the evaluation protocol and argues that evaluations should report capability as a function of inference compute.
