Tag
The paper introduces the Capability Frontier, a Pareto frontier over models that corrects for biases in single-model and single-run evaluations, showing that standard benchmarks miss up to 82% of model performance and that collective LLM capabilities are substantially underestimated.
This paper systematically studies how inference-time compute (token budgets, context compaction, repeated submissions) affects frontier LLM performance on challenging benchmarks, demonstrating that scores are protocol-dependent and advocating for evaluations that report capability as a function of inference compute.
Saagar Pateder analyzes the diminishing marginal returns of AI intelligence for consumer and enterprise tasks, and predicts that open-weight models will diffuse globally by 2029, based on historical trends in model performance and cost.
A discussion questioning what makes Anthropic's and OpenAI's agent implementations special, suggesting they may be little more than basic ReAct loops with tools, and asking about the gap between them and implementations built on local Ollama models.
A tweet referencing AI researcher Sebastien Bubeck suggests that the capabilities under discussion would require a more advanced model, such as a hypothetical GPT-5.5.