model-capabilities

#model-capabilities

The Capability Frontier: Benchmarks Miss 82% of Model Performance

arXiv cs.AI · 6d ago

The paper introduces the Capability Frontier, a Pareto frontier over models that corrects for biases in single-model and single-run evaluations, showing that standard benchmarks miss up to 82% of model performance and that collective LLM capabilities are substantially underestimated.


How Inference Compute Shapes Frontier LLM Evaluation

arXiv cs.AI · 2026-06-17

This paper systematically studies how inference-time compute (token budgets, context compaction, repeated submissions) affects frontier LLM performance on challenging benchmarks, demonstrating that scores are protocol-dependent and advocating for evaluations that report capability as a function of inference compute.


Mythos-class models will diffuse throughout the world by 2029 (7 minute read)

TLDR AI · 2026-06-12

Saagar Pateder analyzes the diminishing marginal returns of AI intelligence for consumer and enterprise tasks, and predicts that open-weight models will diffuse globally by 2029, based on historical trends in model performance and cost.


Anthropic and OpenAI claim that their models are so powerful that they can “break” their sandbox… but what is so special about their agent implementation?

Reddit r/AI_Agents · 2026-05-16

A discussion questioning what makes Anthropic's and OpenAI's agent implementations special, suggesting they may just be basic ReAct loops with tools, and asking how large the gap is between them and implementations built on local Ollama models.


@SebastienBubeck: What he talks about couldn't have happened before GPT-5.5

X AI KOLs Following · 2026-05-10

A tweet from AI researcher Sebastien Bubeck stating that what the person he refers to describes could not have happened before GPT-5.5.
