Highlights OpenAI researcher Noam Brown's argument that the true ceiling of LLM capabilities is far higher than current benchmarks show, because benchmarks are run with insufficient test-time compute, and that stronger models benefit more from additional computation. This poses a serious challenge for AI safety evaluation, as many dangerous capabilities may emerge only with long time horizons and high compute budgets.
Proposes UniScale, an online framework that unifies model routing and test-time scaling via contextual bandit optimization for better quality-cost trade-offs in LLM inference.
This paper introduces Reflection-Augmented Scaling (RAS), a method that uses execution feedback from failed Cypher queries to iteratively refine query generation via in-context learning, reducing execution error rates by 41-50% across multiple datasets and models.
Introduces Ethical Immanence, a new AI alignment paradigm that embeds ethical behavior into model architecture via loss function regularization and metacognitive detection, promising lower costs and inherent stability for open-source LLMs.
This paper investigates where and why output diversity collapses during post-training of language models, analyzing three OLMo 3 lineages (Think, Instruct, RL-Zero) across multiple tasks and metrics. The authors find that diversity collapse is determined primarily by training data composition and is embedded in model weights during training, and that it cannot be addressed at inference time alone.
A post reflecting on the DSPy framework's architecture built around signatures, modules, and optimizers, and noting its continued growth since 2022.
This paper investigates how 1D coarse-to-fine token structures in autoregressive models improve test-time search efficiency compared to classical 2D grid tokenization. The authors show that such ordered tokens enable better test-time scaling and even training-free text-to-image generation when guided by image-text verifiers.