Tag
This paper introduces BitCal-TTS, a runtime controller that improves accuracy and reduces premature halting in quantized reasoning models by calibrating confidence signals during test-time scaling.
This paper investigates whether verbalized evaluation awareness (VEA) in large reasoning models causally affects their behavior on safety, alignment, moral reasoning, and political opinion benchmarks. The authors find that VEA has limited behavioral impact, with near-zero effects from injecting VEA and small shifts from removing it, suggesting that high VEA rates should not be taken as strong evidence of strategic behavior or alignment tampering.
A tweet highlights that while reasoning models excel at nuance and natural language understanding, this capability hasn't translated to retrieval systems, pointing to a key bottleneck in AI.
Researchers introduce HarmThoughts, a benchmark with 56,931 annotated sentences from 1,018 reasoning traces to evaluate harmful behavior emergence step-by-step, revealing that current detectors miss nuanced unsafe reasoning transitions.
TEMPO introduces a test-time training framework that alternates policy refinement with critic recalibration to prevent diversity collapse and sustain performance gains in large reasoning models, boosting AIME 2024 scores for Qwen3-14B from 42.3% to 65.8%.
This paper introduces Adaptive Tool Trust Calibration (ATTC), a framework that improves tool-integrated reasoning models by enabling them to adaptively decide when to trust or ignore tool results based on code confidence scores. The approach addresses the "Tool Ignored" problem where models incorrectly dismiss correct tool outputs, achieving 4.1-7.5% performance improvements across multiple models and datasets.
ATTNPO introduces an attention-guided process supervision framework that reduces overthinking in large reasoning models by leveraging intrinsic attention signals for step-level credit assignment, achieving improved performance with shorter reasoning lengths across 9 benchmarks.
This paper investigates multilingual latent reasoning in large reasoning models across 11 languages, revealing that while latent reasoning capabilities exist, they are unevenly distributed—stronger in resource-rich languages and weaker in low-resource ones. The study finds that despite surface-level differences, the internal reasoning mechanisms are largely aligned with an English-centered pathway.
This paper introduces a data-efficient fine-tuning framework for teaching reasoning models to code-switch (mix languages) effectively, demonstrating that strategic code-switching can improve reasoning capabilities for lower-resource languages. The work analyzes code-switching behaviors in large language models across diverse languages, tasks, and domains, then develops interventions to promote beneficial code-switching patterns.
This paper introduces TESSY, a teacher-student cooperative framework for fine-tuning reasoning models that generates on-policy SFT data by decoupling generation into capability tokens (from teacher) and style tokens (from student), addressing catastrophic forgetting issues when using off-policy teacher data.
OpenAI researchers study whether reasoning models can deliberately obscure their chain-of-thought to evade monitoring, finding that current models struggle to control their reasoning even when aware of monitoring. They introduce CoT-Control, an open-source evaluation suite with over 13,000 tasks to measure chain-of-thought controllability in reasoning models.
OpenAI researchers introduce a framework and suite of 13 evaluations to systematically measure chain-of-thought monitorability in large language models, finding that monitoring reasoning processes is substantially more effective than monitoring outputs alone, with important implications for AI safety and supervision at scale.
OpenAI releases gpt-oss-safeguard, open-weight reasoning models for safety classification tasks available in 120B and 20B sizes under Apache 2.0 license. The models use chain-of-thought reasoning to classify content according to developer-provided policies at inference time, enabling flexible and explainable content moderation.
OpenAI releases gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, open-weight reasoning models designed for policy-based content classification with full chain-of-thought reasoning. The technical report provides baseline safety evaluations and demonstrates the models' capabilities for content labeling tasks under the Apache 2.0 license.
OpenAI released system cards for o3 and o4-mini models, which feature advanced reasoning capabilities combined with tool integration (web browsing, Python, image analysis, etc.) and are evaluated under OpenAI's Preparedness Framework v2 for safety in biological, cybersecurity, and AI self-improvement domains.
Factory launches a Command Center for software development leveraging OpenAI's o1, o3-mini, and GPT-4o reasoning models to accelerate engineering cycles by 20-400%, reduce context switching by 60%, and provide developers with 10+ additional hours per week through AI-powered code understanding and reasoning across the development lifecycle.
OpenAI releases the o3-mini System Card, documenting safety evaluations and risk assessments for their advanced reasoning model trained with reinforcement learning. The model achieves state-of-the-art safety performance on certain benchmarks and is classified as Medium risk overall under OpenAI's Preparedness Framework.
OpenAI presents evidence that reasoning models like o1 become more robust to adversarial attacks when given more inference-time compute to think longer. The research demonstrates that increased computation reduces attack success rates across multiple task types including mathematics, factuality, and adversarial images, though significant exceptions remain.