Improving mathematical reasoning with process supervision

OpenAI Blog Papers

Summary

OpenAI demonstrates that process supervision—rewarding intermediate reasoning steps rather than just final answers—improves mathematical reasoning while reducing alignment costs. This approach produces more interpretable, human-aligned reasoning without sacrificing model performance.

We've trained a model to achieve a new state-of-the-art in mathematical problem solving by rewarding each correct step of reasoning (“process supervision”) instead of simply rewarding the correct final answer (“outcome supervision”). In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.

Cached at: 04/20/26, 02:44 PM

# Improving mathematical reasoning with process supervision

Source: https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/

Process supervision has several alignment advantages over outcome supervision. It directly rewards the model for following an aligned chain-of-thought, since each step in the process receives precise supervision. Process supervision is also more likely to produce interpretable reasoning, since it encourages the model to follow a human-approved process. In contrast, outcome supervision may reward an unaligned process, and it is generally harder to scrutinize.

In some cases, safer methods for AI systems can lead to reduced performance[3](https://openai.com/index/improving-mathematical-reasoning-with-process-supervision/#citation-bottom-3), a cost which is known as an *alignment tax*. In general, any alignment tax may hinder the adoption of alignment methods, due to pressure to deploy the most capable model. Our results below show that process supervision in fact incurs a negative alignment tax, at least in the math domain. This could increase the adoption of process supervision, which we believe would have positive alignment side-effects.
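The distinction between the two reward schemes can be sketched in a few lines of code. This is a minimal illustration, not OpenAI's implementation: the per-step scores stand in for the output of a learned process reward model, and the product aggregation in `process_reward` is one assumed choice among several.

```python
def outcome_reward(final_answer: str, correct_answer: str) -> float:
    """Outcome supervision: a single reward based only on the final answer."""
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(step_scores: list[float]) -> float:
    """Process supervision (illustrative): aggregate per-step scores.
    Using the product means one flawed step penalizes the whole chain-of-thought."""
    reward = 1.0
    for score in step_scores:
        reward *= score
    return reward

# Toy chain-of-thought for "What is 12 * 15?"
steps = ["12 * 15 = 12 * 10 + 12 * 5", "= 120 + 60", "= 180"]

good_scores = [1.0, 1.0, 1.0]   # hypothetical reward-model scores per step
lucky_guess = [1.0, 0.1, 1.0]   # a flawed middle step that still reaches "180"

print(outcome_reward("180", "180"))   # 1.0 -- rewards the lucky guess too
print(process_reward(good_scores))    # 1.0
print(process_reward(lucky_guess))    # 0.1 -- the bad step is penalized
```

The point of the sketch is the last two lines: outcome supervision cannot distinguish a sound derivation from a lucky guess with the same final answer, while step-level scoring can.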

Similar Articles

ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning

arXiv cs.CL

ATTNPO introduces an attention-guided process supervision framework that reduces overthinking in large reasoning models by leveraging intrinsic attention signals for step-level credit assignment, achieving improved performance with shorter reasoning lengths across 9 benchmarks.

Evaluating chain-of-thought monitorability

OpenAI Blog

OpenAI researchers introduce a framework and suite of 13 evaluations to systematically measure chain-of-thought monitorability in large language models, finding that monitoring reasoning processes is substantially more effective than monitoring outputs alone, with important implications for AI safety and supervision at scale.

Reasoning models struggle to control their chains of thought, and that’s good

OpenAI Blog

OpenAI researchers study whether reasoning models can deliberately obscure their chain-of-thought to evade monitoring, finding that current models struggle to control their reasoning even when aware of monitoring. They introduce CoT-Control, an open-source evaluation suite with over 13,000 tasks to measure chain-of-thought controllability in reasoning models.

Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information

arXiv cs.CL

This paper proposes a novel Chain-of-Thought distillation framework that transfers teacher models' stepwise attention on key information to student models through a Mixture-of-Layers module for dynamic layer alignment. The method achieves consistent performance improvements on mathematical and commonsense reasoning benchmarks by explicitly guiding student models to progressively focus on critical information during reasoning.