Tag
Introduces IOP, a framework that internalizes outcome supervision into process supervision for reasoning reinforcement learning, enabling fine-grained credit assignment without external annotations.
ATTNPO introduces an attention-guided process supervision framework that reduces overthinking in large reasoning models by leveraging intrinsic attention signals for step-level credit assignment, achieving improved performance with shorter reasoning lengths across 9 benchmarks.
OpenAI demonstrates that process supervision—rewarding intermediate reasoning steps rather than just final answers—improves mathematical reasoning while reducing alignment costs. This approach produces more interpretable, human-aligned reasoning without sacrificing model performance.