Tag
This paper introduces Self-Evaluation Elicitation (SEE), which uses calibration-coupled reinforcement learning and masked distillation to elicit latent judge calibration in base LLMs with minimal data, improving calibration across benchmarks while preserving answer quality.
The paper proposes a method that uses mismatched, incorrect drafts from a weaker model to elicit stronger reasoning in a more capable learner via GRPO, achieving state-of-the-art results for Mathstral-7B on the MATH-500 and AIME benchmarks.