CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Hugging Face Daily Papers

Summary

Contrastive Reflection (CORE) is a non-parametric algorithm that generates concise, interpretable insights from comparing successful and unsuccessful reasoning traces, enabling faster and more efficient self-improvement for language models with fewer samples and rollouts than existing methods.

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
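The abstract describes the core idea only at a high level, so the following is a minimal sketch of what a contrastive reflection step might look like, not the authors' implementation. Every name here (`Trace`, `contrastive_reflection`, `augment_prompt`, the `reflect` callable standing in for a language model call) is hypothetical. Pairing a successful and an unsuccessful attempt on the same problem, and capping the number of stored insights, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """One past attempt at a problem (hypothetical structure)."""
    problem: str
    reasoning: str
    reward: float  # verifiable reward; 1.0 is assumed to mean "correct"


def split_by_outcome(traces: List[Trace], threshold: float = 1.0) -> Tuple[List[Trace], List[Trace]]:
    """Separate traces into successes and failures using the reward."""
    successes = [t for t in traces if t.reward >= threshold]
    failures = [t for t in traces if t.reward < threshold]
    return successes, failures


def build_reflection_prompt(success: Trace, failure: Trace) -> str:
    """Ask the model to contrast two attempts at the same problem."""
    return (
        f"Problem: {success.problem}\n\n"
        f"Successful attempt:\n{success.reasoning}\n\n"
        f"Unsuccessful attempt:\n{failure.reasoning}\n\n"
        "In one sentence, state a general reasoning strategy or constraint "
        "that explains why the first attempt succeeded and the second failed."
    )


def contrastive_reflection(
    traces: List[Trace],
    reflect: Callable[[str], str],  # assumed wrapper around a language model call
    max_insights: int = 5,
) -> List[str]:
    """Turn success/failure pairs into short natural-language insights."""
    successes, failures = split_by_outcome(traces)
    insights: List[str] = []
    for s in successes:
        for f in failures:
            if s.problem != f.problem:
                continue  # assumption: only contrast attempts at the same problem
            insight = reflect(build_reflection_prompt(s, f)).strip()
            if insight and insight not in insights:
                insights.append(insight)
            if len(insights) >= max_insights:
                return insights
    return insights


def augment_prompt(problem: str, insights: List[str]) -> str:
    """Prepend stored insights to a new problem; no weights are updated."""
    if not insights:
        return problem
    bullet_list = "\n".join(f"- {i}" for i in insights)
    return f"Insights from earlier attempts:\n{bullet_list}\n\nProblem: {problem}"
```

In this sketch the learned knowledge lives entirely in the `insights` list, which is what makes the approach non-parametric: improving the model means editing a short list of sentences that a person can read, rather than changing weights or storing full reasoning traces.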

Cached at: 06/08/26, 07:17 PM


Source: https://huggingface.co/papers/2605.28742

Abstract

Contrastive Reflection (CORE) improves language model reasoning by analyzing differences between successful and unsuccessful attempts to generate concise, interpretable insights that enable faster and more efficient self-improvement compared to traditional parametric and non-parametric approaches.

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.



