CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
Summary
Contrastive Reflection (CORE) is a non-parametric algorithm that generates concise, interpretable insights from comparing successful and unsuccessful reasoning traces, enabling faster and more efficient self-improvement for language models with fewer samples and rollouts than existing methods.
Cached at: 06/08/26, 07:17 PM
Source: https://huggingface.co/papers/2605.28742
Abstract
Contrastive Reflection (CORE) improves language model reasoning by analyzing differences between successful and unsuccessful attempts to generate concise, interpretable insights that enable faster and more efficient self-improvement compared to traditional parametric and non-parametric approaches.
Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
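To make the idea in the abstract concrete, here is a minimal Python sketch of one contrastive reflection step. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how traces are paired, how the reflection prompt is worded, or how insights are stored. The names `Trace`, `InsightMemory`, `build_contrast_prompt`, and `contrastive_reflection_step` are hypothetical, the reward is assumed to be binary, and `reflect` stands in for any language model call that maps a prompt to text.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """One problem attempt: the reasoning text and its verifiable reward."""
    problem: str
    reasoning: str
    reward: float  # assumption: 1.0 = verified correct, 0.0 = incorrect


@dataclass
class InsightMemory:
    """Holds short natural-language insights rather than raw traces."""
    insights: List[str] = field(default_factory=list)

    def add(self, insight: str) -> None:
        # Skip empty strings and exact duplicates to keep the memory compact.
        insight = insight.strip()
        if insight and insight not in self.insights:
            self.insights.append(insight)

    def as_prompt_prefix(self) -> str:
        # Text to prepend to future problem prompts (hypothetical wording).
        if not self.insights:
            return ""
        bullets = "\n".join(f"- {s}" for s in self.insights)
        return f"Insights from earlier attempts:\n{bullets}\n\n"


def build_contrast_prompt(success: Trace, failure: Trace) -> str:
    """Ask the model what the successful attempt did that the failed one did not."""
    return (
        f"Problem: {success.problem}\n\n"
        f"Successful attempt:\n{success.reasoning}\n\n"
        f"Unsuccessful attempt:\n{failure.reasoning}\n\n"
        "In one sentence, state a general strategy or constraint that "
        "explains why the first attempt succeeded and the second failed."
    )


def contrastive_reflection_step(
    traces: List[Trace],
    reflect: Callable[[str], str],
    memory: InsightMemory,
) -> int:
    """Contrast one success with one failure; return how many insights were added."""
    successes = [t for t in traces if t.reward > 0]
    failures = [t for t in traces if t.reward <= 0]
    if not successes or not failures:
        return 0  # no contrast is available without both outcomes
    before = len(memory.insights)
    # Assumption: pair the first success with the first failure.
    memory.add(reflect(build_contrast_prompt(successes[0], failures[0])))
    return len(memory.insights) - before
```

In use, `reflect` would wrap a model call, and `memory.as_prompt_prefix()` would be prepended to later prompts, so that learned knowledge travels as a few short sentences instead of stored traces or weight updates.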
Similar Articles
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
CoRD is a collaborative multi-teacher decoding framework that synthesizes reasoning trajectories through predictive perplexity scoring and beam search, enabling efficient distillation of large reasoning models with high-quality outputs and generalized performance.
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
CEPO improves reinforcement learning with verifiable rewards by using contrastive signals from rejected rollouts to distinguish decisive reasoning steps from filler tokens, achieving higher accuracy on multimodal math reasoning benchmarks compared to GRPO.
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning
Proposes ProxyCoT, a training framework that improves long-context reasoning in large language models by first obtaining chain-of-thought reasoning traces on short proxy contexts (via reinforcement learning or distillation) and then grounding them in full long contexts through supervised fine-tuning. Experiments show consistent improvements over baselines with reduced computational cost.
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
ReflectMT introduces a two-stage RL method that trains LRMs to internalize reflection, enabling single-pass high-quality translation with 94% fewer tokens than multi-step reasoning models like DeepSeek-R1.
CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection
Proposes the CORE framework that endows multimodal large language models with explicit conflict-capturing capability for generalizable manipulation detection, adapting to unseen manipulation types with few or zero samples.