grpo

#grpo

VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

Hacker News Top · yesterday

This technical report introduces VibeThinker-3B, a 3B parameter dense model that achieves frontier-level reasoning performance on benchmarks like AIME26 and LiveCodeBench, matching or exceeding much larger models such as DeepSeek V3.2 and GLM-5 through a combination of curriculum-based SFT, multi-domain RL, and offline self-distillation.

@TheTuringPost: An open-source Agent Reinforcement Trainer (ART) – plugs GRPO into any Python app → Your app defines the task and rewar…

X AI KOLs Timeline · 4d ago

The Agent Reinforcement Trainer (ART) is an open-source framework that plugs GRPO-based RL into any Python app, enabling agents to learn from environment interaction via trajectory scoring and LoRA updates, with claims of outperforming OpenAI's o3 on email retrieval using a Qwen 2.5 14B model.
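
As a rough illustration of the trajectory scoring step, GRPO compares each trajectory's reward with the other trajectories sampled for the same task. The sketch below is not ART's code: the function name, the `eps` term, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each trajectory's reward against its own sampling group.

    `rewards` holds one scalar score per trajectory for the same task.
    `eps` (an assumed constant) avoids division by zero when every
    trajectory in the group receives the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical scores for four trajectories of one task.
scores = [0.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(scores)
```

Trajectories scoring above the group mean get a positive advantage and are reinforced; those below get a negative one. A group with identical scores yields all-zero advantages and so contributes no learning signal.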

Reward as An Agent for Embodied World Models

arXiv cs.AI · 4d ago

This paper introduces Reward as an Agent and DynDiff-GRPO to address reward hacking and limited exploration in reinforcement learning for embodied world models, achieving significant accuracy gains.

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

arXiv cs.AI · 4d ago

MetaResearcher proposes a framework for training deep research agents using self-reflective reinforcement learning in adversarial virtual environments, addressing limitations of static environments and fact-retrieval-only tasks.

@SergioPaniego: continuous batching just landed in TRL for GRPO at 64 generations it runs faster and uses less VRAM than plain generate…

X AI KOLs Following · 4d ago

Continuous batching has been added to TRL for GRPO, improving speed and VRAM usage without needing vLLM. The tweet explains how it works and when to use it.

@akshay_pachaar: Karpathy's prediction about RL is coming true now! He called reward functions unreliable and argued that a single rewar…

X AI KOLs Following · 4d ago

Karpathy's critique of reward functions in RL is addressed by OpenPipe's ART framework using RULER, which allows natural language reward definitions evaluated by an LLM, replacing manual reward engineering.

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Hugging Face Daily Papers · 5d ago

DataClaw0 proposes an agentic data tailoring paradigm that uses learnable data processing to structure high-entropy multimodal streams, achieving robust alignment via SFT and GRPO on a novel benchmark.

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

arXiv cs.CL · 6d ago

This paper shows that a carefully crafted data recipe for long-context reinforcement learning, using minimal outcome-based GRPO, significantly improves reasoning across multiple models and benchmarks, and transfers to agentic tasks like GAIA and BrowseComp.

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

arXiv cs.LG · 6d ago

This paper demonstrates that selecting the SFT checkpoint with the highest pass@1 for GRPO can fail because SFT overtraining compresses output diversity, leading to entropy collapse and rank inversion in reinforcement learning. Experiments on Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B show that pre-RL entropy is positively associated with GRPO outcome, and a two-stage diagnostic can detect high-risk checkpoints.
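
To make "pre-RL entropy" concrete, one generic way to measure output diversity is the mean Shannon entropy of a checkpoint's next-token distributions. The sketch below assumes that definition; the paper's exact entropy measure and its two-stage diagnostic are not given in this summary, and the function names and toy distributions are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_sequence_entropy(step_distributions):
    """Average per-token entropy over the decoding steps of one sample."""
    total = sum(token_entropy(p) for p in step_distributions)
    return total / len(step_distributions)

# Toy distributions over a 4-token vocabulary (made-up numbers).
diverse = [[0.25, 0.25, 0.25, 0.25]] * 2    # spread-out predictions
collapsed = [[0.97, 0.01, 0.01, 0.01]] * 2  # near-deterministic predictions

diverse_entropy = mean_sequence_entropy(diverse)
collapsed_entropy = mean_sequence_entropy(collapsed)
```

Under this definition, a checkpoint that has been pushed toward near-deterministic outputs scores much lower than one that still spreads probability mass across alternatives.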

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

arXiv cs.CL · 2026-06-17

This paper introduces Dynamic Rollout Editing (DRE), a training-time intervention to reduce overthinking in GRPO-style reinforcement learning for reasoning models. DRE edits successful trajectories by preserving the solution-reachable prefix and preferring verified shorter edits, weakening the preference for unnecessary thinking.

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Hugging Face Daily Papers · 2026-06-17

STARE addresses policy entropy collapse in GRPO-based reinforcement learning for large language models by introducing surprisal-guided token-level advantage reweighting and target-entropy regulation, achieving 4%-8% accuracy gains on AIME benchmarks.

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

arXiv cs.CL · 2026-06-16

This paper introduces CoRA, a GRPO-based reinforcement learning framework that aligns LLM confidence with generated rationales to improve the reliability of chain-of-thought reasoning, achieving up to 26.51% reduction in misalignment error across multiple benchmarks.

@neural_avb: Locally generating GRPO-like rollouts with my SLM, and using this tiny RM as the rubric. Next I'll be RL training on fr…

X AI KOLs Timeline · 2026-06-11

Neural_avb releases a lightweight Answer-eq Reward Model for RL training on QA tasks, claiming 80% agreement with an external judge LM and faster scoring than F1, ROUGE, or BERTScore.

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

arXiv cs.CL · 2026-06-11

ProcessThinker introduces a practical post-training pipeline that provides step-level process rewards without training an explicit process reward model. It uses rollout-based rewards to give dense credit assignment for multi-step reasoning in multimodal LLMs, consistently improving performance on video benchmarks.

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Hugging Face Daily Papers · 2026-06-11

SWITCH is a switchable latent reasoning framework that uses explicit boundary tokens to enable trainable and interpretable recurrent hidden-state reasoning via on-policy reinforcement learning, outperforming prior approaches.

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Hugging Face Daily Papers · 2026-06-11

InterleaveThinker introduces a multi-agent pipeline with planner and critic agents to enable interleaved text-image generation for existing image generators, achieving performance comparable to state-of-the-art models and improving reasoning benchmarks.

@akshay_pachaar: https://x.com/akshay_pachaar/status/2064700531600458093

X AI KOLs Following · 2026-06-10

This article explains how to use GRPO to fine-tune an LLM (Qwen3-8B) for reliable JSON structured output, improving schema accuracy from 62% to 82%, surpassing GPT-4.1's 58%.
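
For a sense of what a structured-output reward might look like, the sketch below scores a completion by whether it parses as JSON and whether the expected fields have the expected types. It is not the article's reward function: the schema, the field names, and the 0.5/0.5 weighting are all hypothetical.

```python
import json

# Hypothetical target schema: field name -> expected Python type.
REQUIRED_FIELDS = {"name": str, "age": int}

def json_schema_reward(completion, required=REQUIRED_FIELDS):
    """Return a reward in [0, 1] for one model completion (a string).

    0.0 if the text is not a JSON object; otherwise 0.5 for parsing,
    plus up to 0.5 in proportion to the correctly typed required fields.
    """
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    matched = sum(1 for key, typ in required.items()
                  if isinstance(obj.get(key), typ))
    return 0.5 + 0.5 * matched / len(required)

full = json_schema_reward('{"name": "Ada", "age": 36}')
partial = json_schema_reward('{"name": "Ada"}')
invalid = json_schema_reward("not json at all")
```

A graded reward like this gives GRPO a signal even when no completion in a group is fully correct, whereas a strict pass/fail check would leave such groups with identical scores.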

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

arXiv cs.AI · 2026-06-10

This paper introduces DiRL, a direction-aware reinforcement learning framework that distinguishes reasoning-driven diversity from memorization-driven diversity in LLM exploration. It extracts an internal reasoning-memorization direction from model representations and shapes rewards to prioritize reasoning-aligned exploration, showing improvements on math and general reasoning benchmarks.

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

arXiv cs.LG · 2026-06-09

This paper proposes LEAF, a retrospective tree-based reinforcement learning method for speech-aware large language model post-training that improves credit assignment without online branching. LEAF outperforms GRPO on speech question answering and speech translation benchmarks.

N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

Hugging Face Daily Papers · 2026-06-09

N-GRPO introduces semantic neighbor mixing in the GRPO framework to enhance mathematical reasoning diversity while preserving semantic consistency, achieving improvements on math benchmarks and out-of-distribution tasks.
