This technical report introduces VibeThinker-3B, a 3B-parameter dense model that achieves frontier-level reasoning performance on benchmarks such as AIME26 and LiveCodeBench. It matches or exceeds much larger models such as DeepSeek V3.2 and GLM-5 through a combination of curriculum-based SFT, multi-domain RL, and offline self-distillation.
The Agent Reinforcement Trainer (ART) is an open-source framework that plugs GRPO-based RL into any Python app, enabling agents to learn from environment interaction via trajectory scoring and LoRA updates. Its authors claim that a Qwen 2.5 14B model trained with ART outperforms OpenAI's o3 on email retrieval.
This paper introduces Reward as an Agent and DynDiff-GRPO to address reward hacking and limited exploration in reinforcement learning for embodied world models, achieving significant accuracy gains.
MetaResearcher proposes a framework for training deep research agents using self-reflective reinforcement learning in adversarial virtual environments, addressing limitations of static environments and fact-retrieval-only tasks.
Continuous batching has been added to TRL for GRPO, improving speed and reducing VRAM usage without needing vLLM. The tweet explains how it works and when to use it.
OpenPipe's ART framework addresses Karpathy's critique of reward functions in RL with RULER, which lets rewards be defined in natural language and evaluated by an LLM, replacing manual reward engineering.
DataClaw0 proposes an agentic data tailoring paradigm that uses learnable data processing to structure high-entropy multimodal streams, achieving robust alignment via SFT and GRPO on a novel benchmark.
This paper shows that a carefully crafted data recipe for long-context reinforcement learning, using minimal outcome-based GRPO, significantly improves reasoning across multiple models and benchmarks, and transfers to agentic tasks like GAIA and BrowseComp.
This paper demonstrates that selecting the SFT checkpoint with the highest pass@1 for GRPO can fail because SFT overtraining compresses output diversity, leading to entropy collapse and rank inversion in reinforcement learning. Experiments on Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B show that pre-RL entropy is positively associated with GRPO outcome, and a two-stage diagnostic can detect high-risk checkpoints.
This paper introduces Dynamic Rollout Editing (DRE), a training-time intervention to reduce overthinking in GRPO-style reinforcement learning for reasoning models. DRE edits successful trajectories by preserving the solution-reachable prefix and preferring verified shorter edits, weakening the preference for unnecessary thinking.
STARE addresses policy entropy collapse in GRPO-based reinforcement learning for large language models by introducing surprisal-guided token-level advantage reweighting and target-entropy regulation, achieving 4%-8% accuracy gains on AIME benchmarks.
This paper introduces CoRA, a GRPO-based reinforcement learning framework that aligns LLM confidence with generated rationales to improve the reliability of chain-of-thought reasoning, achieving up to 26.51% reduction in misalignment error across multiple benchmarks.
Neural_avb releases a lightweight Answer-eq Reward Model for RL training on QA tasks, claiming 80% agreement with an external judge LM and faster scoring than F1, ROUGE, or BERTScore.
ProcessThinker introduces a practical post-training pipeline that provides step-level process rewards without training an explicit process reward model. It uses rollout-based rewards to give dense credit assignment for multi-step reasoning in multimodal LLMs, consistently improving performance on video benchmarks.
SWITCH is a switchable latent reasoning framework that uses explicit boundary tokens to enable trainable and interpretable recurrent hidden-state reasoning via on-policy reinforcement learning, outperforming prior approaches.
InterleaveThinker introduces a multi-agent pipeline with planner and critic agents to enable interleaved text-image generation for existing image generators, achieving performance comparable to state-of-the-art models and improving results on reasoning benchmarks.
This article explains how to use GRPO to fine-tune an LLM (Qwen3-8B) for reliable JSON structured output, improving schema accuracy from 62% to 82%, surpassing GPT-4.1's 58%.
This paper introduces DiRL, a direction-aware reinforcement learning framework that distinguishes reasoning-driven diversity from memorization-driven diversity in LLM exploration. It extracts an internal reasoning-memorization direction from model representations and shapes rewards to prioritize reasoning-aligned exploration, showing improvements on math and general reasoning benchmarks.
This paper proposes LEAF, a retrospective tree-based reinforcement learning method for speech-aware large language model post-training that improves credit assignment without online branching. LEAF outperforms GRPO on speech question answering and speech translation benchmarks.
N-GRPO introduces semantic neighbor mixing in the GRPO framework to enhance mathematical reasoning diversity while preserving semantic consistency, achieving improvements on math benchmarks and out-of-distribution tasks.