Tag
Ai2 released Tmax-27B, a terminal-agent LLM trained on Qwen3.6-27B with DPPO (an RL method). The post's author provides importance-matrix-calibrated GGUF quantizations that remain competitive on agentic benchmarks even at very low bit-widths, along with a grafted MTP draft head for speculative decoding.
Prime Intellect released prime-rl v0.6.0, enabling reinforcement learning at trillion-parameter MoE scale with sub-5-minute step times and optimized inference, training, and rollout.
Prime Intellect releases prime-rl v0.6.0, enabling efficient reinforcement learning at trillion-parameter scale on large Mixture-of-Experts models, with sub-5-minute step times and optimizations for asynchronous RL.
This technical report introduces VibeThinker-3B, a 3B-parameter dense model that achieves frontier-level reasoning performance on benchmarks such as AIME26 and LiveCodeBench, matching or exceeding much larger models such as DeepSeek V3.2 and GLM-5 through a combination of curriculum-based SFT, multi-domain RL, and offline self-distillation.
Self-Reset Policy Optimization (SRPO) addresses credit assignment in multi-step reasoning RL post-training by localizing the first wrong reasoning step and learning from counterfactual continuations without external supervision.
This article explains vLLM's weight syncing API for reinforcement learning, covering how it facilitates weight updates and KV cache recompute in RL training, with a focus on reducing complexity for training frameworks.
A blog post summarizing ten recent agentic RL frameworks and best practices, covering modular interfaces, trajectory structure, action masks, process rewards, advantage normalization, scalable rollouts, stability/exploration, and task curriculum.
Introduces Test-Time Reinforcement Learning (TTRL), a method that uses majority voting on unlabeled data to create pseudo-labels for RL training, enabling self-improvement of LLMs without ground-truth answers. Achieves significant gains (e.g., +159-211% on AIME 2024 for Qwen-2.5-Math-7B).
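The majority-voting step described above can be illustrated with a minimal sketch. It assumes final answers have already been extracted as strings from several sampled completions for one unlabeled prompt; the function name `majority_vote_rewards` and the binary 1.0/0.0 reward are illustrative assumptions, not TTRL's actual code.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for answers sampled for one prompt.

    Hypothetical helper: the most frequent answer is treated as the
    pseudo-label, and each sample is rewarded for agreeing with it.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Ties are broken in favor of the answer that was seen first.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


label, rewards = majority_vote_rewards(["42", "41", "42", "7"])
print(label, rewards)
```

The resulting rewards would then feed whatever RL objective the training loop uses; no ground-truth answer is consulted at any point.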
PhoneBuddy combines real and mock app environments to train open models for agentic phone use, achieving 45.33% task success rate on real phones through mixed reinforcement learning, showing that mock-app training complements real-app training.
Tmax introduces a simplified RL training recipe for terminal agents, achieving state-of-the-art performance with a 9B parameter model using a novel data generation taxonomy and an expanded open-source dataset.
ENPIRE is a framework that enables coding agents to autonomously improve robot manipulation policies through a real-world feedback loop, achieving 99% success on dexterous tasks like pin insertion and zip tie cutting.
A curated roundup of 10 open-source tools for training AI agents using reinforcement learning, covering frameworks like OpenPipe ART, verl-agent, Agent Lightning, and Unsloth, with details on their use cases and strengths.
This article examines in depth how far AI's sample efficiency lags behind that of humans: frontier models need massive amounts of domain-specific data, whereas humans can learn from just a few examples. It argues that this "data black hole" is a core bottleneck in current AI development. Drawing on several comparisons (annotation volume, robot manipulation, driving) and rebutting common objections, the article shows how severe the gap is and explores its implications for the goal of AI automation.
PolicyTrim is a reinforcement learning-based post-training framework that improves action chunk utilization by 3× and reduces physical execution steps by 51.4% in Vision-Language-Action models, delivering up to 5.83× deployment speedup.
A research paper that uses a small amount of human demonstration data as a regularization objective for self-play reinforcement learning, yielding human-compatible driving policies with far less human data (30 minutes vs. thousands of hours) and training in 15 hours on a single consumer GPU.
The Agent Reinforcement Trainer (ART) is an open-source framework that plugs GRPO-based RL into any Python app, enabling agents to learn from environment interaction via trajectory scoring and LoRA updates, with claims of outperforming OpenAI's o3 on email retrieval using a Qwen 2.5 14B model.
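As a rough illustration of how trajectory scores become a learning signal in GRPO-style training, the sketch below computes group-relative advantages: each trajectory's score is centered on the mean of its group and scaled by the group's standard deviation. This is not ART's code; the function name, the use of the population standard deviation, and the `eps` value are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one group's trajectory scores to group-relative advantages.

    Hypothetical helper: `rewards` holds the scores of several trajectories
    sampled for the same task; `eps` guards against a zero spread.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Trajectories that score above their group's mean get positive advantages and those below get negative ones; a group in which every trajectory scores the same yields all zeros and so contributes no gradient.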
Robert Nishihara highlights a paper on disaggregating RL workloads, showing that using compute-optimized H800s for prefill and bandwidth-optimized H20s for decode can cut rollout times by 21-51% and 47% respectively, emphasizing that no single hardware type fits all stages.
RollArt presents a disaggregated architecture for large-scale reinforcement learning, demonstrating significant improvements in efficiency and scalability.
This paper proposes a novel architecture integrating multi-head attention with the Soft Actor-Critic algorithm for porosity prediction and process parameter optimization in additive manufacturing, achieving faster convergence and higher rewards than standard RL methods.
This paper presents Process-Verified Reinforcement Learning, using the Lean proof assistant as a process oracle to provide fine-grained tactic-level feedback during training, improving theorem proving performance.