reinforcement-learning

#reinforcement-learning

Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL)

Reddit r/LocalLLaMA ↗ · 3h ago

Ai2 released Tmax-27B, a terminal-agent LLM trained with DPPO (RL) on Qwen3.6-27B. The author provides importance-matrix-calibrated GGUF quantizations, with a grafted MTP draft head for speculative decoding, that achieve competitive performance on agentic benchmarks even at very low bit-widths.

@samsja19: prime-rl can now train 1T parameters MoE blazingly fast, under 5 minutes per step, or 1k steps in ~3 days To achieve th…

X AI KOLs Following ↗ · 19h ago Cached

Prime Intellect released prime-rl v0.6.0, enabling reinforcement learning at trillion-parameter MoE scale with sub-5-minute step times and optimized inference, training, and rollout.

@eliebakouch: every infra piece you need to know to do RL on GLM-5 https://primeintellect.ai/blog/rl-at-1t-scale…

X AI KOLs Timeline ↗ · 19h ago Cached

Prime Intellect releases prime-rl v0.6.0, enabling efficient reinforcement learning at trillion-parameter scale on large Mixture-of-Experts models, with sub-5-minute step times and optimizations for asynchronous RL.

VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

Hacker News Top ↗ · 20h ago Cached

This technical report introduces VibeThinker-3B, a 3B parameter dense model that achieves frontier-level reasoning performance on benchmarks like AIME26 and LiveCodeBench, matching or exceeding much larger models such as DeepSeek V3.2 and GLM-5 through a combination of curriculum-based SFT, multi-domain RL, and offline self-distillation.

@Ankur_Samanta_: New work on credit assignment in multi-step reasoning RL post-training Introducing Self-Reset Policy Optimization (SRPO…

X AI KOLs Timeline ↗ · yesterday Cached

Self-Reset Policy Optimization (SRPO) addresses credit assignment in multi-step reasoning RL post-training by localizing the first wrong reasoning step and learning from counterfactual continuations without external supervision.

@kazukifujii: This vLLM blog post explains weight updates in RL + KV cache recompute in a very clear and illustrated way, and it also…

X AI KOLs Timeline ↗ · yesterday Cached

This article explains vLLM's weight syncing API for reinforcement learning, covering how it facilitates weight updates and KV cache recompute in RL training, with a focus on reducing complexity for training frameworks.

@cwolferesearch: I just published a blog on agentic RL that covers 10+ recent frameworks in the space. Here are the key takeaways… Link …

X AI KOLs Timeline ↗ · yesterday Cached

A blog post summarizing ten recent agentic RL frameworks and best practices, covering modular interfaces, trajectory structure, action masks, process rewards, advantage normalization, scalable rollouts, stability/exploration, and task curriculum.

@VukRosic99: Test Time Reinforcement Learning 1. Take an unlabeled question 2. Sample many answers from the LLM 3. Majority vote → t…

X AI KOLs Timeline ↗ · yesterday Cached

Introduces Test-Time Reinforcement Learning (TTRL), a method that uses majority voting on unlabeled data to create pseudo-labels for RL training, enabling self-improvement of LLMs without ground-truth answers. Achieves significant gains (e.g., +159-211% on AIME 2024 for Qwen-2.5-Math-7B).
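The majority-vote step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `majority_vote_rewards` is hypothetical, answers are assumed to be already extracted and normalized as strings, and the binary 1/0 reward is an assumption made for clarity.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a list of sampled final answers.

    Hypothetical helper: the most frequent answer becomes the pseudo-label,
    and each sample gets reward 1.0 if it matches that label, else 0.0.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns the most frequent answer; on ties, the answer
    # encountered first in `answers` wins.
    pseudo_label, _ = counts.most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Five sampled answers to one unlabeled question (made-up values).
samples = ["42", "41", "42", "7", "42"]
label, rewards = majority_vote_rewards(samples)
```

The resulting rewards would then be fed to whatever RL update the training loop uses; no ground-truth answer is consulted at any point.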

Training Open Models for Agentic Phone Use

Hugging Face Daily Papers ↗ · yesterday Cached

PhoneBuddy combines real and mock app environments to train open models for agentic phone use, achieving 45.33% task success rate on real phones through mixed reinforcement learning, showing that mock-app training complements real-app training.

Tmax: A simple recipe for terminal agents

Hugging Face Daily Papers ↗ · yesterday Cached

Tmax introduces a simplified RL training recipe for terminal agents, achieving state-of-the-art performance with a 9B parameter model using a novel data generation taxonomy and an expanded open-source dataset.

Nvidia's Autonomous Robotics Research (6 minute read)

TLDR AI ↗ · yesterday Cached

ENPIRE is a framework that enables coding agents to autonomously improve robot manipulation policies through a real-world feedback loop, achieving 99% success on dexterous tasks like pin insertion and zip tie cutting.

@TheTuringPost: 10 open-source tools for the Agent RL stack ↓ OpenPipe ART verl-agent Agent Lightning Unsloth OpenRLHF SkyRL NVIDIA’s P…

X AI KOLs Timeline ↗ · 2d ago Cached

A curated roundup of 10 open-source tools for training AI agents using reinforcement learning, covering frameworks like OpenPipe ART, verl-agent, Agent Lightning, and Unsloth, with details on their use cases and strengths.

The data black hole at the center of AI

Reddit r/artificial ↗ · 2d ago Cached

This article analyzes why AI's sample efficiency is far lower than that of humans: frontier models require massive amounts of domain-specific data, while humans can learn from just a few examples. It argues that this "data black hole" is a core bottleneck in current AI development. Through several comparisons (annotation volume, robot manipulation, driving) and rebuttals of common objections, it demonstrates the severity of the gap and explores its implications for the goals of AI automation.

PolicyTrim: Boosting Intrinsic Policy Efficiency of Vision-Language-Action Models

Hugging Face Daily Papers ↗ · 2d ago Cached

PolicyTrim is a reinforcement learning-based post-training framework that improves action chunk utilization by 3× and reduces physical execution steps by 51.4% in Vision-Language-Action models, delivering up to 5.83× deployment speedup.

@dair_ai: // Self-play with a pinch of human data // Really cool paper combining human demonstrations and self-play RL. 30 minute…

X AI KOLs Following ↗ · 3d ago Cached

A research paper that combines a small amount of human demonstrations as a regularization objective with self-play reinforcement learning, enabling human-compatible driving policies using far less human data (30 minutes vs thousands of hours) and training in 15 hours on a single consumer GPU.

@TheTuringPost: An open-source Agent Reinforcement Trainer (ART) – plugs GRPO into any Python app → Your app defines the task and rewar…

X AI KOLs Timeline ↗ · 3d ago Cached

The Agent Reinforcement Trainer (ART) is an open-source framework that plugs GRPO-based RL into any Python app, enabling agents to learn from environment interaction via trajectory scoring and LoRA updates, with claims of outperforming OpenAI's o3 on email retrieval using a Qwen 2.5 14B model.
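For readers unfamiliar with GRPO, the group-relative scoring idea can be illustrated as follows. This is a generic sketch, not ART's API: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are all assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each trajectory's reward against its own sampling group.

    Hypothetical helper: subtract the group mean and divide by the group
    standard deviation, so above-average trajectories get positive
    advantages and below-average ones get negative advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four trajectories sampled on the same task (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed; a group in which every trajectory scores the same yields all-zero advantages and contributes no learning signal.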

@robertnishihara: A great example of the importance of disaggregation in RL. From the paper LLM generation alternates between prefill and…

X AI KOLs Following ↗ · 3d ago Cached

Robert Nishihara highlights a paper on disaggregating RL workloads, showing that using compute-optimized H800s for prefill and bandwidth-optimized H20s for decode can cut rollout times by 21-51% and 47% respectively, emphasizing that no single hardware type fits all stages.

@raydistributed: RollArt is an impressive example of disaggregation in large-scale RL. https://cse.ust.hk/~weiwa/papers/rollart-osdi26.p…

X AI KOLs Following ↗ · 3d ago Cached

RollArt presents a disaggregated architecture for large-scale reinforcement learning, demonstrating significant improvements in efficiency and scalability.

Multi-Head Attention-Based Feature Extractor Integration with Soft Actor-Critic for Porosity Prediction and Process Parameter Optimization in Additive Manufacturing

arXiv cs.AI ↗ · 3d ago Cached

This paper proposes a novel architecture integrating multi-head attention with the Soft Actor-Critic algorithm for porosity prediction and process parameter optimization in additive manufacturing, achieving faster convergence and higher rewards than standard RL methods.

Process-Verified Reinforcement Learning for Theorem Proving via Lean

arXiv cs.AI ↗ · 3d ago Cached

This paper presents Process-Verified Reinforcement Learning, using the Lean proof assistant as a process oracle to provide fine-grained tactic-level feedback during training, improving theorem proving performance.
