I created an LLM post-training method called RPS. Preliminary results show that it improved Qwen3-8B's program synthesis reliability. [R]
Summary
RPS is a two-stage LLM post-training method inspired by neuroscience that combines curriculum learning with learning rate decay. Preliminary results show improved program synthesis reliability on Qwen3-8B compared with equal-learning-rate training.
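The summary gives no implementation details for RPS, so the following is only a rough sketch of what a two-stage setup that pairs a curriculum with a decayed learning rate might look like, next to an equal-learning-rate baseline. Everything here is an assumption made for illustration: the function names, the `difficulty` field, the easy-to-hard ordering, the 50/50 split, and the values of `base_lr` and `decay` are hypothetical and are not taken from the original post.

```python
from typing import Dict, List, Tuple


def split_curriculum(
    examples: List[Dict], stage1_fraction: float = 0.5
) -> Tuple[List[Dict], List[Dict]]:
    """Hypothetical curriculum: sort by an assumed 'difficulty' score and
    split the sorted list into a stage-1 part and a stage-2 part."""
    ordered = sorted(examples, key=lambda e: e["difficulty"])
    cut = int(len(ordered) * stage1_fraction)
    return ordered[:cut], ordered[cut:]


def two_stage_lr(
    step: int,
    stage1_steps: int,
    stage2_steps: int,
    base_lr: float = 2e-5,   # assumed value, not from the post
    decay: float = 0.1,      # assumed value, not from the post
) -> float:
    """Hypothetical schedule: full learning rate in stage 1,
    a decayed learning rate in stage 2."""
    if not 0 <= step < stage1_steps + stage2_steps:
        raise ValueError("step is outside the training run")
    return base_lr if step < stage1_steps else base_lr * decay


def equal_lr(step: int, base_lr: float = 2e-5) -> float:
    """Baseline for comparison: the same learning rate at every step."""
    return base_lr


if __name__ == "__main__":
    data = [
        {"id": 1, "difficulty": 3},
        {"id": 2, "difficulty": 1},
        {"id": 3, "difficulty": 2},
        {"id": 4, "difficulty": 4},
    ]
    stage1, stage2 = split_curriculum(data)
    print([e["id"] for e in stage1], [e["id"] for e in stage2])
    for step in (0, 9, 10, 19):
        print(step, two_stage_lr(step, 10, 10), equal_lr(step))
```

In a real run, the value returned for each step would be passed to the optimizer, and the comparison described in the summary would train once with each schedule on the same data.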
Similar Articles
@rasbt: Crazy model! It actually uses the old Qwen2.5-Coder-3B stack and got really great performance with their post-training …
A 3B parameter model using the Qwen2.5-Coder-3B stack achieves coding benchmark scores comparable to Claude Opus 4.5, with detailed post-training techniques including synthetic data, filtering, two-stage SFT, and a novel RL method (MGPO).
RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training
Harvard researchers challenge the standard LLM training pipeline by showing RL can be effectively applied during pre-training rather than only after SFT, finding that data composition matters more than model scale, and proposing parallel averaging of RL and SFT objectives that outperforms sequential approaches while preserving general capabilities.
Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism
This paper proposes PAT, an adaptive tensor parallelism method that dynamically reconfigures TP during the generation stage of synchronous RLHF training to mitigate long-tail generation bottlenecks. Evaluations on LLaMA3.1-8B and Qwen3-14B show reductions of up to 34.6% in generation latency and up to 27.2% in end-to-end iteration latency.
ExpRL: Exploratory RL for LLM Mid-Training
ExpRL is a new RL-based mid-training method that uses human-written reference solutions as dense reward scaffolds (never shown to the policy) to improve LLM reasoning, achieving significant gains on hard math benchmarks like AIME-2026.
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
This paper proposes the LLM-as-Environment-Engineer framework, where a policy model analyzes failures to automatically redesign the training environment for reinforcement learning, and introduces MAPF-FrozenLake as a controllable testbed. The framework, using Qwen3-4B, outperforms larger models like GPT and Gemini, showing that policy learning improves the model's ability to diagnose weaknesses.