I created an LLM post-training method called RPS. Preliminary results show that it improved Qwen3-8b's program synthesis reliability. [R]

Reddit r/MachineLearning 05/21/26, 04:19 PM Tools

post-training curriculum-learning learning-rate-decay program-synthesis qwen llm-training

Summary

RPS is a two-stage LLM post-training method inspired by neuroscience that combines curriculum learning with learning rate decay. In preliminary results on Qwen3-8b, it improved program synthesis reliability compared with a baseline that uses the same learning rate in both stages.

RPS is inspired by neuroscience. As humans, we learn basic skills as children, when neuroplasticity is high, and advanced skills as teens and adults, when neuroplasticity is lower. RPS trains a model in two stages. In stage 1, the model is trained on easy data with a high learning rate. In stage 2, it is trained on hard data with 10% of the stage 1 learning rate. RPS is essentially a combination of two existing ideas: curriculum learning and learning rate decay.

ARC-AGI 1 public eval scores:

Base model: Qwen3-8b

RPS: 4%

EPS (equal learning rate in both stages): 2.4%

Program synthesis stats (program executions without error):

RPS: 1145/1200 (about 95.4%)

EPS: 870/1200 (72.5%)

[https://iamjasonfeng.blogspot.com/2026/05/regressive-plasticity-schedule.html](https://iamjasonfeng.blogspot.com/2026/05/regressive-plasticity-schedule.html)

[https://github.com/iamjasonfeng/RPS](https://github.com/iamjasonfeng/RPS)
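The two-stage recipe could be sketched roughly as follows. This is a minimal illustration, not the code from the linked repository: the `Stage` class, the `build_two_stage_plan` function, the per-example difficulty scores, and the threshold used to split easy from hard data are all hypothetical, and the post does not say how difficulty is measured. Only the 10% learning rate ratio for stage 2 comes from the post.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage: which examples to use and at what learning rate."""
    name: str
    examples: List[str]
    learning_rate: float


def build_two_stage_plan(
    scored_examples: Sequence[Tuple[str, float]],
    difficulty_threshold: float,
    stage1_lr: float,
    stage2_lr_ratio: float = 0.1,
) -> List[Stage]:
    """Split data into easy/hard and assign a lower learning rate to stage 2.

    scored_examples: (example, difficulty) pairs; the difficulty score is an
        assumption made for this sketch.
    stage2_lr_ratio: 0.1 mirrors the "10% of the stage 1 learning rate" rule;
        1.0 gives an equal-learning-rate plan for comparison.
    """
    easy = [ex for ex, difficulty in scored_examples
            if difficulty <= difficulty_threshold]
    hard = [ex for ex, difficulty in scored_examples
            if difficulty > difficulty_threshold]
    return [
        Stage("stage1_easy", easy, stage1_lr),
        Stage("stage2_hard", hard, stage1_lr * stage2_lr_ratio),
    ]


if __name__ == "__main__":
    # Toy data: task names with made-up difficulty scores in [0, 1].
    data = [("task_a", 0.1), ("task_b", 0.9), ("task_c", 0.4)]

    rps_like = build_two_stage_plan(data, difficulty_threshold=0.5, stage1_lr=1e-4)
    equal_lr = build_two_stage_plan(
        data, difficulty_threshold=0.5, stage1_lr=1e-4, stage2_lr_ratio=1.0
    )

    for stage in rps_like + equal_lr:
        print(stage.name, stage.examples, stage.learning_rate)
```

In a real run, each `Stage` would drive one fine-tuning pass, with the optimizer's learning rate set to `stage.learning_rate` before training on `stage.examples`. The post does not state which learning rate the EPS baseline uses in both stages, so the `stage2_lr_ratio=1.0` plan here is only one plausible reading.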
