@0xLogicrw:


Summary

Former OpenAI researcher Jiayi Weng proposed a new paradigm called "Heuristic Learning", which uses large language models to generate and iteratively modify Python code to solve reinforcement learning tasks. Knowledge is stored in interpretable code rather than neural network parameters, which avoids catastrophic forgetting. The approach achieves strong results on Atari and MuJoCo benchmarks, and the code has been open-sourced.

Former OpenAI post-training core member Jiayi Weng has proposed, in his personal capacity, a new reinforcement learning paradigm called "Heuristic Learning" and open-sourced all experimental code. He had Codex (GPT-5.4) repeatedly play the Atari game Breakout, but GPT-5.4 itself was never retrained. What improved was the game-strategy code that GPT-5.4 wrote. The loop works like this: GPT-5.4 first writes a Python strategy for Breakout, runs an episode, watches the replay, identifies where it missed the ball, then modifies the code and runs again. After several iterations, the strategy's score rose from 387 to a perfect 864. No neural network was trained at any point; the gains came purely from the AI repeatedly revising if-else rules, adjusting landing predictions, and adding infinite-loop detection. The final code included a ball-trajectory predictor, a stuck-ball detector, regression tests, and experiment logs; it had grown into a complete software system.

The core difference from traditional reinforcement learning lies in where the learned knowledge is stored. Traditional methods compress knowledge into neural network parameters, which are unreadable to humans and prone to overwriting old knowledge when learning new tasks (catastrophic forgetting). Weng's approach reverses this: the knowledge is code. Code is readable, modifiable, and can be locked down with tests, so learning new things does not overwrite old skills.

Beyond the perfect Breakout score, he also reached deep-RL-level performance (over 6,000 points) on MuJoCo Ant (a simulated walking ant) and approached the PPO baseline on the full Atari57 suite of 57 games. Weng also explicitly delineated the boundaries: pure code cannot handle complex perception tasks, such as recognizing images with Python if-else rules.
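To make the idea concrete, here is a minimal illustrative sketch (not Weng's released code; all names are invented for this example) of the kind of hand-written policy the loop iterates on: a ball-trajectory predictor that accounts for wall bounces, plus a paddle rule that chases the predicted landing point.

```python
def predict_landing_x(x, y, vx, vy, paddle_y=0.0, width=1.0):
    """Predict where the ball will cross paddle_y, reflecting off side walls.

    Coordinates are normalized: x in [0, width], y increasing upward.
    Returns None if the ball is not currently falling.
    """
    if vy >= 0:  # ball moving up or level: nothing to intercept yet
        return None
    t = (paddle_y - y) / vy          # time until the ball reaches paddle height
    raw = x + vx * t                 # landing x if there were no walls
    # Fold the unbounded x back into [0, width] to model wall bounces:
    # the trajectory "unfolds" with period 2*width.
    period = 2 * width
    folded = raw % period
    return folded if folded <= width else period - folded


def paddle_action(paddle_x, x, y, vx, vy, dead_zone=0.02):
    """Heuristic policy: step the paddle toward the predicted landing point."""
    target = predict_landing_x(x, y, vx, vy)
    if target is None or abs(target - paddle_x) < dead_zone:
        return "NOOP"
    return "RIGHT" if target > paddle_x else "LEFT"
```

Functions like these are exactly what the model inspects after a failed episode and patches, e.g. correcting the wall-bounce fold or tuning the dead zone, with regression tests locking in each fix.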
His envisioned endgame is a hybrid architecture: at the bottom, lightweight neural networks handle vision and perception; in the middle, heuristic learning handles real-time logic and safety rules; at the top, large models review logs and modify code, periodically updating themselves with high-quality data accumulated from the lower layers. Handwritten rules were abandoned in the past not because they were useless, but because humans could not maintain them. Now that AI can write code quickly and well, this old approach has become viable again.
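The three-layer split could be sketched, purely hypothetically (every function here is an illustrative stub, not part of Weng's design or code), as:

```python
def perceive(frame):
    """Bottom layer: a lightweight neural net would map raw pixels to a
    symbolic state. Stubbed as a passthrough for illustration."""
    return frame  # e.g. {"ball": (x, y), "paddle": x}


def heuristic_policy(state):
    """Middle layer: interpretable real-time rules, editable and testable."""
    ball_x, paddle_x = state["ball"][0], state["paddle"]
    if abs(ball_x - paddle_x) < 0.05:
        return "NOOP"
    return "RIGHT" if ball_x > paddle_x else "LEFT"


def review_and_patch(logs, policy_source):
    """Top layer: a large model would read episode logs and emit a patched
    policy. Stubbed as identity; in practice this is a code-writing model."""
    return policy_source
```

The middle layer stays auditable and fast, while the slow, expensive model only runs offline to review logs and rewrite code.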

Similar Articles

Building more with GPT-5.1-Codex-Max

OpenAI Blog

OpenAI introduces GPT-5.1-Codex-Max, a new agentic coding model with improved reasoning, token efficiency, and the ability to maintain coherent work across millions of tokens through a 'compaction' mechanism. The model is faster, more intelligent, and can sustain long-running tasks for hours or days, representing a significant advancement in AI-assisted software engineering.

Introducing GPT-5.3-Codex

OpenAI Blog

OpenAI introduces GPT-5.3-Codex, an advanced agentic coding model that combines frontier coding capabilities with reasoning and professional knowledge, achieving state-of-the-art performance on SWE-Bench Pro and Terminal-Bench while being 25% faster than its predecessor.

Coding and design with GPT-5

OpenAI Blog

OpenAI announces GPT-5 capabilities for coding and design tasks, demonstrating advanced applications of the latest model across software development and creative design workflows.

Addendum to GPT-5 system card: GPT-5-Codex

OpenAI Blog

OpenAI has released GPT-5-Codex, a version of GPT-5 optimized for agentic coding tasks, trained with reinforcement learning on real-world coding environments. It is available via Codex CLI, IDE extensions, GitHub, and ChatGPT mobile, with comprehensive safety measures including sandboxing and prompt injection mitigations.