@0xLogicrw:


Summary

Former OpenAI researcher Jiayi Weng proposed a new paradigm called "Heuristic Learning", which uses large language models to generate and iteratively modify Python code to solve reinforcement learning tasks. Knowledge is stored in interpretable code rather than neural network parameters, which avoids catastrophic forgetting. The approach achieves strong results on Atari and MuJoCo benchmarks, and the code has been open-sourced.

Former OpenAI post-training core member Jiayi Weng has proposed, in his personal capacity, a new reinforcement learning paradigm called "Heuristic Learning" and open-sourced all experimental code. He had Codex (GPT-5.4) repeatedly play the Atari game Breakout, but GPT-5.4 itself was never retrained. What improved was the game-strategy code that GPT-5.4 wrote. The loop works like this: GPT-5.4 first writes a Python strategy for Breakout, runs an episode, watches the replay, identifies where it missed the ball, then modifies the code and runs again. After several iterations, the strategy's score rose from 387 to a perfect 864. No neural network was trained at any point; the gains came purely from the AI repeatedly revising if-else rules, adjusting landing predictions, and adding infinite-loop detection. The final code included a ball-trajectory predictor, a stuck-ball detector, regression tests, and experiment logs; it had grown into a complete software system.

The core difference from traditional reinforcement learning lies in where the learned knowledge is stored. Traditional methods compress knowledge into neural network parameters, which are unreadable to humans and prone to overwriting old knowledge when learning new tasks (catastrophic forgetting). Weng's approach reverses this: the knowledge is code. Code is readable, modifiable, and can be locked down with tests, so learning new things does not overwrite old skills.

Beyond the perfect Breakout score, he also reached deep-RL-level performance (over 6,000 points) on MuJoCo Ant (a simulated walking ant) and approached the PPO baseline on the full Atari57 suite of 57 games. Weng also explicitly delineated the boundaries: pure code cannot handle complex perception tasks, such as recognizing images with Python if-else rules.
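To make the idea concrete, here is a minimal illustrative sketch (not Weng's released code; all names are invented for this example) of the kind of hand-written policy the loop iterates on: a ball-trajectory predictor that accounts for wall bounces, plus a paddle rule that chases the predicted landing point.

```python
def predict_landing_x(x, y, vx, vy, paddle_y=0.0, width=1.0):
    """Predict where the ball will cross paddle_y, reflecting off side walls.

    Coordinates are normalized: x in [0, width], y increasing upward.
    Returns None if the ball is not currently falling.
    """
    if vy >= 0:  # ball moving up or level: nothing to intercept yet
        return None
    t = (paddle_y - y) / vy          # time until the ball reaches paddle height
    raw = x + vx * t                 # landing x if there were no walls
    # Fold the unbounded x back into [0, width] to model wall bounces:
    # the trajectory "unfolds" with period 2*width.
    period = 2 * width
    folded = raw % period
    return folded if folded <= width else period - folded


def paddle_action(paddle_x, x, y, vx, vy, dead_zone=0.02):
    """Heuristic policy: step the paddle toward the predicted landing point."""
    target = predict_landing_x(x, y, vx, vy)
    if target is None or abs(target - paddle_x) < dead_zone:
        return "NOOP"
    return "RIGHT" if target > paddle_x else "LEFT"
```

Functions like these are exactly what the model inspects after a failed episode and patches, e.g. correcting the wall-bounce fold or tuning the dead zone, with regression tests locking in each fix.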
His envisioned endgame is a hybrid architecture: at the bottom, lightweight neural networks handle vision and perception; in the middle, heuristic learning handles real-time logic and safety rules; at the top, large models review logs and modify code, periodically updating themselves with high-quality data accumulated from the lower layers. Handwritten rules were abandoned in the past not because they were useless, but because humans could not maintain them. Now that AI can write code quickly and well, this old approach has become viable again.
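The three-layer split could be sketched, purely hypothetically (every function here is an illustrative stub, not part of Weng's design or code), as:

```python
def perceive(frame):
    """Bottom layer: a lightweight neural net would map raw pixels to a
    symbolic state. Stubbed as a passthrough for illustration."""
    return frame  # e.g. {"ball": (x, y), "paddle": x}


def heuristic_policy(state):
    """Middle layer: interpretable real-time rules, editable and testable."""
    ball_x, paddle_x = state["ball"][0], state["paddle"]
    if abs(ball_x - paddle_x) < 0.05:
        return "NOOP"
    return "RIGHT" if ball_x > paddle_x else "LEFT"


def review_and_patch(logs, policy_source):
    """Top layer: a large model would read episode logs and emit a patched
    policy. Stubbed as identity; in practice this is a code-writing model."""
    return policy_source
```

The middle layer stays auditable and fast, while the slow, expensive model only runs offline to review logs and rewrite code.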

Similar Articles

Building more with GPT-5.1-Codex-Max

OpenAI Blog

OpenAI introduces GPT-5.1-Codex-Max, a new agentic coding model with improved reasoning, token efficiency, and the ability to maintain coherent work across millions of tokens through a 'compaction' mechanism. The model is faster, more intelligent, and can sustain long-running tasks for hours or days, representing a significant advancement in AI-assisted software engineering.

Introducing GPT-5.3-Codex

OpenAI Blog

OpenAI introduces GPT-5.3-Codex, an advanced agentic coding model that combines frontier coding capabilities with reasoning and professional knowledge, achieving state-of-the-art performance on SWE-Bench Pro and Terminal-Bench while being 25% faster than its predecessor.

Coding and design with GPT-5

OpenAI Blog

OpenAI announces GPT-5 capabilities for coding and design tasks, demonstrating advanced applications of the latest model across software development and creative design workflows.

Addendum to GPT-5 system card: GPT-5-Codex

OpenAI Blog

OpenAI has released GPT-5-Codex, a version of GPT-5 optimized for agentic coding tasks, trained with reinforcement learning on real-world coding environments. It is available via Codex CLI, IDE extensions, GitHub, and ChatGPT mobile, with comprehensive safety measures including sandboxing and prompt injection mitigations.