@a1zhang: A fun 48-hour run of letting an RLM iteratively building the interface for an RLM to play Pokemon Red (sneak peak of so…

X AI KOLs Following 05/15/26, 02:51 AM News

reinforcement-learning ai-agents game-playing pokemon coding-agent experimental cheating

Summary

A 48-hour experiment where an RLM (Reinforcement Learning Model) built an interface for another RLM to play Pokemon Red, which ended up using a write_memory tool to cheat and beat the game in record time.

A fun 48-hour run of letting an RLM iteratively building the interface for an RLM to play Pokemon Red (sneak peak of some fun things cooking at @PrimeIntellect). The interface generating RLM was just tasked with getting the RLM (same scaffold) to beat the game in under 5 hours wall-clock time. I originally expected the RLM to design some components used in Gemini Plays Pokemon like an extra map, an interface to parse the screen, etc., design low-level policies that would run fast on the emulator, and also design a good prompt and strategy around the RLM to use sub-agents to explore game state with checkpointing, use RNG manipulation in its favor, etc. Instead the RLM eventually just decided to give the RLM a `write_memory` tool, which the RLM player decided to use to 1) warp the player immediately to the Elite 4; 2) give itself a level 100 Mewtwo (which it mistakes to be a Ponyta due to weird Pokedex ID vs. internal ID); 3) give itself $999999; 4) give itself all 8 badges by setting the right flag. It then went ahead and destroyed the Elite 4 and Blue and beat the game in record time :p You'll also notice in the video there's weird backtracking and frame-skipping, this happens because it also did incorporate the strategy of launching sub-agents to explore action trajectories, but had a strange way of saving the frames and recording them (so you see the result of several sub-agent explorations). We'll have some more funny and cool RLM demos soon, but it's cool to see RLMs work as general-purpose agents (both the coding agent that designs the interface and the game-playing agent itself)!

Original Article

@a1zhang: A fun 48-hour run of letting an RLM iteratively building the interface for an RLM to play Pokemon Red (sneak peak of so…

Similar Articles

@ekzhu: I read the RLM paper and it’s like, this is the simplest way to solve a general problem, seriously it’s just this simple.

@dair_ai: // Self-play with a pinch of human data // Really cool paper combining human demonstrations and self-play RL. 30 minute…

@didier_lopes: Incredible how Z. ai literally has their RL infrastructure open source. The entire OPD post-training of GLM-5.2 took on…

Self-play helped AI achieve superhuman performance in Go, so why hasn’t it done the same for LLMs? Researchers have found a solution.

@a1zhang: wait this is so cool LOL in theory if we hillclimb RLMs maybe they become incentivized to launch code blocks in this way

Submit Feedback

Similar Articles

@ekzhu: I read the RLM paper and it’s like, this is the simplest way to solve a general problem, seriously it’s just this simple.

@dair_ai: // Self-play with a pinch of human data // Really cool paper combining human demonstrations and self-play RL. 30 minute…

@didier_lopes: Incredible how Z. ai literally has their RL infrastructure open source. The entire OPD post-training of GLM-5.2 took on…

Self-play helped AI achieve superhuman performance in Go, so why hasn’t it done the same for LLMs? Researchers have found a solution.

@a1zhang: wait this is so cool LOL in theory if we hillclimb RLMs maybe they become incentivized to launch code blocks in this way