This paper introduces SPARK, a self-play reinforcement learning framework that leverages knowledge graphs derived from scientific literature to improve relational reasoning in vision-language models.
Cornell researchers propose POP, a self-play framework that lets an LLM generate its own rubrics and training pairs for open-ended tasks, boosting Qwen-2.5-7B on healthcare QA, creative writing, and instruction following without human labels.
Researchers from the University of Edinburgh propose a self-play framework that uses Liquid Haskell formal verification to train LLMs on semantic-equivalence reasoning, releasing the OpInstruct-HSx dataset (28k programs) and achieving 13.3pp accuracy gains on EquiBench.
STRATAGEM is a new framework for improving reasoning transferability in language models by using game self-play with a Reasoning Transferability Coefficient and Reasoning Evolution Reward to reinforce abstract, domain-agnostic reasoning patterns over game-specific heuristics. Experiments show strong improvements on mathematical reasoning, general reasoning, and code generation benchmarks.
OpenAI Five became the first AI system to defeat Dota 2 world champions using large-scale deep reinforcement learning with self-play, demonstrating superhuman performance on a complex game with long time horizons and imperfect information.
OpenAI demonstrates that agents trained in a hide-and-seek environment discover six distinct emergent strategies and tool-use behaviors through multi-agent competition, without explicit incentives for object interaction. This work suggests multi-agent co-adaptation can produce complex intelligent behavior through self-supervised learning.
OpenAI Five is a reinforcement learning agent that masters Dota 2 through self-play training with curriculum learning and strategic randomization, progressing from random behavior to executing complex human-level strategies.
OpenAI demonstrates that competitive self-play in simulated 3D robot environments enables AI agents to discover complex physical behaviors like tackling, ducking, and faking without explicit instruction, suggesting self-play will be fundamental to future powerful AI systems.
OpenAI describes iterative improvements to their Dota 2 bot during The International tournament, combining coaching with self-play to enhance agent performance through rapid training cycles and strategic refinements discovered during professional matches.
OpenAI created a bot that defeats world-class Dota 2 professionals in 1v1 matches using only self-play learning, without imitation learning or tree search. The achievement demonstrates progress toward AI systems that can accomplish complex goals in dynamic, multi-agent environments.
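The common thread across these entries is the self-play loop: an agent improves by competing against frozen snapshots of itself, with the opponent pool refreshed as the learner improves. Below is a minimal, hypothetical sketch of that idea, not any system above, using REINFORCE on the toy zero-sum game matching pennies (all names and hyperparameters are illustrative assumptions):

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample(probs):
    r = random.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Matching pennies: learner gets +1 if actions match, -1 otherwise.
def payoff(a, b):
    return 1.0 if a == b else -1.0

logits = [0.0, 0.0]        # learner's policy parameters (2 actions)
opponent = list(logits)    # frozen snapshot the learner trains against
lr = 0.05

for step in range(5000):
    p = softmax(logits)
    a = sample(p)                     # learner's move
    b = sample(softmax(opponent))     # frozen self-copy's move
    r = payoff(a, b)
    # REINFORCE update: push up log-prob of the taken action, scaled by reward
    for i in range(len(logits)):
        grad = (1.0 if i == a else 0.0) - p[i]
        logits[i] += lr * r * grad
    if step % 500 == 499:
        opponent = list(logits)       # periodically refresh the opponent snapshot

probs = softmax(logits)
print(probs)
```

In a zero-sum game like this, the equilibrium strategy is uniform randomization; self-play with periodic snapshot refreshes tends to orbit that equilibrium rather than exploit a fixed opponent, which is the same pressure that drives strategy discovery in the large-scale systems summarized above.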