This paper introduces JAMEL, a framework that jointly trains agentic memory and exploration policies using novelty signals, enabling efficient exploration in open-ended environments with reduced computational costs.
This paper introduces GRLO, a reinforcement learning post-training method that generalizes strongly across multiple domains (math, code, etc.) from only 5K prompts and 22.7 GPU hours, making it substantially more compute- and data-efficient than in-domain RLVR baselines.