GROW is a reinforcement learning framework that adapts GRPO to multi-turn VLM agent tasks by decomposing trajectories into state-action pairs and computing advantages between them, achieving state-of-the-art performance on over 800 Minecraft tasks.
AtlasVA is a teacher-free visual skill memory framework for vision-language model agents that uses spatial heatmaps, visual exemplars, and symbolic text skills to improve spatial decision-making in long-horizon tasks, outperforming baselines on several benchmarks.