SpatialAct is a new simulator-grounded benchmark that probes whether VLM agents can reason coherently about space and translate that reasoning into actions in 3D environments under multi-turn feedback. Experiments reveal a significant reasoning-to-action gap: current VLMs struggle to maintain spatial beliefs and produce reliable actions, even though they perform well on isolated reasoning tasks.
This paper introduces satisfiable drift, a failure mode in which multi-turn reasoning systems silently violate prior commitments while remaining internally consistent, and which dominates contradictions as a source of error. The authors present DRIFT-Bench, a benchmark of 816 problems, and find that after repair, 98-100% of residual errors are drift errors.
This paper introduces MedAction, a framework for training LLMs on active, multi-turn clinical diagnosis by simulating iterative test ordering and hypothesis updates. It presents a new dataset, MedAction-32K, and demonstrates state-of-the-art performance for open-source models on medical benchmarks.