Tag
Introduces the Target Viewpoint Reproduction (TVR) task and the TVRBench benchmark for evaluating foundation models' ability to actively adjust 3D viewpoints to match target images. Experiments reveal significant limitations in current open- and closed-source models; a unified post-training framework boosts success rates from ~12% to ~51%.
Introduces ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence built on OmniGibson, covering 10 task categories and 29 subcategories. Experiments show that active exploration substantially outperforms passive approaches and that failures stem mainly from action blindness rather than perception, revealing a metacognitive gap between models and humans.