physical-tool-use

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

arXiv cs.CL · 2026-06-10

This paper introduces PhysTool-Bench, a benchmark for evaluating multimodal large language models' ability to recognize and plan the use of physical tools in real-world scenes. The authors find that even the best model identifies only 58.7% of tools and completes just 21.0% of queries end-to-end, revealing a two-level deficit in perception and functional commonsense.
