Tag
This paper introduces Tailor-Bench, a benchmark that systematically evaluates visual world models on irregular physical interactions. It reveals a long-tail generalization gap: models perform well on common scenarios but degrade on unconventional and impossible ones.