Trimming the Long-Tail of Visual World Modeling Evaluation
Summary
This paper introduces Tailor-Bench, a benchmark that systematically evaluates visual world models on irregular physical interactions, revealing a long-tail gap in generalization where models perform well on common scenarios but degrade on unconventional and impossible ones.
Source: https://huggingface.co/papers/2606.24256
Abstract
Current visual world models demonstrate limited generalization beyond common physical interactions, struggling with rare and irregular scenarios despite achieving realism on standard benchmarks.
Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.
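The evaluation grid described in the abstract (three scenario modes crossed with two generation settings) can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the names `ScenarioMode`, `Setting`, `Result`, `mean_score_by_mode`, and `long_tail_gap` are hypothetical, as is the idea of a single per-case score in [0, 1]. The abstract does not specify Tailor-Bench's actual data format or metrics.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class ScenarioMode(Enum):
    # The three scenario modes named in the abstract.
    REGULAR = "regular"                # common tool-task pairs
    UNCONVENTIONAL = "unconventional"  # attribute-compatible substitute tools
    IMPOSSIBLE = "impossible"          # attribute-violating tools


class Setting(Enum):
    # The two complementary generation settings named in the abstract.
    PREDICTIVE = "predictive"    # outcome must be inferred without guidance
    DESCRIPTIVE = "descriptive"  # target outcome is specified in the prompt


@dataclass(frozen=True)
class Result:
    mode: ScenarioMode
    setting: Setting
    score: float  # hypothetical per-case score in [0, 1]; not the paper's metric


def mean_score_by_mode(results):
    """Average the (hypothetical) scores within each scenario mode."""
    by_mode = {}
    for r in results:
        by_mode.setdefault(r.mode, []).append(r.score)
    return {mode: mean(scores) for mode, scores in by_mode.items()}


def long_tail_gap(results):
    """Return the drop in mean score relative to Regular, per non-Regular mode."""
    means = mean_score_by_mode(results)
    regular = means[ScenarioMode.REGULAR]
    return {
        mode: regular - value
        for mode, value in means.items()
        if mode is not ScenarioMode.REGULAR
    }
```

Under this sketch, a positive value from `long_tail_gap` for the Unconventional or Impossible mode would correspond to the degradation the abstract reports. The same grouping could be repeated per `Setting` to compare predictive and descriptive generation.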
Similar Articles
WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
WBench is a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns, providing automatic sub-metrics and diagnostic insights. It reveals that no single model excels across all dimensions.
Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?
The Visual Aesthetic Benchmark (VAB) evaluates multimodal models' ability to judge aesthetics through comparative selection, revealing significant gaps versus human experts and showing that fine-tuning on expert examples improves accuracy.
WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark
Introduces WorldBench, a visually diverse multimodal reasoning benchmark that reveals significant limitations in current multimodal large language models' visual understanding.
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
MemoBench is a diagnostic benchmark for evaluating video generation models' memory consistency in dynamically changing environments, where objects disappear and reappear in updated states. It includes 360 ground-truth clips and an evaluation suite combining automated metrics with VQA-based assessment, revealing insights into memory consistency challenges.
@rohanpaul_ai: Most video models look better than they understand and Video quality is only the easiest thing to notice. LongCat just …
LongCat released WBench, a benchmark for video world models that tests control, memory, instruction-following, and physical plausibility across 289 cases and 20 models, finding that no model excels in all dimensions, highlighting the gap between video quality and true world simulation.