Trimming the Long-Tail of Visual World Modeling Evaluation

Hugging Face Daily Papers 06/23/26, 12:00 AM Papers

Summary

This paper introduces Tailor-Bench, a benchmark that systematically evaluates visual world models on irregular physical interactions, revealing a long-tail gap in generalization where models perform well on common scenarios but degrade on unconventional and impossible ones.

Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.
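The three scenario modes and two settings described above can be pictured as a small evaluation grid. The following is a minimal sketch only, not the benchmark's actual code or schema: the class names, the `Sample` fields, the `long_tail_gap` helper, the example tool-task pairs, and the scores are all hypothetical placeholders chosen for illustration, and none of the numbers are results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class ScenarioMode(Enum):
    REGULAR = "regular"                # common tool-task pair
    UNCONVENTIONAL = "unconventional"  # attribute-compatible substitute tool
    IMPOSSIBLE = "impossible"          # attribute-violating tool


class Setting(Enum):
    PREDICTIVE = "predictive"    # model must infer the outcome itself
    DESCRIPTIVE = "descriptive"  # target outcome is given in the prompt


@dataclass(frozen=True)
class Sample:
    task: str
    tool: str
    mode: ScenarioMode
    setting: Setting
    score: float  # hypothetical per-sample score in [0, 1]


def mean_score_by_mode(samples):
    """Average the per-sample scores within each scenario mode."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[s.mode].append(s.score)
    return {mode: mean(scores) for mode, scores in buckets.items()}


def long_tail_gap(samples):
    """Score drop of each non-Regular mode relative to the Regular mode."""
    by_mode = mean_score_by_mode(samples)
    base = by_mode[ScenarioMode.REGULAR]
    return {
        mode: base - score
        for mode, score in by_mode.items()
        if mode is not ScenarioMode.REGULAR
    }


# Placeholder samples: tools and scores are invented, not taken from the paper.
samples = [
    Sample("cut paper", "scissors", ScenarioMode.REGULAR, Setting.PREDICTIVE, 0.75),
    Sample("cut paper", "scissors", ScenarioMode.REGULAR, Setting.DESCRIPTIVE, 0.75),
    Sample("cut paper", "ruler edge", ScenarioMode.UNCONVENTIONAL, Setting.PREDICTIVE, 0.5),
    Sample("cut paper", "ruler edge", ScenarioMode.UNCONVENTIONAL, Setting.DESCRIPTIVE, 0.5),
    Sample("cut paper", "sponge", ScenarioMode.IMPOSSIBLE, Setting.PREDICTIVE, 0.25),
    Sample("cut paper", "sponge", ScenarioMode.IMPOSSIBLE, Setting.DESCRIPTIVE, 0.25),
]

gaps = long_tail_gap(samples)
for mode, gap in gaps.items():
    print(f"{mode.value}: gap vs. regular = {gap:.2f}")
```

Under these assumptions, a positive gap for a mode means the model scored lower there than on Regular scenarios, which is the pattern the abstract reports qualitatively.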

Original Article

Cached at: 06/30/26, 07:35 AM

Source: https://huggingface.co/papers/2606.24256

Abstract

Current visual world models demonstrate limited generalization beyond common physical interactions, struggling with rare and irregular scenarios despite achieving realism on standard benchmarks.

Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.

Get this paper in your agent:

hf papers read 2606.24256

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

@rohanpaul_ai: Most video models look better than they understand and Video quality is only the easiest thing to notice. LongCat just …
