Trimming the Long-Tail of Visual World Modeling Evaluation

Hugging Face Daily Papers

Summary

This paper introduces Tailor-Bench, a benchmark that systematically evaluates visual world models on irregular physical interactions, revealing a long-tail gap in generalization where models perform well on common scenarios but degrade on unconventional and impossible ones.

Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.

Cached at: 06/30/26, 07:35 AM

Source: https://huggingface.co/papers/2606.24256

Abstract

Current visual world models demonstrate limited generalization beyond common physical interactions, struggling with rare and irregular scenarios despite achieving realism on standard benchmarks.

Physical interactions follow a long-tailed distribution: a set of common and regular interactions dominates human experience and visual data, while a broad spectrum of rare and irregular interactions remains underrepresented. Although recent visual world models, including image and video generation models, achieve impressive realism on existing benchmarks, they primarily focus on simulating common physical interactions. This raises a central question: Do current visual world models internalize and generalize physical principles? In this work, we introduce Tailor-Bench, a benchmark that challenges world models to simulate irregular physical interactions. To enable systematic evaluation, we design three scenario modes that progressively challenge model reasoning: Regular scenarios reflect common tool-task pairs, Unconventional scenarios replace conventional tools with attribute-compatible substitutes to test affordance generalization, and Impossible scenarios introduce attribute-violating tools to probe constraint awareness. Additionally, we design two complementary settings under a unified evaluation protocol: predictive generation requires inferring outcomes without guidance, while descriptive generation specifies the target outcome for faithful realization. Our experimental results reveal a clear long-tail gap in physical world modeling: performance degrades from Regular to Unconventional and Impossible scenarios, indicating limited generalization beyond common interactions. Failure analysis further shows that models rely on superficial visual patterns: image models fail to realize correct state changes, while video models further suffer from temporal inconsistencies.
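The evaluation layout described in the abstract (three scenario modes crossed with two generation settings) can be pictured with a small data model. The Python sketch below is illustrative only and is not the authors' code: the `Case` fields, the nail-driving examples, the [0, 1] score scale, and the `long_tail_gap` definition (Regular mean minus the mean of the two tail modes) are all assumptions, and the scores are toy placeholders rather than results from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class ScenarioMode(Enum):
    REGULAR = "regular"                # common tool-task pair
    UNCONVENTIONAL = "unconventional"  # attribute-compatible substitute tool
    IMPOSSIBLE = "impossible"          # attribute-violating tool


class Setting(Enum):
    PREDICTIVE = "predictive"    # outcome must be inferred by the model
    DESCRIPTIVE = "descriptive"  # target outcome is given in the prompt


@dataclass(frozen=True)
class Case:
    """One hypothetical benchmark item; field names are assumptions."""
    task: str
    tool: str
    mode: ScenarioMode
    setting: Setting
    score: float  # assumed judge score in [0, 1]


def mean_score_by_mode(cases):
    """Average the scores of all cases that share a scenario mode."""
    buckets = {mode: [] for mode in ScenarioMode}
    for case in cases:
        buckets[case.mode].append(case.score)
    return {mode: mean(scores) for mode, scores in buckets.items() if scores}


def long_tail_gap(by_mode):
    """Assumed gap metric: Regular mean minus the mean of the tail modes."""
    tail_modes = (ScenarioMode.UNCONVENTIONAL, ScenarioMode.IMPOSSIBLE)
    tail = [by_mode[mode] for mode in tail_modes if mode in by_mode]
    return by_mode[ScenarioMode.REGULAR] - mean(tail)


# Toy placeholder cases and scores, NOT taken from the paper.
cases = [
    Case("drive a nail", "hammer", ScenarioMode.REGULAR, Setting.PREDICTIVE, 1.0),
    Case("drive a nail", "flat rock", ScenarioMode.UNCONVENTIONAL, Setting.PREDICTIVE, 0.5),
    Case("drive a nail", "sponge", ScenarioMode.IMPOSSIBLE, Setting.PREDICTIVE, 0.0),
]

by_mode = mean_score_by_mode(cases)
gap = long_tail_gap(by_mode)
print({mode.value: score for mode, score in by_mode.items()}, gap)
```

Grouping by `Setting` instead of `ScenarioMode` would follow the same pattern, which is one way to compare predictive and descriptive generation under a single protocol.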

Get this paper in your agent:

hf papers read 2606.24256

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Hugging Face Daily Papers

The Visual Aesthetic Benchmark (VAB) evaluates multimodal models' ability to judge aesthetics through comparative selection, revealing significant gaps versus human experts and showing that fine-tuning on expert examples improves accuracy.

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

Hugging Face Daily Papers

MemoBench is a diagnostic benchmark for evaluating video generation models' memory consistency in dynamically changing environments, where objects disappear and reappear in updated states. It includes 360 ground-truth clips and an evaluation suite combining automated metrics with VQA-based assessment, revealing insights into memory consistency challenges.