Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Hugging Face Daily Papers, 05/12/26, 12:00 AM

Summary

The Visual Aesthetic Benchmark (VAB) evaluates multimodal models' ability to judge aesthetics through comparative selection, revealing significant gaps versus human experts and showing that fine-tuning on expert examples improves accuracy.

Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
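The headline metric above counts a task as solved only when the best and the worst image are both identified correctly under every one of three random orderings of the candidates. A minimal sketch of that scoring rule follows. The judge signature, the task dictionary layout, and the names `task_is_solved` and `strict_accuracy` are assumptions made for illustration, not the benchmark's actual interface.

```python
import random
from typing import Callable, Sequence

# Hypothetical judge interface: receives the candidates in presentation order
# and returns (position_of_best, position_of_worst) within that order.
Judge = Callable[[Sequence[str]], tuple[int, int]]


def task_is_solved(judge: Judge, candidates: Sequence[str], best: str,
                   worst: str, n_permutations: int = 3, seed: int = 0) -> bool:
    """True only if the judge picks the labeled best AND worst image
    under every random ordering of the candidates."""
    rng = random.Random(seed)
    for _ in range(n_permutations):
        order = list(candidates)
        rng.shuffle(order)  # random candidate order for this trial
        best_pos, worst_pos = judge(order)
        if order[best_pos] != best or order[worst_pos] != worst:
            return False  # a single miss fails the whole task
    return True


def strict_accuracy(judge: Judge, tasks: Sequence[dict],
                    n_permutations: int = 3, seed: int = 0) -> float:
    """Fraction of tasks solved under the all-permutations criterion.

    Each task is assumed to be a dict with keys "candidates", "best", "worst".
    """
    solved = sum(
        task_is_solved(judge, t["candidates"], t["best"], t["worst"],
                       n_permutations, seed)
        for t in tasks
    )
    return solved / len(tasks)
```

Requiring agreement across orderings penalizes a judge that relies on candidate position rather than image content, which is presumably why a criterion of this kind is stricter than single-pass accuracy.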

Original Article

Cached at: 05/14/26, 04:16 AM

Source: https://huggingface.co/papers/2605.12684

Abstract

Current multimodal models struggle to match human expert aesthetic judgment in comparative image selection tasks, as demonstrated by the Visual Aesthetic Benchmark which reveals significant performance gaps and shows that fine-tuning on expert examples can improve accuracy.

Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators’ direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with labels derived from the consensus of 10 independent expert judges per task. Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, we find that the strongest system identifies both the best and the worst image correctly across three random permutations of the candidate order in only 26.5% of tasks, far below the 68.9% achieved by human experts. Fine-tuning a 35B-parameter model on 2,000 expert examples brings its accuracy close to that of a 397B-parameter open-weight model, suggesting that the comparative signal in VAB is transferable. Together, these results expose a clear and measurable gap between current multimodal models and expert aesthetic judgment, and VAB provides the first set-based, expert-grounded testbed on which that gap can be tracked and closed.
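The controlled study compares rankings derived from an annotator's scalar scores with the same annotator's direct comparisons. One simple way to picture such a check is to take the highest- and lowest-scored image in each candidate set and see whether they match the annotator's directly chosen best and worst. The sketch below does only that. The record layout and function names are hypothetical, and the paper's actual alignment measure is not stated in this abstract and may differ.

```python
def extremes_from_scores(scores: dict[str, float]) -> tuple[str, str]:
    """Best and worst image implied by per-image scalar scores.

    Ties resolve to the first image in dict order, which is an
    arbitrary choice made for this sketch.
    """
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst


def extremes_from_ranking(ranking: list[str]) -> tuple[str, str]:
    """Best and worst image from a direct ranking listed best-first."""
    return ranking[0], ranking[-1]


def agreement_rate(records: list[dict]) -> float:
    """Fraction of candidate sets where the score-derived best and worst
    both match the directly ranked best and worst.

    Each record is assumed to hold one annotator's judgments of one set:
    {"scores": {image: score, ...}, "ranking": [best, ..., worst]}.
    """
    matches = sum(
        extremes_from_scores(r["scores"]) == extremes_from_ranking(r["ranking"])
        for r in records
    )
    return matches / len(records)
```

A low rate under a check like this would mean that scoring images one at a time loses the ordering the annotator expresses when seeing the set side by side.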

Similar Articles

Trimming the Long-Tail of Visual World Modeling Evaluation

Blind-Spots-Bench: Evaluating Blind Spots in Multimodal Models

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Open benchmark: how well can multimodal LLMs read a calendar week-view from a screenshot? Humans ~99%, Q4 local models.....

@rohanpaul_ai: Most video models look better than they understand and Video quality is only the easiest thing to notice. LongCat just …
