The Visual Aesthetic Benchmark (VAB) evaluates how well multimodal models judge aesthetics through comparative selection. It reveals significant gaps relative to human experts and shows that fine-tuning on expert examples improves accuracy.
This paper introduces UNO, an Understanding-Oriented Post-Training framework that uses comprehension tasks as supervisory signals to enhance image generation and editing in unified multimodal models.
Researchers introduce GSI-Bench, the first benchmark to quantify generative spatial intelligence in multimodal models by evaluating whether generated images comply with 3D spatial constraints. Fine-tuning on the authors' synthetic dataset improves both spatial editing fidelity and downstream spatial understanding, showing that generative training can strengthen spatial reasoning.