Tag
IDEAL proposes an in-depth alignment framework for discrete representation autoencoding, jointly aligning quantized tokens with shallow and deep vision foundation model (VFM) features to achieve superior reconstruction and generation performance.
Proposes an attention-guided encoder-decoder for longitudinal medical visual question answering that uses a frozen DINO-based mask generator and auxiliary losses to improve consistency and interpretability, and achieves strong results on the Medical-Diff-VQA benchmark.
The SOCO benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions. It reveals gaps between language-grounded localization and visual correspondence, and is strongly predictive of downstream task performance.
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.