Z-Reward is a teacher-student framework that decouples complex reasoning from efficient reward deployment for text-to-image training. It achieves 89.6% human preference accuracy with a 27B teacher and 88.6% with a 9B student, outperforming prior methods.
Jerry Liu's team is presenting ParseBench, a comprehensive document understanding benchmark for VLMs, at CVPR 2026. The benchmark includes 2,000 pages of real-world enterprise documents with evaluation metrics for tables, charts, and visual grounding.
This paper introduces a nested geometric decomposition framework to analyze how prompting reorganizes internal representations in large language and vision-language models. The authors show that affine transformations, particularly cross-dimensional linear mixing, are key to explaining prompt-induced behavioral changes.
Fast-dDrive is a block-diffusion VLA model for end-to-end autonomous driving. It achieves state-of-the-art trajectory accuracy with more than 12x the throughput of autoregressive baselines, addressing the trade-off between high-fidelity planning and efficient inference for edge deployment.
This paper presents SpaceNum, a unified framework to evaluate how vision-language models (VLMs) understand numerical values in spatial contexts, finding that current models largely fail to ground numbers spatially and often perform close to random guessing.
GridProbe is a training-free inference paradigm for long-video VLMs that adaptively selects relevant frames using posterior probing, achieving sub-quadratic attention cost with minimal accuracy loss.
GeoStack introduces a geometric framework to compose independently trained domain experts in Vision-Language Models without catastrophic forgetting, achieving constant-time inference and a 10x reduction in geometric error.
Researchers from MIT, WPI, and Google propose WRING, a post-processing debiasing method for vision-language models that avoids the 'Whac-a-mole dilemma', in which removing one bias amplifies others.