vlms

#vlms

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Hugging Face Daily Papers · 2026-06-08

Z-Reward is a teacher-student framework that decouples complex reasoning from efficient reward deployment for text-to-image training. It achieves 89.6% human preference accuracy with a 27B teacher and 88.6% with a 9B student, outperforming prior methods.

@jerryjliu0: Our team is at CVPR 2026 if you want to come say hi :)

X AI KOLs Following · 2026-06-04

Jerry Liu's team is presenting ParseBench, a comprehensive document understanding benchmark for VLMs, at CVPR 2026. The benchmark includes 2,000 pages of real-world enterprise documents with evaluation metrics for tables, charts, and visual grounding.

Decomposing how prompting steers behavior

arXiv cs.AI · 2026-06-03

This paper introduces a nested geometric decomposition framework to analyze how prompting reorganizes internal representations in large language and vision-language models. The authors show that affine transformations, particularly cross-dimensional linear mixing, are key to explaining prompt-induced behavioral changes.

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

arXiv cs.CL · 2026-05-25

Fast-dDrive is a block-diffusion VLA model for end-to-end autonomous driving that achieves state-of-the-art trajectory accuracy while delivering over 12x throughput speedup over autoregressive baselines, addressing the trade-off between high-fidelity planning and efficient inference for edge deployment.

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

arXiv cs.AI · 2026-05-25

This paper presents SpaceNum, a unified framework to evaluate how vision-language models (VLMs) understand numerical values in spatial contexts, finding that current models largely fail to ground numbers spatially and often perform close to random guessing.

GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

Hugging Face Daily Papers · 2026-05-11

GridProbe is a training-free inference method for long-video VLMs that uses posterior probing to adaptively select relevant frames, achieving sub-quadratic attention cost with minimal accuracy loss.

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

Hugging Face Daily Papers · 2026-05-07

GeoStack introduces a geometric framework for composing independently trained domain experts in vision-language models without catastrophic forgetting, achieving constant-time inference and a 10x reduction in geometric error.

Solving the “Whac-a-mole dilemma”: A smarter way to debias AI vision models

MIT News — Artificial Intelligence · 2026-04-29

Researchers from MIT, WPI, and Google propose WRING, a post-processing debiasing method for vision-language models that avoids the 'Whac-a-mole dilemma', in which removing one bias amplifies others.
