This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.
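The random-attention ablation described above can be sketched in a few lines: compute one attention step with learned query-key logits, then with random logits passed through the same softmax normalization, and compare. This is a minimal illustrative sketch, not the paper's code; all names and dimensions here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq, dim = 8, 16
q = rng.normal(size=(seq, dim))
k = rng.normal(size=(seq, dim))
v = rng.normal(size=(seq, dim))

# Learned-style attention: weights from query-key similarity.
learned_w = softmax(q @ k.T / np.sqrt(dim))
# Ablation: random logits, identical softmax normalization.
random_w = softmax(rng.normal(size=(seq, seq)))

out_learned = learned_w @ v
out_random = random_w @ v
```

Both weight matrices are row-stochastic and produce outputs of the same shape, so the downstream network sees structurally identical inputs; the paper's claim is that task performance is often comparable too.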
This paper introduces SPARK, a self-play reinforcement learning framework that leverages knowledge graphs derived from scientific literature to improve relational reasoning in vision-language models.
This paper introduces PRISM, a framework that integrates Vision-Language Models and Large Language Models through a dynamic question-answering pipeline to improve sequential decision-making in embodied AI tasks.
This paper identifies a failure mode called Entity Identity Confusion in multimodal knowledge editing, where models incorrectly bind image-entity relationships. It introduces EC-Bench to diagnose this issue and proposes mitigation strategies for faithful editing.
Modal engineers profiled SGLang's scheduler on multimodal VLM workloads and found that replacing expensive GPU memory bookkeeping with a simple Python dict cache improved throughput by 16% and reduced latency by over 13%, with the fix merged into SGLang v0.5.10.
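The general pattern behind that fix, replacing repeated expensive bookkeeping with a plain dict lookup keyed on the request, can be sketched as below. This is a generic illustration of the caching pattern, not SGLang's actual scheduler code; the class and method names are hypothetical.

```python
class EmbeddingCache:
    """Memoize expensive per-item results in a plain Python dict."""

    def __init__(self):
        self._cache = {}   # key -> cached result
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        # Dict lookup is O(1); only pay the expensive path on a miss.
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = compute()
        self._cache[key] = result
        return result
```

In a multimodal serving loop, `key` might be a hash of the image bytes, so repeated requests over the same image skip the expensive bookkeeping path entirely.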
FreshPER introduces a freshness-aware prioritized experience replay method for LLM/VLM reinforcement learning that addresses the 'priority staleness' problem by applying exponential age decay to stored priorities, enabling off-policy reuse of trajectories. Evaluated on eight agentic, reasoning, and math tasks, FreshPER significantly outperforms on-policy baselines with gains up to +367% on Sokoban.
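Exponential age decay of stored priorities can be sketched as follows; this is a minimal interpretation of the summary above (the half-life parameterization and sampling helper are assumptions, not FreshPER's actual implementation).

```python
import math
import random

def decayed_priority(base_priority, age_steps, half_life=100):
    # Exponential age decay: a trajectory's replay priority halves
    # every `half_life` steps since it was stored.
    return base_priority * math.exp(-math.log(2) * age_steps / half_life)

def sample_index(priorities, rng):
    # Proportional (roulette-wheel) sampling over decayed priorities.
    total = sum(priorities)
    r = rng.random() * total
    acc = 0.0
    for i, p in enumerate(priorities):
        acc += p
        if r <= acc:
            return i
    return len(priorities) - 1
```

Fresh trajectories dominate sampling while stale ones fade smoothly instead of being discarded, which is what enables safe off-policy reuse.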
This paper introduces SynopticBench, a dataset of 1.3M+ weather forecast discussions paired with meteorological images, and SPACE, a novel evaluation framework for assessing VLM-generated weather forecasts.
LlamaIndex has revamped its website and reaffirmed its core mission of AI-powered document OCR, with offerings including commercial product LlamaParse and open-source tools LiteParse and ParseBench. LlamaParse uses VLM-powered agentic document understanding to handle complex layouts, tables, charts, and handwritten text at scale.
GIST is a multimodal knowledge extraction pipeline that transforms mobile point cloud data into semantically annotated navigation topologies for dense environments, enabling semantic search, localization, and natural language routing with 80% navigation success rates in real-world evaluation.
This paper investigates prompt-induced hallucinations in vision-language models through mechanistic analysis, identifying specific attention heads responsible for the models' tendency to favor textual prompts over visual evidence. The authors demonstrate that ablating these PIH-heads reduces hallucinations by at least 40% without additional training, revealing model-specific mechanisms underlying this failure mode.
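Head ablation of the kind described above amounts to zeroing the selected heads' contributions before the per-head outputs are concatenated. A minimal sketch, with assumed shapes and no claim to match the paper's implementation:

```python
import numpy as np

def multi_head_output(head_outputs, ablate=()):
    # head_outputs: (n_heads, seq, d_head).
    # Zero the ablated heads, then concatenate heads along the feature axis.
    out = head_outputs.copy()
    for h in ablate:
        out[h] = 0.0
    n_heads, seq, d_head = out.shape
    return out.transpose(1, 0, 2).reshape(seq, n_heads * d_head)
```

In a real model one would hook the attention module and zero the identified PIH-heads at inference time; no retraining is involved, matching the training-free claim in the summary.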
This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.
TTL introduces a test-time textual learning framework for OOD detection using pretrained vision-language models like CLIP, which dynamically learns OOD semantics from unlabeled test streams without external OOD labels. The method uses pseudo-labeled samples and an OOD knowledge purification strategy to improve detection robustness across diverse and evolving OOD distributions.
HyperGVL introduces the first benchmark for evaluating Large Vision-Language Models on hypergraph understanding and reasoning, featuring 84,000 QA samples across 12 tasks and real-world applications. The paper also proposes WiseHyGR, a generalizable router that enhances LVLM performance through adaptive hypergraph representations.
The PSRD framework halves multimodal hallucination in LVLMs by using phase-wise self-reward decoding and a distilled lightweight reward model, without requiring extra supervision.
MedFocusLeak introduces the first transferable black-box adversarial attack on medical vision-language models, using imperceptible background perturbations to mislead clinical diagnoses across six imaging modalities.
EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.
Switch-KD proposes a novel visual-switch knowledge distillation framework for efficiently compressing vision-language models by unifying multimodal knowledge transfer within a shared text-probability space. The method achieves 3.6-point average improvement across 10 multimodal benchmarks when distilling a 0.5B TinyLLaVA student from a 3B teacher model.
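Distilling in a shared text-probability space typically means matching the student's token distribution to the teacher's, e.g. via a temperature-softened KL divergence. The sketch below shows that standard formulation; it is a generic distillation loss under assumed conventions, not Switch-KD's specific visual-switch mechanism.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = z / temperature
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) over the shared text-token distribution;
    # zero when the two distributions match exactly.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

Because both teacher and student emit distributions over the same text vocabulary, visual knowledge transfers through this single space without needing to align intermediate visual features.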
RadAgent is a tool-using AI agent that generates chest CT reports through interpretable step-by-step reasoning, improving clinical accuracy by a relative 36.4% and achieving 37% faithfulness, a capability absent in existing 3D vision-language models. The system provides fully inspectable reasoning traces, allowing clinicians to validate and refine diagnostic outputs.
This paper proposes Slipform, a training framework that uses lexical concreteness to select harder negatives and a margin-based Cement loss, boosting compositional reasoning in vision-language models.
This paper introduces Anthropogenic Regional Adaptation, a paradigm for optimizing vision-language models to specific regional contexts while maintaining global generalization. The authors propose GG-EZ, an adaptation method using regional data filtering and model merging, demonstrating 5-15% improvements in cultural relevance for Southeast Asia across three VL architectures.