vision-language

#vision-language

REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer's Disease Risk

arXiv cs.AI · 4d ago

This paper introduces REVEAL++, a differentiable phenotypic grouping method for vision-language contrastive learning, applied to retinal fundus images and clinical risk narratives for Alzheimer's disease risk prediction, outperforming discrete grouping baselines.

#vision-language

Researchers introduce T-Rex, a framework that unifies vision, language, and tactile sensing so robots can respond to physical contact in real time rather than relying on vision alone

Reddit r/singularity · 4d ago

Researchers introduced T-Rex, a framework that integrates vision, language, and tactile sensing, enabling robots to respond to physical contact in real time rather than relying solely on vision.

#vision-language

DeepSeek Introduces Vision

Hacker News Top · 6d ago

DeepSeek announces a new vision capability, likely a vision-language model, expanding its AI offerings.

#vision-language

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

arXiv cs.CL · 2026-06-17

This paper investigates parameter-efficient strategies for adapting large language models to 3D CT report generation, introducing RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that keeps the LLM frozen and requires minimal trainable parameters. It shows that freezing larger LLMs (~1B+) and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency.

#vision-language

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Hugging Face Daily Papers · 2026-06-16

This paper introduces SR-REAL, a unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning via reinforcement learning, enabling robust multi-step spatial reasoning across diverse tasks.

#vision-language

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Hugging Face Daily Papers · 2026-06-15

UniDDT proposes a decoupled diffusion transformer framework that unifies multimodal understanding and generation by using a Noisy ViT encoder and an LLM for semantic encoding, achieving strong performance on both tasks.

#vision-language

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

arXiv cs.AI · 2026-06-12

OpenMedQ is a fully open medical vision-language model pretrained on 14 datasets (~3.35M samples), achieving state-of-the-art results on medical VQA and classification benchmarks.

#vision-language

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Hugging Face Daily Papers · 2026-06-10

This paper presents JoyAI-VL-Interaction, an open-source 8B-scale vision-language model that operates continuously in real time, deciding autonomously when to respond or delegate. It includes a complete deployable system and a training recipe, and it outperforms Doubao and Gemini in human evaluations.

#vision-language

Improving Multimodal Reasoning via Worst Dimension Optimization

arXiv cs.AI · 2026-06-09

This paper introduces Multimodal Multi-Dimensional Scalarization Process Reward Modeling (MMS-PRM), which optimizes the worst-scoring dimension in multimodal reasoning so that failures such as visual hallucinations are not masked by strong textual logic.

#vision-language

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Hugging Face Daily Papers · 2026-06-09

Embodied-R1.5 is a unified embodied foundation model that achieves state-of-the-art performance on 16 out of 24 embodied vision-language benchmarks using multi-task balanced reinforcement learning. It introduces a Planner-Grounder-Corrector closed-loop framework for long-horizon tasks and is open-sourced to facilitate future research.

#vision-language

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

Hugging Face Daily Papers · 2026-06-09

ARM presents a unified autoregressive framework for image understanding, generation, and editing using discrete semantic tokenization and reinforcement learning optimization, showing cross-task synergy.

#vision-language

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

arXiv cs.LG · 2026-06-05

AsyncWebRL introduces an asynchronous multi-step reinforcement learning system for vision-language web agents. It replaces per-trajectory normalization with a constant to reduce inefficiency tied to trajectory length, achieving up to a 2.9x training speedup and setting a new state of the art on WebGym.

#vision-language

@liquidai: Introducing LFM2.5-VL-1.6B-Extract and LFM2.5-VL-450M-Extract: Vision-language models that return structured JSON, not …

X AI KOLs Timeline · 2026-06-05

Liquid AI released LFM2.5-VL-1.6B-Extract and LFM2.5-VL-450M-Extract, vision-language models that output structured JSON from images and field lists. The models are open-weight and available in two sizes.

#vision-language

Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

Hugging Face Daily Papers · 2026-06-05

Struct-Searcher introduces a structural agentic workflow, grounded in belief revision theory, for multimodal deep information seeking. It achieves significant accuracy improvements over existing vision-language models and deep research agents.

#vision-language

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

arXiv cs.LG · 2026-06-04

This paper introduces KODA (Kernel Optimization for Discrepancy Analysis), a kernel-based framework for comparing and aligning vision-language model representations by identifying sample subsets that are clustered differently across models like CLIP, SigLIP, and BLIP. The method uses contrastive embedding clustering and randomized low-dimensional approximations to scale to large datasets while providing interpretable structural differences between representations.

#vision-language

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

arXiv cs.CL · 2026-06-04

This paper proposes a query-based cross-modal projector that compresses visual tokens via cross-attention to improve Mamba-based multimodal LLMs, boosting both performance and throughput on vision-language benchmarks while eliminating the need for manual 2D scan order design.

#vision-language

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

arXiv cs.CL · 2026-06-04

This paper introduces Fine-grained Fragment Retrieval (FFR), a new task for locating semantically coherent multi-modal fragments (text and images) within long-form dialogues. The authors propose F2RVLM, a generation-based retrieval model trained with reinforcement learning; FFRS, a two-stage retrieval system; and MLDR, a new dataset for evaluation.

#vision-language

Can Generalist Agents Automate Data Curation?

arXiv cs.AI · 2026-06-04

Researchers introduce Curation-Bench, a benchmark to evaluate whether generalist coding agents can automate the iterative data curation loop in AI development. Results show agents can match strong baselines within ten iterations, but reliable data research requires scaffolded method adaptation rather than open-ended prompting alone.

#vision-language

ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

arXiv cs.AI · 2026-06-03

ToolGate is a lightweight external controller that predicts whether to execute or skip perceptual tool calls in vision-language agents, reducing token cost to 64–69% of baseline while preserving accuracy in cross-domain settings.

#vision-language

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Hugging Face Daily Papers · 2026-06-03

MapAgent is an industrial-grade agentic framework that combines vision-language processing with constraint-aware reasoning to automatically produce specification-compliant lane-level maps, achieving over 95% automation in Baidu Maps for more than 360 cities.
