Tag
This paper introduces REVEAL++, a differentiable phenotypic grouping method for vision-language contrastive learning. Applied to retinal fundus images and clinical risk narratives for Alzheimer's disease risk prediction, it outperforms discrete grouping baselines.
Researchers introduce T-Rex, a framework that integrates vision, language, and tactile sensing, enabling robots to respond to physical contact in real time rather than relying solely on vision.
DeepSeek announces a new vision capability, likely a vision-language model, expanding its AI offerings.
This paper investigates parameter-efficient strategies for adapting large language models to 3D CT report generation, introducing RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that keeps the LLM frozen and requires minimal trainable parameters. It shows that freezing larger LLMs (roughly 1B parameters and above) and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency.
This paper introduces SR-REAL, a unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning via reinforcement learning, enabling robust multi-step spatial reasoning across diverse tasks.
UniDDT proposes a decoupled diffusion transformer framework that unifies multimodal understanding and generation by leveraging a Noisy ViT encoder and an LLM for semantic encoding, achieving strong performance on both tasks.
OpenMedQ is a fully open medical vision-language model pretrained on 14 datasets (~3.35M samples), achieving state-of-the-art results on medical VQA and classification benchmarks.
This paper presents JoyAI-VL-Interaction, an open-source 8B-scale vision-language model that operates continuously in real-time, deciding autonomously when to respond or delegate. It includes a complete deployable system and a training recipe, outperforming Doubao and Gemini in human evaluations.
This paper introduces Multimodal Multi-Dimensional Scalarization Process Reward Modeling (MMS-PRM), which enforces robustness on the worst-performing dimension in multimodal reasoning so that failures such as visual hallucinations are not masked by strong textual logic.
Embodied-R1.5 is a unified embodied foundation model that achieves state-of-the-art performance on 16 out of 24 embodied vision-language benchmarks using multi-task balanced reinforcement learning. It introduces a Planner-Grounder-Corrector closed-loop framework for long-horizon tasks and is open-sourced to facilitate future research.
ARM presents a unified autoregressive framework for image understanding, generation, and editing using discrete semantic tokenization and reinforcement learning optimization, showing cross-task synergy.
AsyncWebRL introduces an asynchronous multi-step reinforcement learning system for vision-language web agents. It replaces per-trajectory normalization with a constant to reduce trajectory-length inefficiency, achieving up to a 2.9x training speedup and setting a new state of the art on WebGym.
Liquid AI released LFM2.5-VL-1.6B-Extract and LFM2.5-VL-450M-Extract, vision-language models that output structured JSON from images and field lists. The models are open-weight and available in two sizes.
Struct-Searcher introduces a structural agentic workflow, grounded in belief revision theory, for multimodal deep information seeking, achieving significant accuracy improvements over existing vision-language models and deep research agents.
This paper introduces KODA (Kernel Optimization for Discrepancy Analysis), a kernel-based framework for comparing and aligning vision-language model representations by identifying sample subsets that are clustered differently across models such as CLIP, SigLIP, and BLIP. The method uses contrastive embedding clustering and randomized low-dimensional approximations to scale to large datasets while exposing interpretable structural differences between representations.
This paper proposes a query-based cross-modal projector that compresses visual tokens via cross-attention to improve Mamba-based multimodal LLMs, boosting both performance and throughput on vision-language benchmarks while eliminating the need for manual 2D scan order design.
This paper introduces Fine-grained Fragment Retrieval (FFR), a new task for locating semantically coherent multimodal fragments (text and images) within long-form dialogues. The authors propose F2RVLM, a generation-based retrieval model trained with reinforcement learning, and FFRS, a two-stage retrieval system, along with a new dataset, MLDR, for evaluation.
Researchers introduce Curation-Bench, a benchmark to evaluate whether generalist coding agents can automate the iterative data curation loop in AI development. Results show agents can match strong baselines within ten iterations, but reliable data research requires scaffolded method adaptation rather than open-ended prompting alone.
ToolGate is a lightweight external controller that predicts whether to execute or skip perceptual tool calls in vision-language agents, reducing token cost to 64–69% of baseline while preserving accuracy in cross-domain settings.
MapAgent is an industrial-grade agentic framework that combines vision-language processing with constraint-aware reasoning to automatically produce specification-compliant lane-level maps, achieving over 95% automation in Baidu Maps for more than 360 cities.