The paper introduces the Latent Bridge, a trainable continuous channel that couples a slow reasoning VLM (Qwen3-VL-8B-Thinking) and a fast reactive VLM (MiniCPM-o 4.5) for real-time game agents. Experiments on Atari games and MetaDrive show it matches or outperforms the text-based bridge while avoiding destructive interference when used alone.
Physics Question Scene Graph (PQSG) is a hierarchical question-based pipeline using VLMs to evaluate video generation models' physical plausibility with fine-grained violation detection. It introduces the FinePhyEval dataset and shows higher correlation with human judgments than prior work.
This paper introduces CF-World, a counterfactual benchmark to evaluate whether text-to-image models rely on causal reasoning or mere pattern matching. Experiments show all models degrade sharply in counterfactual settings, suggesting their understanding is limited to tightly coupled visual-textual patterns rather than genuine causal reasoning.
Semantic Browsing introduces a method for controlled diversity in text-to-image generation by using a Vision Language Model with an agentic workflow to generate structured, interpretable variations based on semantic decisions.
Loft Orbital's YAM-9 satellite runs Google's Gemma 3 vision-language model onboard for real-time image analysis, reducing downlink bandwidth and latency by deciding what data to send to Earth.
Inspired by Gemma 4 12B, researchers trained a vision-language model without a vision encoder for only $100, achieving a 30% reduction in end-to-end latency on an M3 Pro MacBook.
NAVI-Orbital demonstrates the first in-orbit deployment of a zero-shot vision-language model (Gemma 3) on a LEO satellite, enabling autonomous scene classification and semantic compression of Earth observation data without fine-tuning.
FinAcumen is a framework that accumulates reasoning experience from prior trajectories into a persistent memory bank for financial multimodal reasoning, improving performance across four benchmarks while maintaining a frozen 8B vision-language model.
A test of the open-weight MiniMax M3 model using MLX-VLM on a Mac Studio shows it can autonomously fill out a US customs form from a driver's license photo and a scanned document, using tool calls for fields, checkboxes, and signature.
The YAM-9 satellite used Google DeepMind's Gemma 3 vision-language model in orbit to autonomously identify areas of interest based on natural language queries, marking the first reported use of a VLM in space and signaling a shift toward more autonomous satellite operations.
LLaVA-OneVision-2 introduces codec-stream tokenization for efficient video understanding, significantly outperforming Qwen3-VL-8B on temporal and spatial benchmarks. The model, data, and code are open-sourced.
SciOrch presents an 8B vision-language model trained with MCTS to coordinate multiple expert LLMs for multimodal scientific reasoning, achieving superior performance while reducing API costs.
This paper from Meta and Carnegie Mellon presents a multi-modal vision-language model pipeline for detecting AI-generated content on social media, achieving state-of-the-art performance and positive downstream impacts on user engagement.
This paper introduces a self-evolving framework for vision-language models to improve their question-generation capabilities without external supervision, enhancing both question quality and answerer performance.
This paper presents Architect-Ant, an editable automatic furnishing framework for architectural floor plans, together with a curated dataset (AntPlan-270) of 270 floor plans with furniture annotations. The method uses a fine-tuned vision-language model and a domain-specific language to generate geometrically valid and functionally plausible furniture layouts that can be rasterized into blueprint-style images.
A scalable framework combines self-distillation and reinforcement learning to transfer task-solving abilities from vision-language models to video diffusion models without requiring labeled task-video data.
OmniGameArena introduces a unified benchmark for evaluating VLM agents in diverse Unreal Engine 5 game environments, featuring an Improvement Dynamics Curve for tracking skill evolution across reflection rounds.
VoLoAgent integrates vision-language models with robot capabilities for open-vocabulary long-horizon manipulation tasks, introducing a physical orchestrator that plans, monitors, and recovers using interruptible tools, and a benchmark called RoboVoLo for evaluation.
A visual guide explains the full architecture of Gemma 4 12B, covering how it handles text, images, and audio without separate vision and audio encoders.
This paper introduces Structured Defect Grounding (SDG), a method that models text-to-image defects as structured (location, type, reason, importance) tuples and uses VLMs to detect them. It also contributes SDG-30K, a 30K-image dataset, and BoxFlow-GRPO, a diagnosis-to-alignment framework.