vision-language-model

#vision-language-model

The Latent Bridge: A Continuous Slow-Fast Channel for Real-Time Game Agents

arXiv cs.AI · 2d ago

The paper introduces the Latent Bridge, a trainable continuous channel that couples a slow reasoning VLM (Qwen3-VL-8B-Thinking) and a fast reactive VLM (MiniCPM-o 4.5) for real-time game agents. Experiments on Atari games and MetaDrive show it matches or outperforms the text-based bridge while avoiding destructive interference when used alone.

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

Hugging Face Daily Papers · 2d ago

Physics Question Scene Graph (PQSG) is a hierarchical question-based pipeline using VLMs to evaluate video generation models' physical plausibility with fine-grained violation detection. It introduces the FinePhyEval dataset and shows higher correlation with human judgments than prior work.

Are Text-to-Image Models Inductivist Turkeys? A Counterfactual Benchmark for Causal Reasoning

Hugging Face Daily Papers · 3d ago

This paper introduces CF-World, a counterfactual benchmark to evaluate whether text-to-image models rely on causal reasoning or mere pattern matching. Experiments show all models degrade sharply in counterfactual settings, suggesting their understanding is limited to tightly coupled visual-textual patterns rather than genuine causal reasoning.

Semantic Browsing: Controllable Diversity for Image Generation

Hugging Face Daily Papers · 4d ago

Semantic Browsing introduces a method for controlled diversity in text-to-image generation: a vision-language model in an agentic workflow generates structured, interpretable variations based on semantic decisions.

A satellite is now running Google's Gemma 3 vision-language model in orbit, doing onboard inference instead of downlinking everything first

Reddit r/singularity · 6d ago

Loft Orbital's YAM-9 satellite runs Google's Gemma 3 vision-language model onboard for real-time image analysis, reducing downlink bandwidth and latency by deciding what data to send to Earth.

@andimarafioti: Can a VLM see without a vision encoder? We trained one for $100, inspired by Gemma 4 12B. Latency on an M3 Pro MacBook:…

X AI KOLs Timeline · 2026-06-18

Researchers trained a vision-language model with no vision encoder, inspired by Gemma 4 12B, for only $100 and report a 30% reduction in end-to-end latency on an M3 Pro MacBook.

NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation

arXiv cs.AI · 2026-06-18

NAVI-Orbital demonstrates the first in-orbit deployment of a zero-shot vision-language model (Gemma 3) on a LEO satellite, enabling autonomous scene classification and semantic compression of Earth observation data without fine-tuning.

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

arXiv cs.AI · 2026-06-17

FinAcumen is a framework that accumulates reasoning experience from prior trajectories into a persistent memory bank for financial multimodal reasoning, improving performance across four benchmarks while maintaining a frozen 8B vision-language model.

@atomic_chat_hq: Open-weight MiniMax M3 filled out a US customs form from a driver's license photo. For this test we deployed MiniMax M3 …

X AI KOLs Timeline · 2026-06-15

A test of the open-weight MiniMax M3 model using MLX-VLM on a Mac Studio shows it can autonomously fill out a US customs form from a driver's license photo and a scanned document, using tool calls for fields, checkboxes, and signature.

A satellite just learned to find things on its own — here’s what that means

TechCrunch AI · 2026-06-15

A satellite called YAM-9 used Google DeepMind's Gemma 3 vision-language model in orbit to autonomously identify areas of interest based on natural language queries, marking the first reported use of a VLM in space and signaling a shift toward more autonomous satellite operations.

@jiqizhixin: What if your AI could “see” video like a streaming codec—spending tokens only on the most important moments? Introducin…

X AI KOLs Timeline · 2026-06-15

LLaVA-OneVision-2 introduces codec-stream tokenization for efficient video understanding, significantly outperforming Qwen3-VL-8B on temporal and spatial benchmarks. The model, data, and code are open-sourced.

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Hugging Face Daily Papers · 2026-06-14

SciOrch presents an 8B vision-language model trained with MCTS to coordinate multiple expert LLMs for multimodal scientific reasoning, achieving superior performance while reducing API costs.

Detecting AI-Generated Content on Social Media with Multi-modal Language Models

arXiv cs.CL · 2026-06-11

This paper from Meta and Carnegie Mellon presents a multi-modal vision-language model pipeline for detecting AI-generated content on social media, achieving state-of-the-art performance and positive downstream impacts on user engagement.

Self-Evolving Visual Questioner

Hugging Face Daily Papers · 2026-06-11

This paper introduces a self-evolving framework for vision-language models to improve their question-generation capabilities without external supervision, enhancing both question quality and answerer performance.

Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

arXiv cs.AI · 2026-06-10

This paper presents Architect-Ant, an editable automatic furnishing framework for architectural floor plans, together with a curated dataset (AntPlan-270) of 270 floor plans with furniture annotations. The method uses a fine-tuned vision-language model and a domain-specific language to generate geometrically valid and functionally plausible furniture layouts that can be rasterized into blueprint-style images.

World Model Self-Distillation: Training World Models to Solve General Tasks

Hugging Face Daily Papers · 2026-06-10

A scalable framework combines self-distillation and reinforcement learning to transfer task-solving abilities from vision-language models to video diffusion models without requiring labeled task-video data.

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Hugging Face Daily Papers · 2026-06-08

OmniGameArena introduces a unified benchmark for evaluating VLM agents in diverse Unreal Engine 5 game environments, featuring an Improvement Dynamics Curve for tracking skill evolution across reflection rounds.

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

Hugging Face Daily Papers · 2026-06-05

VoLoAgent integrates vision-language models with robot capabilities for open-vocabulary long-horizon manipulation tasks, introducing a physical orchestrator that plans, monitors, and recovers using interruptible tools, and a benchmark called RoboVoLo for evaluation.

@_philschmid: We released Gemma 4 12B yesterday. Here is a visual guide that explains the full architecture. → How encoders typically…

X AI KOLs Following · 2026-06-04

A visual guide explaining the full architecture of Gemma 4 12B, covering how it handles text, images, and audio without separate vision or audio encoder models.

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Hugging Face Daily Papers · 2026-06-04

This paper introduces Structured Defect Grounding (SDG), a method that models text-to-image defects as structured (location, type, reason, importance) tuples and uses VLMs to detect them. It also contributes SDG-30K, a 30K-image dataset, and BoxFlow-GRPO, a diagnosis-to-alignment framework.
