visual-reasoning

#visual-reasoning

Thinking with Visual Grounding

Hugging Face Daily Papers · 2026-06-15

This paper introduces visually grounded thinking, a method for vision-language models to interleave natural-language reasoning with explicit visual evidence grounding using points or boxes. A scalable synthesis pipeline and grounding-aware reinforcement learning improve reasoning accuracy, enabling a 4B model to match or surpass a 27B model on spatial and counting benchmarks.

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Hugging Face Daily Papers · 2026-06-08

Visual Para-Thinker++ proposes a single-policy multi-agent framework for visual reasoning that uses role-conditioned agents (Main, Worker, Summary) and dedicated training methods to reduce hallucinations and improve efficiency, outperforming baselines on hallucination-sensitive benchmarks.

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Hugging Face Daily Papers · 2026-06-06

This paper identifies that failures in visual reasoning often stem from breakdowns in dynamic cross-modal coordination between visual and textual evidence during chain-of-thought generation. It introduces DyCo-RL, a reinforcement learning framework that rewards effective cross-modal coordination, leading to improved reasoning performance.

Differentiable Efficient Operator Search

arXiv cs.LG · 2026-06-05

Introduces Efficient Operator Search (EOS), a unified differentiable framework that generalizes token reduction methods (pruning, merging, pooling, adaptive reweighting) into a shared operator space, automatically searching for optimal operator compositions under budget constraints. The method achieves competitive results across benchmarks and reveals consistent operator patterns.

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

arXiv cs.AI · 2026-06-04

VAMPS is a new benchmark of 1,168 multimodal bilingual math problems designed to evaluate whether LLMs can benefit from constructing and reasoning over graphs/visualizations. Key finding: direct analytical solving surprisingly outperforms tool-enabled visual solving even on problems where plotting is a natural strategy.

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

Hugging Face Daily Papers · 2026-06-01

TRON introduces a scalable online environment for visual reasoning reinforcement learning that generates unlimited diverse training instances with verifiable answers, showing consistent performance improvements across multiple multimodal benchmarks.

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Hugging Face Daily Papers · 2026-05-28

VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.

Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

Hugging Face Daily Papers · 2026-05-25

This paper proposes MARS, a mono-anchored multi-source reasoning framework that uses dynamic anchors to quantify information gain and regulate modality interactions during reinforcement learning with verifiable rewards. It reports performance gains of 3.2% and 4.9% when applied to GRPO and DAPO, respectively, across diverse datasets.

ETCHR: Editing To Clarify and Harness Reasoning

Hugging Face Daily Papers · 2026-05-22

ETCHR is a novel image editing approach that decouples visual reasoning from image generation, using a two-stage training process (Reasoning Imitation and Reasoning Enhancement) to improve multimodal language model performance across five visual reasoning tasks. It achieves consistent gains of 4-5% Pass@1 on models like Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5.

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Hugging Face Daily Papers · 2026-05-14

ATLAS presents a visual reasoning framework that combines agentic operations and latent representations using functional tokens, enabling efficient training via next-token prediction and reinforcement learning while avoiding intermediate image generation.

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

arXiv cs.CL · 2026-05-11

This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Hugging Face Daily Papers · 2026-05-11

This paper introduces On-Policy Data Evolution (ODE) and a visual-native agent harness to improve multimodal deep search agents. By enabling reusable visual evidence and closed-loop data generation, ODE significantly boosts the performance of Qwen3-VL agents across multiple benchmarks, surpassing Gemini 2.5 Pro.

The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation

arXiv cs.CL · 2026-05-08

This paper identifies and formalizes 'recorruption' in multimodal RAG, where adding accurate context causes models to abandon correct predictions due to attentional collapse (visual blindness and positional bias). The authors propose BAIR, a parameter-free inference-time framework that restores visual saliency and penalizes textual distractors, improving reliability across medical, fairness, and geospatial benchmarks.

Visual Reasoning through Tool-supervised Reinforcement Learning

Hugging Face Daily Papers · 2026-04-21

Introduces ToolsRL, a two-stage reinforcement learning framework that teaches multimodal LLMs to use simple visual tools for complex visual reasoning tasks.

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Hugging Face Daily Papers · 2026-04-17

This paper shows that Chain-of-Thought prompting harms visual-spatial reasoning in multimodal LLMs, attributing the degradation to shortcut learning and to models hallucinating visual details from text alone.

Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Hugging Face Daily Papers · 2026-04-16

AVR is an adaptive visual reasoning framework that dynamically selects optimal reasoning formats to reduce token usage by 50-90% while maintaining accuracy in visual reasoning tasks. The method addresses reasoning path redundancy by decomposing visual reasoning into three cognitive functions and using FS-GRPO training to encourage efficient format selection.

Boosting Visual Instruction Tuning with Self-Supervised Guidance

Hugging Face Daily Papers · 2026-04-14

This paper proposes augmenting visual instruction tuning in multimodal language models with self-supervised tasks expressed as natural language instructions, improving vision-centric reasoning without additional architecture or annotations. By reformulating classical self-supervised pretext tasks as image-instruction-response triplets, the method achieves consistent performance improvements across multiple benchmarks by injecting only 3-10% visually grounded instructions into the training data.

A better method for planning complex visual tasks

MIT News — Artificial Intelligence · 2026-03-11

MIT researchers developed VLMFP, a two-stage generative AI approach that combines vision-language models with formal planning software. It achieves a 70% success rate on complex visual planning tasks such as robot navigation, nearly 2.3x better than existing baselines. The method automatically translates visual scenarios into planning files that classical solvers can process, enabling effective long-horizon planning in novel environments.

Thinking with images

OpenAI Blog · 2025-04-16

OpenAI releases o3 and o4-mini models that can reason with images in their chain-of-thought process, enabling visual understanding through native image manipulation tools like cropping and zooming without separate specialized models. These models achieve state-of-the-art performance on multimodal benchmarks including STEM questions, chart reading, and visual search tasks.