From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models

arXiv cs.CL 06/26/26, 04:00 AM Papers
Summary
This survey paper systematically reviews the paradigm evolution of unified vision-language perception in multimodal large language models (MLLMs), proposing a five-stage taxonomy and identifying open challenges toward general multimodal intelligence.
arXiv:2606.26196v1 Announce Type: new
# From Structure to Synergy: A Survey of Vision-Language Perception Paradigm Evolution in Multimodal Large Language Models
Source: [https://arxiv.org/html/2606.26196](https://arxiv.org/html/2606.26196)
###### Abstract

Multimodal Large Language Models \(MLLMs\) have recently made remarkable progress in unifying vision\-language understanding and reasoning, especially following the introduction of models such as OpenAI’s O\-series and DeepSeek’s R\-series, which have driven a paradigm shift toward perception\-centric intelligence\. However, there remains a lack of systematic surveys that examine perception from a truly unified vision\-language perspective—one that treats vision and language as an inseparable modality\. Existing reviews are often fragmented, focusing separately on either vision or language, and thus rarely capture the cross\-modal evolution of perception as an integrated capability\. To bridge this gap, we present the first systematic survey of unified vision\-language perception in MLLMs\. Specifically, we \(1\) formalize MLLM perception as an intrinsic, unified vision\-language capability analogous to human innate perception, \(2\) introduce a five\-stage taxonomy tracing the paradigm evolution of MLLM perception and survey representative methods and milestones at each phase, and \(3\) identify open challenges and outline promising research directions toward truly general, unified multimodal intelligence\. We hope our study will provide both a foundational understanding and an actionable roadmap to foster further innovation on the path toward artificial general intelligence \(AGI\)\.

###### Keywords

Multimodal Large Language Models, Perception, Vision\-Language Models, Unified Framework

Journal: Information Fusion

Affiliations:

- 1\. School of Computer Science, Sichuan University, No\. 24 South Section 1, Yihuan Road, Chengdu 610065, Sichuan, China
- 2\. School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, 2199 Lishui Road, Shenzhen 518055, Guangdong, China
- 3\. Institute of Artificial Intelligence \(TeleAI\), China Telecom and Northwestern Polytechnical University, 127 Youyi West Road, Xi’an 710072, Shaanxi, China

## 1 Introduction

*“Semantic knowledge transforms the sensory cacophony into a symphony of meaning\.”*

—*Matthew A\. Lambon Ralph*

In recent years, Multimodal Large Language Models \(MLLMs\)¹ have evolved from a pure text\-processing paradigm to unified vision\-language understanding and reasoning, significantly advancing cross\-modal tasks such as image captioning \[[156](https://arxiv.org/html/2606.26196#bib.bib245),[13](https://arxiv.org/html/2606.26196#bib.bib246)\], visual grounding \[[194](https://arxiv.org/html/2606.26196#bib.bib201),[71](https://arxiv.org/html/2606.26196#bib.bib64)\], and multimodal question answering \[[96](https://arxiv.org/html/2606.26196#bib.bib194),[109](https://arxiv.org/html/2606.26196#bib.bib248)\]\.

¹ In this paper, “MLLM” is used as a unified term encompassing Large Language Models \(LLMs\), Large Vision\-Language Models \(LVLMs\), and multimodal LLMs\. Our focus is on their unified vision\-language perception capabilities\.

The perception capability of MLLMs, that is, the ability to extract, localize, and reason about structured semantic information from raw visual signals, serves as the cornerstone for vision\-language understanding and reasoning\. It underpins applications ranging from classic computer vision benchmarks \[[194](https://arxiv.org/html/2606.26196#bib.bib201),[88](https://arxiv.org/html/2606.26196#bib.bib202)\] to open\-ended, human\-centric tasks \[[175](https://arxiv.org/html/2606.26196#bib.bib86),[17](https://arxiv.org/html/2606.26196#bib.bib199)\]\. This capability enables MLLMs to see, comprehend, and interact with multimodal information in a human\-like manner \[[124](https://arxiv.org/html/2606.26196#bib.bib249)\]\.

Despite remarkable advances and rapid evolution in MLLMs, there remains a lack of systematic surveys that approach these developments from the perspective of perception\. Existing reviews are typically siloed, focusing on either vision\-centric \[[130](https://arxiv.org/html/2606.26196#bib.bib207),[138](https://arxiv.org/html/2606.26196#bib.bib206)\] or language\-centric \[[166](https://arxiv.org/html/2606.26196#bib.bib208),[21](https://arxiv.org/html/2606.26196#bib.bib209)\] tasks, and thus fail to address perception as a unified vision\-language capability\. This fragmented perspective overlooks the paradigm\-level transitions in vision\-language perception within MLLMs, hindering a holistic understanding of their full potential and underscoring the need for an integrated, perception\-focused review\.

To address this gap, our survey provides the first in\-depth and structured analysis of the evolution of vision\-language perception in MLLMs\. We clarify the scope of our review, outline its organization, and summarize our main contributions, aiming to serve as a roadmap for future research in unified vision\-language perception\.

### 1\.1 Scope and Definitions

This section defines the scope of our survey, focusing on how MLLMs perceive images *at the level of specific regions or instances in response to natural\-language queries*\.

Concretely, a method falls within our scope if it satisfies the following:

- 1\. The input includes at least one natural 2D image or video clip\.
- 2\. The core challenge lies in extracting or distinguishing visual evidence \(objects, regions, attributes, relations\), rather than in purely symbolic or mathematical reasoning\.
- 3\. Solving the task requires correctly perceiving specific regions or instances under natural\-language guidance \(e\.g\., referring expression comprehension, visual grounding, and region\-level question answering\)\.

Accordingly, we *explicitly exclude* three categories of work: \(1\) tasks primarily driven by logical or symbolic reasoning rather than perceptual processing \(e\.g\., mathematical problem solving \[[27](https://arxiv.org/html/2606.26196#bib.bib203)\]\); \(2\) methods targeting highly specialized applications in vertical domains \(e\.g\., robotic vision–language control \[[150](https://arxiv.org/html/2606.26196#bib.bib204)\]\); and \(3\) approaches whose gains arise solely from global architectural scaling \(e\.g\., increasing input resolution \[[42](https://arxiv.org/html/2606.26196#bib.bib210)\] or model parameter count \[[91](https://arxiv.org/html/2606.26196#bib.bib211)\]\) without introducing mechanisms that enhance localized perception\.

### 1\.2 Paper Organization

We categorize the evolution of perception capability into five stages: from structure\-driven modular optimization \(Stages I and II\), to dynamic perception paradigms \(Stage III\), and ultimately to unified frameworks that synergize instruction, adaptivity, and reinforcement learning \(Stage IVA and Stage IVB\), offering prospects for the next generation of multimodal intelligence\. For an intuitive overview, the organizational framework of our survey is illustrated in Fig\.[2](https://arxiv.org/html/2606.26196#S2.F2)\.

Specifically, the rest of this survey is organized as follows\. Sec\.[2](https://arxiv.org/html/2606.26196#S2) \(Preliminary\) introduces our five\-stage taxonomy and contrasts our unified perception focus with prior, task\-specific surveys\. Sec\.[3](https://arxiv.org/html/2606.26196#S3) \(Stage I: Encoder\-Centric Optimization\) reviews early efforts to strengthen local perception via Region\-Aware Modules and Integrated Perception Subnetworks\. Sec\.[4](https://arxiv.org/html/2606.26196#S4) \(Stage II: Decoder\-Centric Optimization\) examines how auxiliary decoding shifts perception from region\-level to pixel\-level through Auxiliary Decoders, Multi\-Decoder Architectures, and Specialized Decoding Strategies\. Sec\.[5](https://arxiv.org/html/2606.26196#S5) \(Stage III: Dynamic Perception via Adaptive Processing\) explores adaptive, context\-driven perception across External Tool Scheduling, LLM\-Centric Routing, and Code\-as\-Policy for Perception Tasks\.

Sec\.[6](https://arxiv.org/html/2606.26196#S6) \(Stage IVA: Architecture\-Free Strategies for Perception Enhancement\) delves into two non\-architectural approaches: Instruction\-Based Strategies and RL\-Based Strategies\. Sec\.[7](https://arxiv.org/html/2606.26196#S7) \(Stage IVB: Towards Unified Perception\) provides a forward\-looking perspective on the convergence of instruction, adaptivity, and RL in three parts: Why Unification Matters, Signs of Convergence, and Towards a New Generation of Perception\-Centric Agents\. Finally, Sec\.[8](https://arxiv.org/html/2606.26196#S8) offers a concise summary of our key findings and outlines directions for future research\.

### 1\.3 Main Contributions

- 1\. First systematic survey: To the best of our knowledge, this is the first survey to systematically analyze vision\-language perception in MLLMs as an intrinsic, unified capability, providing an actionable roadmap for future research\.
- 2\. Paradigm taxonomy and comprehensive review: We propose a five\-stage taxonomy of paradigm evolution in MLLM perception and comprehensively review representative works at each stage, highlighting their key innovations\.
- 3\. Emerging challenges and future directions: We provide an in\-depth discussion of current challenges and outline promising directions for future development in perception\-centric multimodal intelligence\.

## 2 Preliminary

### 2\.1 Taxonomy

In this section, we introduce our five\-stage taxonomy that characterizes the evolutionary trajectory of perception capabilities in MLLMs\. This framework is designed to capture the paradigm shift from structure—with targeted, architecture\-driven enhancements—to synergy, where perception emerges from integrated, adaptive, and unified strategies\.

Our taxonomy begins with encoder\-centric, structure\-driven approaches developed for specific scenarios such as simple image understanding tasks in Visual Question Answering \(VQA\), where perception improvements are achieved either by enhancing the internal structure of the vision encoder or by modifying the model architecture to treat the encoder as an integrated perceptual subnetwork\. It then moves to decoder\-centric strategies, which further advance fine\-grained perception by employing specialized decoding designs that increase the granularity of the model’s perceptual capabilities—from region\-level to pixel\-level understanding\. The third stage, dynamic perception, focuses on context\-aware and adaptive processing, enabling models to flexibly interact with images multiple times to acquire richer and more comprehensive information\.

The most recent advances represent two distinct but related directions: architecture\-free strategies, which focus on optimizing perception for specific tasks without explicit architectural changes; and unified perception frameworks, which emphasize the model’s ability to operate in complex, open\-ended scenarios by integrating instruction, adaptivity, and reinforcement learning\. Despite their different emphases, both directions—like the earlier stages—continue to advance local perception capabilities as a central theme\.

Importantly, our taxonomy should not be viewed as a strictly linear progression\. Many of these stages are intrinsically intertwined, with considerable overlap and mutual influence between them\. We categorize each representative work according to its primary contribution to the evolution of perception in MLLMs, rather than its chronological order, to better reflect the conceptual landscape and research focus of the field\. To provide a clear and intuitive overview, we illustrate these five stages and their relationships in Fig\.[1](https://arxiv.org/html/2606.26196#S2.F1)\.

![Refer to caption](https://arxiv.org/html/2606.26196v1/overview.jpg)

Figure 1: An overview of the evolving paradigms for enhancing perception in multimodal large language models\. The arrow indicates the overall trajectory of evolution, moving from structural modifications to synergy\-driven integration\.

### 2\.2 Differences from Related Work

Recent surveys have already reviewed multimodal large language models from multiple perspectives, including architectures, evaluation, perception, and reasoning\. Our work is most closely related to these efforts but differs in both scope and organizing perspective\. In short, they can be grouped into the following categories:

- 1\. Comprehensive surveys on the evolution of MLLMs: Liang et al\. \[[86](https://arxiv.org/html/2606.26196#bib.bib16)\] provide a comprehensive guide to MLLMs and examine their architectures, applications, and impact, covering topics such as training methods, architectural components, and practical applications in various fields\. Caffagni et al\. \[[5](https://arxiv.org/html/2606.26196#bib.bib18)\] review earlier stages of MLLM development and analyze their architectural choices, multimodal alignment strategies, and training techniques\. Similarly, Zhang et al\. \[[208](https://arxiv.org/html/2606.26196#bib.bib19)\] review the evolution of MLLMs over time, focusing on the performance of selected models on mainstream benchmarks and summarizing key training recipes\. These surveys chart the overall evolution of MLLMs from a broad, time\-oriented perspective\. In contrast, our work restricts the scope to the narrower problem of vision–language perception, with a particular focus on region\- or instance\-level understanding under natural\-language guidance\.
- 2\. Task\-specific surveys: Sapkota et al\. \[[130](https://arxiv.org/html/2606.26196#bib.bib207)\] present an in\-depth review of object detection with MLLMs, analyzing how these models leverage natural language and visual features to perform object localization and category recognition, and comparing their adaptability and efficiency to conventional deep detectors\. Shen et al\. \[[138](https://arxiv.org/html/2606.26196#bib.bib206)\] focus on reasoning segmentation for images and videos, summarizing existing methods, benchmarks, and applications across diverse domains\. These surveys are all centered on specific vision tasks and, although illustrative, do not capture how perception modules evolve toward synergistic improvements within a unified vision\-language paradigm\.
- 3\. Chain\-of\-thought and mathematical reasoning surveys: Wang et al\. \[[162](https://arxiv.org/html/2606.26196#bib.bib276)\] present a survey of multimodal chain\-of\-thought \(MCoT\) reasoning, clarifying definitions, summarizing related MCoT methods, and organizing them by application scenarios\. Chen et al\. \[[21](https://arxiv.org/html/2606.26196#bib.bib209)\] focus on long chain\-of\-thought for reasoning LLMs, proposing a taxonomy that distinguishes long versus short CoT, analyzing phenomena such as overthinking and test\-time scaling, and discussing future directions including multimodal extensions\. These works take reasoning as the primary organizing axis and largely treat visual signals as contextual input, whereas our survey centers on how vision\-language perception itself evolves\. Zhou et al\. \[[227](https://arxiv.org/html/2606.26196#bib.bib30)\] deconstruct vision\-language interactive reasoning into a foundational Perception layer for accurate visual information extraction and fine\-grained alignment, and a higher\-order Cognition layer for proactive, multi\-step, goal\-oriented reasoning built upon this perceptual basis\. Guided by this two\-layer framework, their survey analyzes bottlenecks and methods at both layers, with a primary focus on the hierarchical relationship between visual inputs and logical reasoning within interactive observe–think–verify loops\. In contrast, our survey adopts an evolutionary, perception\-centered perspective: we concentrate on paradigm shifts in perception architectures themselves, tracing the trajectory from modular, structure\-driven designs to a unified, synergy\-driven vision\-language perception paradigm\.

Taken together, existing surveys either provide broad overviews of MLLMs, offer vertical reviews of individual perception tasks, or organize the landscape around chain\-of\-thought and interactive reasoning\. By contrast, our survey treats vision–language perception itself as the central organizing axis: we precisely restrict the scope to region\- and instance\-level perception under natural\-language guidance and trace its paradigm shifts across five stages, from encoder/decoder\-centric designs to dynamic and architecture\-free strategies, culminating in unified, synergy\-driven vision–language perception\.

Figure 2: Organization of our five\-stage evolutionary framework\.

## 3 Stage I: Enhancing Perception via Encoder\-Centric Optimization

![Refer to caption](https://arxiv.org/html/2606.26196v1/encoder.jpg)

Figure 3: Encoder\-Centric Optimization Strategies for Multimodal Perception in MLLMs\. \(a\) Region\-aware modules enhance perceptual capability by generating universal ROI proposals within the encoder, performing post\-encoding feature fusion and alignment, or applying task\-specific optimizations at the projector/connector level to better adapt the model to particular visual contexts\. \(b\) Integrated perception subnetworks achieve multi\-level information fusion by incorporating multiple visual expert subnetworks directly within the encoder stage\.

Stage I focuses on enhancing perception by modifying the encoder itself\. Early works pursue encoder\-internal optimizations, introducing components such as region\-aware modules and projector\- or connector\-level adjustments to strengthen fine\-grained, local perception \(§[3\.1](https://arxiv.org/html/2606.26196#S3.SS1)\)\. In parallel, another line of research treats the encoder as an integrated perception subnetwork, aiming to improve perceptual capacity by either expanding the variety of encoders or composing multiple specialized encoder branches \(§[3\.2](https://arxiv.org/html/2606.26196#S3.SS2)\)\. As shown in Fig\.[3](https://arxiv.org/html/2606.26196#S3.F3), these strategies together elevate MLLMs’ perceptual granularity from an image\-level view to region\-level understanding, laying the groundwork for more advanced paradigms that extract, localize, and reason about visual information with finer precision\.

### 3\.1 Region\-Aware Modules

A central challenge in multimodal perception is enabling models to selectively attend to and process meaningful regions within an image, rather than relying solely on global representations\. Region\-aware modules address this by equipping the encoder with mechanisms to enhance local, fine\-grained perception\. We group region\-aware modules into three parts: Universal Region\-of\-Interest Proposal \(§[3\.1\.1](https://arxiv.org/html/2606.26196#S3.SS1.SSS1)\), Query\- or Prompt\-Aware Encoders \(§[3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2)\), and Projector\-level or Connector\-level Optimizations \(§[3\.1\.3](https://arxiv.org/html/2606.26196#S3.SS1.SSS3)\)\.

#### 3\.1\.1 Universal Region\-of\-Interest Proposal

Some works propose universal region\-of\-interest \(ROI\) generation methods \[[46](https://arxiv.org/html/2606.26196#bib.bib214)\] that are decoupled from the large language model, thereby enabling more flexible and generalizable region selection, independent of task\-specific semantics\. MG\-LLaVA \[[222](https://arxiv.org/html/2606.26196#bib.bib3)\] designs a multi\-granularity encoder with separate pathways for different resolutions and object\-level ROIs, with the object\-level ROI generation incorporating both a tagging model \[[217](https://arxiv.org/html/2606.26196#bib.bib216)\] and an open\-vocabulary detector \[[111](https://arxiv.org/html/2606.26196#bib.bib217)\]\. ChatRex \[[61](https://arxiv.org/html/2606.26196#bib.bib4)\] integrates a universal proposal network \[[46](https://arxiv.org/html/2606.26196#bib.bib214)\] within the MLLM, projecting regression\-predicted boxes into the LLM and reframing box prediction as a retrieval task\. Similarly, Groma \[[103](https://arxiv.org/html/2606.26196#bib.bib5)\] integrates a Deformable DETR \[[235](https://arxiv.org/html/2606.26196#bib.bib215)\] as its core region proposal module, and combines this with specialized region tokens to achieve localized visual tokenization\. Apart from that, some works focus on specific scenarios: ASM \[[163](https://arxiv.org/html/2606.26196#bib.bib98)\] performs ROI selection for panoptic visual recognition, followed by a location\-aware tokenizer to encode selected regions\. Artemis \[[122](https://arxiv.org/html/2606.26196#bib.bib219)\] proposes an ROI tracking and selection mechanism for video\-based referring, obtaining a list of boxes that represent spatiotemporally aware ROIs\.
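
To make the idea of box prediction as retrieval concrete, the following is a minimal sketch and not ChatRex's actual implementation. It assumes that every proposal box already has an embedding and that the language model emits a query vector; boxes are then selected by cosine similarity rather than by regressing coordinates. The names `retrieve_boxes`, `proposal_embeddings`, and `threshold` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve_boxes(
    query: Sequence[float],
    proposal_embeddings: Sequence[Sequence[float]],
    boxes: Sequence[Box],
    threshold: float = 0.5,  # assumed cut-off, not taken from any paper
) -> List[Box]:
    """Select proposal boxes whose embedding matches the query, best match first."""
    scored = [
        (cosine(query, emb), box)
        for emb, box in zip(proposal_embeddings, boxes)
    ]
    kept = [(score, box) for score, box in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [box for _, box in kept]
```

The design point this illustrates is that the model only has to choose among candidate boxes produced by a proposal network, so coordinate accuracy is delegated to the detector.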

#### 3\.1\.2 Query\- or Prompt\-Aware Encoders

Several approaches enhance region awareness by making region selection adaptive to various prompts—whether language\- or vision\-based—thus enabling more context\-sensitive and interactive perception\.

Some works focus specifically on visual prompts, designing dedicated modules to encode or fuse visual region cues as part of the input\. To handle visual prompts at the mask level, Osprey \[[201](https://arxiv.org/html/2606.26196#bib.bib14)\] and Finecaption \[[50](https://arxiv.org/html/2606.26196#bib.bib195)\] use a Mask\-Aware Visual Extractor that ingests both the input image and referring masks to obtain mask\-region perception\. VP\-MLLM \[[89](https://arxiv.org/html/2606.26196#bib.bib35)\] introduces a Visual Prompt Encoder to accommodate and recognize various types of visual prompts as input\.

Moving toward free\-form visual prompts, Ferret \[[191](https://arxiv.org/html/2606.26196#bib.bib31)\] introduces a Spatial\-Aware Visual Sampler that enables the acquisition of visual features from regions of arbitrary shape, accommodating varying spatial sparsity and thus supporting the processing of free\-form region inputs\. VPP\-LLaVA \[[146](https://arxiv.org/html/2606.26196#bib.bib36)\] introduces two complementary mechanisms: a global Visual Position Prompt, which overlays a learnable, axis\-like tensor onto the input image to provide structured spatial cues, and a local Visual Position Prompt, which incorporates position\-aware queries to enable fine\-grained localization\.

Beyond these specialized vision\-aware modules, RegionGPT \[[41](https://arxiv.org/html/2606.26196#bib.bib92)\] employs dual encoder branches \(patch\-merge for image\-level features and mask\-pooling for region\-level features\) to improve region specificity, while GPT4ROI \[[212](https://arxiv.org/html/2606.26196#bib.bib93)\] refines spatial instruction tuning by interleaving spatial instructions and explicit region references within a single token sequence, combining RoIAlign\-extracted region features with language embeddings for precise region localization\. In contrast with the above approaches that design dedicated modules for handling visual prompts, ViP\-LLaVA \[[7](https://arxiv.org/html/2606.26196#bib.bib13)\] forgoes any extra modules and directly fuses visual prompts with the original image within the vision encoder using an alpha\-blending mechanism, thus supporting free\-form visual prompt integration\.
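
Alpha blending itself is a standard image operation, and a small sketch may help show why it needs no extra module. The code below is a generic illustration, not ViP\-LLaVA's code: it assumes a grayscale image stored as nested lists with values in [0, 1], a same-sized rendering of the visual prompt, and a binary mask marking where the prompt was drawn. The function name `alpha_blend` and the default `alpha` are hypothetical.

```python
from typing import List

Image = List[List[float]]  # grayscale image, values in [0, 1]


def alpha_blend(image: Image, prompt: Image, mask: List[List[int]],
                alpha: float = 0.5) -> Image:
    """Blend `prompt` into `image` where `mask` is nonzero; other pixels are unchanged."""
    blended = []
    for img_row, prompt_row, mask_row in zip(image, prompt, mask):
        blended.append([
            # Standard alpha compositing: alpha * overlay + (1 - alpha) * base.
            alpha * p + (1.0 - alpha) * i if m else i
            for i, p, m in zip(img_row, prompt_row, mask_row)
        ])
    return blended
```

Because the output is an ordinary image, it can be passed to an unmodified vision encoder, which is the property the paragraph above attributes to this family of approaches.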

Other works concentrate on language prompts\. Lion \[[16](https://arxiv.org/html/2606.26196#bib.bib6)\] integrates a Recognize Anything Model \(RAM\) \[[217](https://arxiv.org/html/2606.26196#bib.bib216)\] for extracting image tags as soft prompts and employs a Mixture\-of\-Adapters with a router mechanism\. The router dynamically fuses image\-level or region\-level features from different visual branches, enabling adaptive perception across diverse scenarios\. FlexCap \[[33](https://arxiv.org/html/2606.26196#bib.bib15)\] targets the captioning task by accepting object\-box coordinates and directly linearly projecting them into the LLM during training\.
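
A router that fuses features from several branches can be sketched as soft gating. This is an assumption made for illustration: the survey does not specify how Lion's router computes its weights, and the function names `softmax` and `route_and_fuse` are hypothetical. The sketch takes one feature vector per branch plus one router logit per branch, and returns the softmax-weighted sum.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_and_fuse(branch_features: Sequence[Sequence[float]],
                   router_logits: Sequence[float]) -> List[float]:
    """Fuse per-branch feature vectors as a weighted sum using router weights."""
    weights = softmax(router_logits)
    dim = len(branch_features[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, branch_features))
        for d in range(dim)
    ]
```

With equal logits the branches are averaged; as one logit grows, the output approaches that branch's features, which is what lets a router emphasize image-level or region-level cues per input.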

In addition, some works \[[92](https://arxiv.org/html/2606.26196#bib.bib113),[116](https://arxiv.org/html/2606.26196#bib.bib118)\] enhance fine\-grained perception without modifying the encoder architecture itself, but rather through instruction tuning or prompt engineering\. We discuss these strategies in Stage IVA \(§[6](https://arxiv.org/html/2606.26196#S6)\)\.

#### 3\.1\.3 Projector\-level or Connector\-level Optimizations

Other methods achieve region awareness via projector\-level or connector\-level optimizations\. Honeybee \[[12](https://arxiv.org/html/2606.26196#bib.bib11)\] proposes a novel projector design that is both flexible and locality\-enhanced\. ParGo \[[151](https://arxiv.org/html/2606.26196#bib.bib12)\] introduces a Partial\-Global projector that aligns two separately pre\-trained models by integrating partial and global views, mitigating overemphasis on prominent regions\.

Connector\-level approaches have likewise been developed to bridge vision encoders and LLMs: Groundhog \[[215](https://arxiv.org/html/2606.26196#bib.bib2)\] enhances holistic segmentation by employing a Masked Feature Extractor as a connector, together with a Mask Proposal Model and a Mask Retrieval Head\. It treats segmentation as a retrieval task over class\-agnostic entity mask proposals, thereby improving perceptual performance\. For video streams, CountLLM \[[189](https://arxiv.org/html/2606.26196#bib.bib100)\] inserts a periodicity transformer between the video encoder and LLM to achieve periodicity\-aware alignment for repetitive action counting; Elysium \[[153](https://arxiv.org/html/2606.26196#bib.bib169)\] employs a T\-Selector module to enable the MLLM to process a larger number of frames without significant performance degradation; and TimeChat \[[128](https://arxiv.org/html/2606.26196#bib.bib151)\] designs a Sliding Video Q\-Former with a Time\-Aware Frame Encoder that first extracts spatial tokens per frame, binds them with timestamp descriptions, and then uses a moving sliding window to establish temporal relations across varied\-length video tokens\.
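
The sliding-window step described for TimeChat can be illustrated with a small helper that splits a variable-length token sequence into overlapping windows. This is only a sketch of the windowing idea, not TimeChat's Q\-Former; the function name `sliding_windows` and the `window` and `stride` parameters are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(tokens: Sequence[T], window: int, stride: int) -> List[List[T]]:
    """Split a token sequence into overlapping windows; the last one may be shorter."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows: List[List[T]] = []
    for start in range(0, len(tokens), stride):
        windows.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
    return windows
```

Each window would then be processed with a fixed token budget, so a longer video simply yields more windows rather than a longer single input.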

### 3\.2 Integrated Perception Subnetworks

Beyond internal encoder modifications, another line of research treats the encoder as an integrated perception subnetwork\. These approaches aim to enhance perceptual capacity by either increasing the number and diversity of encoder branches \(§[3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1)\), or by composing multiple specialized encoders within the model architecture \(§[3\.2\.2](https://arxiv.org/html/2606.26196#S3.SS2.SSS2)\)\.

#### 3\.2\.1 Specialized Task Encoders

Some works focus on enhancing perception by introducing multiple specialized encoders, each tailored for specific task types\. Eagle \[[139](https://arxiv.org/html/2606.26196#bib.bib103)\], MouSi \[[35](https://arxiv.org/html/2606.26196#bib.bib104)\], and MoME \[[137](https://arxiv.org/html/2606.26196#bib.bib220)\] employ a mixture of specialized vision encoders, introducing distinct vision experts for different scenarios such as OCR, detection, and segmentation\. Visual features produced by these experts are fused and then projected into the MLLM, allowing the model to adaptively leverage multiple visual perspectives\. Other works target more specific scenarios: Sherlock \[[104](https://arxiv.org/html/2606.26196#bib.bib8)\] addresses video anomaly perception by introducing a Global\-Local Spatial\-Enhanced Mixture of Expert module with a Spatial Imbalance Regulator\. The former includes four spatial experts for extracting spatial information and an expert gate to balance global and local spatial cues, effectively tackling the challenge of global\-local spatial modeling\. Similarly, Points \[[97](https://arxiv.org/html/2606.26196#bib.bib7)\] employs a dual\-encoder branch that fuses representations from an OCR module and a general module\. RexSeek \[[62](https://arxiv.org/html/2606.26196#bib.bib33)\] builds upon ChatRex \[[61](https://arxiv.org/html/2606.26196#bib.bib4)\] by integrating a person detection model\.
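
The "fuse, then project" pattern for multiple vision experts can be sketched as per-token channel concatenation followed by a linear projection. This is one plausible fusion choice picked for illustration; the survey does not state which fusion operator each cited model uses. The sketch assumes every expert emits the same number of tokens, and the names `fuse_and_project` and `weight` are hypothetical.

```python
from typing import List, Sequence

Tokens = Sequence[Sequence[float]]  # [num_tokens][feature_dim]


def fuse_and_project(expert_tokens: Sequence[Tokens],
                     weight: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate each token's features across experts, then apply a linear map.

    expert_tokens: one [num_tokens][dim_e] feature grid per expert.
    weight: projection matrix of shape [sum(dim_e)][out_dim].
    """
    num_tokens = len(expert_tokens[0])
    out_dim = len(weight[0])
    projected = []
    for t in range(num_tokens):
        # Channel-wise concatenation of token t from every expert.
        concat = [value for expert in expert_tokens for value in expert[t]]
        projected.append([
            sum(c * weight[i][j] for i, c in enumerate(concat))
            for j in range(out_dim)
        ])
    return projected
```

In this arrangement the language model sees a single token sequence regardless of how many experts contributed, at the cost of a wider projection input.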

Building on these multi\-expert encoders, other works introduce auxiliary branches to incorporate external vision modules alongside the primary encoder\. MoAI \[[73](https://arxiv.org/html/2606.26196#bib.bib111)\] introduces a bypass branch for the vision encoder, directly invoking conventional computer vision models to process the input image\. The results are verbalized and compressed using a dedicated compressor before being embedded into the MLLM\. Myriad \[[81](https://arxiv.org/html/2606.26196#bib.bib9)\] leverages pre\-trained industrial anomaly detection \(IAD\) models to generate anomaly maps, which are fed into a specially designed vision encoder\. ROD\-MLLM \[[190](https://arxiv.org/html/2606.26196#bib.bib34)\] introduces an additional open\-vocabulary detection \(OVD\) module to recall candidate objects for ROI alignment\.
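
Verbalization, in the sense used above, means turning the structured output of an external vision model into text that a language model can read. The snippet below is a hypothetical illustration of that step only; it does not reproduce MoAI's templates or its compressor, and the function name `verbalize_detections`, the tuple layout, and `min_score` are assumptions.

```python
from typing import List, Tuple

Detection = Tuple[str, float, Tuple[int, int, int, int]]  # (label, score, box)


def verbalize_detections(detections: List[Detection],
                         min_score: float = 0.5) -> str:
    """Render detector outputs as a short sentence, skipping low-confidence boxes."""
    parts = []
    for label, score, (x1, y1, x2, y2) in detections:
        if score >= min_score:
            parts.append(f"{label} at [{x1}, {y1}, {x2}, {y2}]")
    if not parts:
        return "No objects detected."
    return "The image contains: " + "; ".join(parts) + "."
```

For example, a single confident detection of a dog with box (1, 2, 3, 4) is rendered as `The image contains: dog at [1, 2, 3, 4].`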

Moreover, IVE \[[47](https://arxiv.org/html/2606.26196#bib.bib105)\] combines both strategies: it calls specific visual tools to extract structured data as hard prompts and utilizes multi\-task encoders to obtain richer visual information, aggregating available visual information through a mixture\-of\-experts mechanism in both training and inference stages\.

#### 3\.2\.2 Multiple Visual\-modal Encoders

A related direction enriches perception by integrating specialized visual modalities \(e\.g\., depth, segmentation\) within the encoder architecture\. SpatialRGPT \[[26](https://arxiv.org/html/2606.26196#bib.bib32)\] introduces a flexible plugin module for 3D spatial awareness, combining mask/box feature extractors with RGB and depth connectors to capture spatially diverse information\. Similarly, VCoder \[[57](https://arxiv.org/html/2606.26196#bib.bib38)\] fuses multiple perception modalities during encoding, yielding richer and more comprehensive region representations\.

## 4 Stage II: Enhancing Perception via Decoder\-Centric Optimization

![Refer to caption](https://arxiv.org/html/2606.26196v1/decoder.jpg)

Figure 4: Decoder\-Centric Optimization Strategies for Multimodal Perception in MLLMs\. \(a\) Auxiliary decoder modules enhance perception by integrating task\-specific decoding heads alongside vision encoders\. \(b\) Multi\-decoder architectures employ multiple parallel task\-specific heads \(e\.g\., segmentation, detection, and others\) to support diverse perception tasks\. \(c\) Specialized decoding strategies leverage structured text outputs, which can either guide downstream modules \(e\.g\., SAM\) or serve as direct task outputs \(e\.g\., masks or boxes\)\.

While encoder modifications have advanced MLLMs to capture finer\-grained region\-level features, their analysis typically remains at the region level, limiting precise localization capabilities\. Stage II marks a pivotal shift: by introducing auxiliary decoders, MLLMs make a substantial leap from region\-level to true pixel\-level perception\. Although these decoders differ in their architectures and target tasks, they can be viewed as proxy objectives that leverage the MLLM’s vision\-language backbone to adapt to diverse, task\-specific requirements, thereby enhancing both specialized and general perception capabilities\. As shown in Fig\.[4](https://arxiv.org/html/2606.26196#S4.F4), we organize decoder\-centric approaches into three categories: Auxiliary Decoder \(§[4\.1](https://arxiv.org/html/2606.26196#S4.SS1)\), Multi\-Decoder Architectures \(§[4\.2](https://arxiv.org/html/2606.26196#S4.SS2)\), and Specialized Decoding Strategies \(§[4\.3](https://arxiv.org/html/2606.26196#S4.SS3)\)\.

### 4\.1 Auxiliary Decoder

This subsection surveys auxiliary decoder approaches that advance MLLM perception from region\-level analysis to fine\-grained pixel\-level understanding\. In the following, we detail these advances in three parts: Early Auxiliary Decoder: The LISA Paradigm \(§[4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1)\), Temporal and Multi\-Image Auxiliary Decoder \(§[4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2)\), and Auxiliary Decoder for Multi\-Granularity Perception \(§[4\.1\.3](https://arxiv.org/html/2606.26196#S4.SS1.SSS3)\)\.

#### 4\.1\.1 Early Auxiliary Decoder: The LISA Paradigm

A foundational line of early work introduces segmentation as an auxiliary decoding task in MLLMs, with LISA \[[71](https://arxiv.org/html/2606.26196#bib.bib64)\] serving as a pioneering example\. LISA enables the MLLM to generate a special ⟨seg⟩ token, whose last\-layer embedding is decoded into a pixel\-level segmentation mask using the Segment Anything Model \(SAM\) \[[69](https://arxiv.org/html/2606.26196#bib.bib222)\]\. This marks the first end\-to\-end segmentation capability within the MLLM framework\.
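The ⟨seg⟩ mechanism described above can be illustrated with a minimal, dependency-free sketch. It is not LISA's implementation: the token id `SEG_TOKEN_ID`, the helper names, and the dot-product "decoder" are all assumptions made for illustration, with the dot product standing in for the SAM-based mask decoding that the survey describes. The sketch only shows the data flow: locate the ⟨seg⟩ position, take its last-layer hidden state, and use it as a query against per-pixel features.

```python
from typing import List, Sequence

SEG_TOKEN_ID = 32000  # hypothetical vocabulary id for the added <seg> token


def seg_query(token_ids: Sequence[int],
              last_hidden: Sequence[Sequence[float]]) -> List[float]:
    """Return the last-layer hidden state at the first <seg> position."""
    pos = list(token_ids).index(SEG_TOKEN_ID)  # ValueError if no <seg> was generated
    return list(last_hidden[pos])


def decode_mask(query: Sequence[float],
                pixel_feats: Sequence[Sequence[Sequence[float]]]) -> List[List[int]]:
    """Toy stand-in for a mask decoder (assumption, not SAM): a pixel is
    foreground when the dot product of its feature and the query is positive."""
    return [[1 if sum(q * f for q, f in zip(query, feat)) > 0 else 0
             for feat in row]
            for row in pixel_feats]


# Tiny example: 4 generated tokens, 2-dimensional hidden states, 2x2 feature map.
token_ids = [1, 5, SEG_TOKEN_ID, 2]
last_hidden = [[0.0, 0.0], [0.0, 0.0], [1.0, -1.0], [0.0, 0.0]]
pixel_feats = [[[1.0, 0.0], [0.0, 1.0]],
               [[2.0, 1.0], [0.5, 2.0]]]
mask = decode_mask(seg_query(token_ids, last_hidden), pixel_feats)
print(mask)  # [[1, 0], [1, 0]]
```

In a real system the query would typically pass through a learned projection and a trained mask decoder; the thresholded dot product here only makes the role of the token embedding visible.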

Building on this foundation, a growing body of work has extended the auxiliary decoder paradigm in multiple directions\. One prominent line of research focuses on enhancing multimodal input capabilities\. GLaMM\[[126](https://arxiv.org/html/2606.26196#bib.bib39)\]builds on LISA by incorporating multiple encoders—global, regional, and grounding image encoders—thereby supporting both textual and visual prompts as well as multi\-region pixel\-level grounding\. PerceptionGPT\[[117](https://arxiv.org/html/2606.26196#bib.bib70)\]follows a similar route, using lightweight visual encoders and decoders to project visual features into the LLM’s embedding space as discrete tokens, allowing visual information to be handled alongside text\. OMG\-LLaVA\[[213](https://arxiv.org/html/2606.26196#bib.bib41)\]employs a universal perception module\[[79](https://arxiv.org/html/2606.26196#bib.bib224)\]as its visual encoder, integrating image information, perception priors, and visual prompts into unified visual tokens for the LLM\. AnyRef\[[45](https://arxiv.org/html/2606.26196#bib.bib42)\]proposes a Unified Referring Representation method to encode references from multiple modalities into LLM\-aligned embeddings\. A Refocusing Mechanism is further employed to enrich these embeddings with grounded textual cues, offering enhanced representational power\.

Another thread of research seeks to broaden the task scope of segmentation\. PSALM \[[219](https://arxiv.org/html/2606.26196#bib.bib63)\] expands on LISA’s framework by introducing a conditional prompt: the output embedding of this prompt serves as classifier weights for predicting segmentation mask categories\. This approach enables the model to generalize from the referring segmentation task to a variety of segmentation tasks\. ChatterBox \[[147](https://arxiv.org/html/2606.26196#bib.bib75)\] targets multi\-round referring and grounding tasks by using the ⟨gnd⟩ token as an LLM query to guide DINO \[[210](https://arxiv.org/html/2606.26196#bib.bib223)\] in decoding object locations\.

In parallel, several efforts aim to enhance localization without compromising the original dialogue and reasoning abilities of the MLLM\. LISA\+\+ \[[186](https://arxiv.org/html/2606.26196#bib.bib52)\] strengthens reasoning instance segmentation and, through its Segmentation in Dialogue ability, produces more natural text responses\. LLaVASeg \[[188](https://arxiv.org/html/2606.26196#bib.bib54)\] employs a chain\-of\-thought prompting strategy with lightweight adapters to instruct MLLMs to segment user\-specified target regions, maintaining their dialogue capabilities while equipping them with strong reasoning\-driven segmentation ability\. Similarly, MIRSA \[[6](https://arxiv.org/html/2606.26196#bib.bib57)\] combines segmentation with multi\-turn interaction, along with LLM\-based reasoning quality evaluation metrics\. SegLLM \[[165](https://arxiv.org/html/2606.26196#bib.bib66)\] addresses multi\-round interactive image reasoning segmentation by introducing a mask\-aware decoding scheme and four special tokens, which enable the model to generate new masks based on both current LLM outputs and the memory of previous masks\.

Beyond expanding capability and versatility, some works pursue enhanced segmentation accuracy and efficiency\. PixelLM \[[129](https://arxiv.org/html/2606.26196#bib.bib44)\] proposes a lightweight decoder and a comprehensive segmentation codebook\. The decoder utilizes the hidden embeddings of codebook tokens, encoding detailed and target\-specific information to produce segmentation masks quickly and accurately within the MLLM\. ROSE \[[44](https://arxiv.org/html/2606.26196#bib.bib50)\] targets open\-set dense segmentation by treating each image patch as an independent region\-of\-interest candidate\. This design enables the model to perform patch\-wise perception, mask and category decoding, thereby achieving both dense and sparse mask predictions\. GSVA \[[178](https://arxiv.org/html/2606.26196#bib.bib65)\] augments the decoder with multiple ⟨seg⟩ tokens and a dedicated ⟨rej⟩ token, allowing the model to explicitly reject null targets\. Similarly, LaSagnA \[[168](https://arxiv.org/html/2606.26196#bib.bib67)\] employs both ⟨seg⟩ and ⟨rej⟩ tokens to effectively handle complex queries\.

#### 4\.1\.2 Temporal and Multi\-Image Auxiliary Decoder

Some works extend the auxiliary decoder beyond single\-frame inputs to sequences or multi\-image contexts, enabling pixel\-level segmentation and tracking across time and views\.

A strand of research pursues universal segmentation across images and videos by explicitly extracting temporal structure from video\. InstructSeg \[[170](https://arxiv.org/html/2606.26196#bib.bib59)\] employs an object\-aware video perceiver to extract temporal and object cues from reference frames and uses vision\-guided multi\-granularity text fusion to integrate global and detailed prompt information, enabling both referring and reasoning segmentation across images and videos\. VRS\-HQ \[[39](https://arxiv.org/html/2606.26196#bib.bib60)\] introduces Temporal Dynamic Aggregation and token\-driven keyframe selection, and augments the conventional ⟨seg⟩ token with an additional temporal ⟨tak⟩ token\. ThinkVideo \[[67](https://arxiv.org/html/2606.26196#bib.bib62)\] introduces a multi\-agent, zero\-shot framework that leverages the chain\-of\-thought \(CoT\) reasoning abilities of MLLMs for enhanced video segmentation\. The method extracts object selectivities for keyframes and links a reasoning segmentation model with the SAM2 \[[127](https://arxiv.org/html/2606.26196#bib.bib262)\] video processor to produce mask sequences\. ViLLa \[[225](https://arxiv.org/html/2606.26196#bib.bib48)\] employs a key segment extractor, context synthesizer, and hierarchical temporal synchronizer to align video\-level and frame\-level segmentation tokens, producing coherent multi\-frame segmentations\. VISA \[[183](https://arxiv.org/html/2606.26196#bib.bib76)\] introduces a text\-guided frame sampler to select the most distinctive frame as the segmentation target, along with corresponding reference frames\.

A complementary direction seeks to cast heterogeneous tasks into a single interaction paradigm\. Sa2VA\[[199](https://arxiv.org/html/2606.26196#bib.bib43)\]unifies various tasks—spanning static and dynamic visual understanding—under a single instruction\-tuning process, encoding all inputs \(text, visual prompts, images, and videos\) as token embeddings for comprehensive multimodal grounding\. HyperSeg\[[169](https://arxiv.org/html/2606.26196#bib.bib73)\]augments an MLLM with a universal segmentation framework that couples a hybrid entity\-recognition module with a fine\-grained visual perceiver\. With an additional temporal adapter, the framework extends to challenging video tasks\.

Distinct from the two branches above, another line of work tailors auxiliary decoders to video\-specific scenarios\. TrackGPT \[[229](https://arxiv.org/html/2606.26196#bib.bib58)\] integrates segmentation into tracking by introducing a self\-correcting rethinking mechanism that revises predictions deviating from the intended instruction, and a cross\-frame referring propagation module that leverages cues from adjacent frames to ensure robust and accurate tracking\. MoRA \[[28](https://arxiv.org/html/2606.26196#bib.bib61)\] targets motion\-grounded video reasoning by introducing a ⟨loc⟩ token for temporal information embedding and a temporal localization head that decodes binary temporal masks to refine the raw outputs from SAM\. ReVIOSa \[[65](https://arxiv.org/html/2606.26196#bib.bib46)\] targets interaction\-aware referring video object segmentation and introduces two special tokens, ⟨seg\_act⟩ and ⟨seg\_tar⟩, to separately segment the referred subject and the interacting object\. GLUS \[[87](https://arxiv.org/html/2606.26196#bib.bib47)\] integrates global and local reasoning into a single MLLM for referring video object segmentation with pre\-trained memory modules, and introduces plug\-and\-play self\-refinement via key\-frame selection and object\-contrastive loss\.

Beyond videos, PRIMA\[[149](https://arxiv.org/html/2606.26196#bib.bib77)\]targets multi\-image pixel\-grounded reasoning by combining a DINOv2\-based vision encoder for dense semantics with a Q\-Former’s selective cross\-attention, dynamically generating masks for objects and parts referenced in natural language queries\. CALICO\[[112](https://arxiv.org/html/2606.26196#bib.bib226)\]addresses part\-focused semantic co\-segmentation across multiple images by integrating a Correspondence Extraction Module, which captures semantically rich part correspondences, and Correspondence Adaptation Modules, which inject this information into the MLLM’s representation to enable efficient and accurate co\-segmentation\.

#### 4\.1\.3 Auxiliary Decoder for Multi\-Granularity Perception

Beyond conventional single\-scale segmentation, several approaches have introduced auxiliary decoders designed for multi\-granularity perception, further enhancing the ability of MLLMs to process both coarse global context and fine local details\. UniRES\+\+ \[[93](https://arxiv.org/html/2606.26196#bib.bib78)\] is among the first MLLMs explicitly designed for multi\-granularity referring expression segmentation\. It features a Multi\-Granularity Vision Flow for capturing multi\-level visual features, a Grounding Encoder for foundational representations, an LLM backbone, dynamic Multi\-Granularity Feature Exploitation for adaptive feature selection, and a Pixel Decoder for segmentation mask generation\. Similarly, M²SA \[[58](https://arxiv.org/html/2606.26196#bib.bib176)\] addresses multi\-target, object\-level, and part\-level reasoning segmentation by introducing early local feature fusion and employing multiple ⟨seg⟩ tokens\. Likewise, MGLMM \[[200](https://arxiv.org/html/2606.26196#bib.bib228)\] introduces multiple ⟨seg⟩ tokens along with an additional projector to better align image features with the linguistic modality, enabling seamless switching between multi\-granularity segmentation and captioning tasks\.

### 4\.2 Multi\-Decoder Architectures

In this line of work, the MLLM serves as a central backbone, equipped with multiple task\-specific decoders to accomplish diverse vision tasks within a unified framework\.

NExT\-Chat\[[206](https://arxiv.org/html/2606.26196#bib.bib71)\]and u\-LLaVA\[[180](https://arxiv.org/html/2606.26196#bib.bib177)\]are two early efforts exploring multi\-task perception within MLLMs\. Specifically, NExT\-Chat\[[206](https://arxiv.org/html/2606.26196#bib.bib71)\]prompts the LMM to output location embeddings, which are then decoded through a box decoder and a mask decoder to support a range of captioning and grounding tasks\. Similarly, u\-LLaVA\[[180](https://arxiv.org/html/2606.26196#bib.bib177)\]incorporates both a pixel\-level decoder and a region\-level decoder, along with corresponding projectors, to enable multi\-level visual understanding\.

Subsequent works further advance toward handling more complex entities and employing a broader range of decoders\. Lumen\[[64](https://arxiv.org/html/2606.26196#bib.bib72)\]introduces a peak point selection mechanism that parses the heatmap into a set of points, each representing the center of an identified object or keypoint\. These points are then processed through a box decoder and a promptable mask decoder, enabling task\-specific decoding\. VisionLLM v2\[[173](https://arxiv.org/html/2606.26196#bib.bib69)\]introduces a super link mechanism that enables flexible information flow between the MLLM and various task\-specific decoders\. This design unifies visual perception, understanding, and generation within a single, cohesive framework\. Vitron\[[37](https://arxiv.org/html/2606.26196#bib.bib45)\]extends the multi\-decoder architecture to both images and videos, enabling an LLM\-to\-decoder instruction\-passing mechanism that operates over both discrete textual inputs and continuous signal embeddings\. REF\-VLM\[[144](https://arxiv.org/html/2606.26196#bib.bib51)\]boosts multi\-task performance by integrating Mask\-Guided Aggregation, a Latent Embeddings Router, and Parallel Group Hungarian Matching\. It further introduces a Triplet\-Based Referring Paradigm, which decouples key visual decoding dimensions using symbolic delimiters to enhance output structure and interpretability\.
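To make the idea of parsing a heatmap into points concrete, the following sketch picks strict local maxima above a threshold. It is an illustration only: the survey states that Lumen's peak point selection turns the heatmap into a set of points, but the 8\-neighbour local\-maximum rule, the `threshold` parameter, and the function name `select_peaks` are assumptions rather than the published procedure.

```python
from typing import List, Sequence, Tuple


def select_peaks(heatmap: Sequence[Sequence[float]],
                 threshold: float = 0.5) -> List[Tuple[int, int, float]]:
    """Return (x, y, score) for every cell that is at least `threshold` and
    strictly greater than all of its (up to 8) neighbours."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = heatmap[y][x]
            if value < threshold:
                continue
            neighbours = [heatmap[ny][nx]
                          for ny in range(max(0, y - 1), min(height, y + 2))
                          for nx in range(max(0, x - 1), min(width, x + 2))
                          if (ny, nx) != (y, x)]
            if all(value > n for n in neighbours):
                peaks.append((x, y, value))
    return peaks


heatmap = [[0.1, 0.2, 0.1, 0.0],
           [0.2, 0.9, 0.2, 0.1],
           [0.1, 0.2, 0.1, 0.6],
           [0.0, 0.1, 0.3, 0.2]]
print(select_peaks(heatmap))  # [(1, 1, 0.9), (3, 2, 0.6)]
```

Each returned point could then be handed to a box decoder or a promptable mask decoder as a point prompt, which is the role the text assigns to the selected peaks.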

### 4\.3 Specialized Decoding Strategies

This line of work focuses on task\-specific decoding strategies designed for particular perception challenges\. Here, decoding refers to the process in which the MLLM generates structured text outputs that are either directly mapped to task\-specific results—such as segmentation masks, bounding boxes, or object relations \(§[4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1)\), or used as prompts to drive dedicated downstream vision modules \(§[4\.3\.2](https://arxiv.org/html/2606.26196#S4.SS3.SSS2)\)\. We summarize representative approaches in both directions as follows\.

#### 4\.3\.1 Direct Structured Output for Task Completion

Several methods leverage the MLLM’s ability to directly output structured results for vision tasks\. BuboGPT \[[223](https://arxiv.org/html/2606.26196#bib.bib198)\] addresses grounding by instructing the LLM to output region captions in text, which are then matched to entities detected by RAM \[[217](https://arxiv.org/html/2606.26196#bib.bib216)\] and Grounding DINO \[[95](https://arxiv.org/html/2606.26196#bib.bib260)\]; a subsequent entity\-matching module aligns the textual outputs with semantic regions in the image\. LLaVA\-Grounding \[[211](https://arxiv.org/html/2606.26196#bib.bib68)\] also focuses on grounding, directly projecting the language form response into a grounding model to produce bounding boxes\. ASMv2 \[[163](https://arxiv.org/html/2606.26196#bib.bib98)\] further explores region\-level and relation\-aware predictions, where the MLLM is tasked with outputting object coordinates and predicting inter\-object relations using special ⟨pre⟩ tokens\.

For segmentation, LLaFS\[[230](https://arxiv.org/html/2606.26196#bib.bib91)\]enables end\-to\-end few\-shot segmentation by letting the LLM produce segmentation polygons and constructing a region\-attribute table to provide multi\-modal guidance simulating human visual mechanisms\. LlamaSeg\[[30](https://arxiv.org/html/2606.26196#bib.bib89)\]reformulates image segmentation as a visual generation problem, representing masks as visual tokens and employing a LLaMA\-style\[[148](https://arxiv.org/html/2606.26196#bib.bib261)\]Transformer to predict them directly from image inputs\. VistaLLM\[[118](https://arxiv.org/html/2606.26196#bib.bib80)\]introduces an instruction\-guided image tokenizer, unifying multi\-level vision\-language tasks within a general\-purpose framework\. Specifically for segmentation, it employs a gradient\-aware adaptive sampling technique to efficiently represent segmentation masks as sequences of points\. UFO\[[145](https://arxiv.org/html/2606.26196#bib.bib74)\]unifies diverse fine\-grained perception tasks within the same open\-ended language interface as vision\-language tasks, without relying on task\-specific decoders\. Additionally, by reformulating segmentation as an embedding retrieval problem, it flexibly supports mask prediction within this unified framework\. RAS\[[10](https://arxiv.org/html/2606.26196#bib.bib229)\]addresses omnimodal referring expression segmentation by first generating candidate mask groups with a segmentation model, and then tasking the mask\-centric LMM to select the appropriate mask from this pool according to vision\-language prompts\. HiMTok\[[161](https://arxiv.org/html/2606.26196#bib.bib175)\]introduces an efficient hierarchical mask tokenizer that represents mask images as hierarchical tokens, which adopts a three\-stage training procedure and eliminates the need for any external segmentation foundation model\.
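Several of the methods above have the language model emit a mask as text, for example as polygons or as a sequence of points. The sketch below shows one way such text could be turned back into a binary mask. The `(x,y)` text format, the function names, and the even\-odd ray\-casting rasterizer are assumptions for illustration and do not reproduce the output format of LLaFS, VistaLLM, or any other cited system.

```python
import re
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def parse_polygon(text: str) -> List[Point]:
    """Extract every '(x, y)' pair from free-form model output."""
    pattern = r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
    return [(float(x), float(y)) for x, y in re.findall(pattern, text)]


def point_in_polygon(px: float, py: float, poly: Sequence[Point]) -> bool:
    """Even-odd rule: count polygon edges crossed by a ray going right."""
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > py) != (y2 > py):  # edge straddles the ray, so y1 != y2
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def rasterize(poly: Sequence[Point], width: int, height: int) -> List[List[int]]:
    """Mark a pixel as foreground when its centre lies inside the polygon."""
    return [[1 if point_in_polygon(x + 0.5, y + 0.5, poly) else 0
             for x in range(width)]
            for y in range(height)]


llm_output = "polygon: (1,1) (3,1) (3,3) (1,3)"  # hypothetical model response
mask = rasterize(parse_polygon(llm_output), width=4, height=4)
for row in mask:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 1, 0]
# [0, 1, 1, 0]
# [0, 0, 0, 0]
```

The appeal of this family of designs is that no task\-specific decoder is needed; the cost is that mask fidelity is bounded by how many points the model can reliably emit.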

#### 4\.3\.2 Guiding Downstream Modules via Structured Prompts

Other works exploit the MLLM’s structured text outputs as prompts or intermediate guidance for external modules\. SeSaMe \[[176](https://arxiv.org/html/2606.26196#bib.bib56)\] adopts a cascading approach for the False Premise Correction task, splitting the process into “see”, “say”, and “segment” stages, where the first two stages generate language prompts that drive LISA \[[71](https://arxiv.org/html/2606.26196#bib.bib64)\] for final segmentation\. LLM\-Seg \[[157](https://arxiv.org/html/2606.26196#bib.bib10)\] reformulates segmentation as a selection task: the LLM, in tandem with mask proposals generated by SAM and DINO, produces ⟨seg⟩ tokens that interact with a fusion module and MLP for threshold\-based mask selection\. HRSeg \[[90](https://arxiv.org/html/2606.26196#bib.bib268)\] further extends LLM\-Seg \[[157](https://arxiv.org/html/2606.26196#bib.bib10)\] to high\-resolution settings by introducing two additional modules\. SAM4MLLM \[[24](https://arxiv.org/html/2606.26196#bib.bib81)\] tackles referring expression segmentation by letting the LLM predict prompt points for SAM, thus achieving segmentation without modifying the original MLLM architecture\. GroundedVideoLLM \[[152](https://arxiv.org/html/2606.26196#bib.bib168)\] extends this idea to the temporal domain, introducing special temporal tokens that share an embedding space with the LLM and employing a segment\-wise encoding strategy for temporal awareness\. Text4Seg \[[72](https://arxiv.org/html/2606.26196#bib.bib94)\] introduces a text\-as\-mask paradigm\. It encodes segmentation masks as semantic descriptors \(a textual sequence\), allowing the LLM to autoregressively predict the class for each image patch, with SAM as mask refiner\. READ \[[120](https://arxiv.org/html/2606.26196#bib.bib55)\] introduces the Similarity as Points \(SasP\) module, in which ⟨seg⟩ and image tokens are used to compute similarity scores that identify salient locations in the image\. These selected points are encoded as sparse embeddings, which are then transformed into continuous attention maps using Gaussian\-weighted interpolation, guiding the model toward more precise segmentation and reasoning\.
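The similarity\-to\-points step described for READ can be sketched in a few lines. This is a hedged illustration, not the SasP module itself: cosine similarity, top\-k selection, and the names `cosine` and `similarity_points` are assumptions, and the later Gaussian\-weighted interpolation into attention maps is not modelled.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_points(seg_vec: Sequence[float],
                      patch_tokens: Sequence[Sequence[float]],
                      grid_width: int,
                      k: int = 2) -> List[Tuple[int, int]]:
    """Score each image patch token against the <seg> embedding and return
    the (x, y) grid coordinates of the k most similar patches."""
    scores = [cosine(seg_vec, patch) for patch in patch_tokens]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(i % grid_width, i // grid_width) for i in top]


# 2x2 patch grid flattened row by row; the <seg> embedding points along axis 0.
seg_vec = [1.0, 0.0]
patch_tokens = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
print(similarity_points(seg_vec, patch_tokens, grid_width=2))  # [(1, 0), (0, 1)]
```

Points obtained this way could serve as point prompts for a promptable segmenter such as SAM, which is the general pattern this subsection describes.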

## 5 Stage III: Dynamic Perception via Adaptive Processing

![Refer to caption](https://arxiv.org/html/2606.26196v1/dynamic_perception.jpg)Figure 5: Dynamic Perception via Adaptive Processing\. \(a\) External Tool Scheduling: The LLM decides which vision tools to invoke \(e\.g\., OCR, localization, etc\.\) and integrates their outputs back into its reasoning loop\. \(b\) LLM\-Centric Routing: The LLM emits structured tags \(e\.g\., visual token, box token\) to direct specific image regions into corresponding vision modules, whose results are fed back for a final answer\. \(c\) Code\-as\-Policy for Perception Tasks: The LLM generates executable Python‐style functions \(e\.g\., compute\_depth, find, exists\), runs them on the image, and uses their outputs to produce the final response\.

Stage I and Stage II, whether through encoder\-centric or decoder\-centric modifications, have enabled MLLMs to achieve strong perceptual capabilities, progressing from region\-level to pixel\-level understanding\. However, a core limitation of these approaches is their reliance on statically encoding image information in a single pass, restricting the amount and richness of visual input that the model can access\.

Stage III marks a paradigm shift, introducing dynamic perception to MLLMs\. Instead of static, fixed\-pipeline processing, MLLMs now adopt adaptive, context\-aware perception strategies\. This new dynamic framework empowers MLLMs with greater flexibility and enhanced perceptual abilities across a diverse array of multimodal tasks\. As illustrated in Fig\. [5](https://arxiv.org/html/2606.26196#S5.F5), we categorize dynamic perception methods into three major paradigms: External Tool Scheduling \(§[5\.1](https://arxiv.org/html/2606.26196#S5.SS1)\), LLM\-Centric Routing \(§[5\.2](https://arxiv.org/html/2606.26196#S5.SS2)\), and Code\-as\-Policy \(§[5\.3](https://arxiv.org/html/2606.26196#S5.SS3)\)\.

### 5\.1 External Tool Scheduling

This section introduces multi\-stage tool scheduling, in which external tools are invoked and their outputs—returned in specific formats—are integrated to provide additional visual information, thereby enhancing the execution of vision\-language tasks\. We organize these works into two categories: Fixed Step Scheduling \(§[5\.1\.1](https://arxiv.org/html/2606.26196#S5.SS1.SSS1)\) and Dynamic Expert Selection and Execution \(§[5\.1\.2](https://arxiv.org/html/2606.26196#S5.SS1.SSS2)\)\.

#### 5\.1\.1 Fixed Step Scheduling

Some early works decompose vision\-language tasks into multiple fixed stages, leveraging chain\-of\-thought prompting\[[171](https://arxiv.org/html/2606.26196#bib.bib230)\]or orchestrating multiple LLM agents to complete the task along a predefined workflow\. LLaVA\-Plus\[[94](https://arxiv.org/html/2606.26196#bib.bib110)\]extends LLaVA\[[92](https://arxiv.org/html/2606.26196#bib.bib113)\]by integrating a large and diverse suite of external tools, which can be selectively composed and activated to solve real\-world compositional tasks\. The system supports API parameterization for invoking corresponding function arguments as needed\. As a result, LLaVA\-Plus demonstrates four major categories of individual skills—including understanding, external knowledge retrieval, visual prompting, and generation—as well as a variety of composed skills through flexible tool chaining and model orchestration\.

For task\-specific methods, DetToolChain \[[177](https://arxiv.org/html/2606.26196#bib.bib108)\] tackles zero\-shot object detection with a detection prompting toolkit paradigm, where MLLMs are guided to perceive objects through region focusing, progressive prediction refinement, and contextual inference\. P²G \[[18](https://arxiv.org/html/2606.26196#bib.bib178)\] presents a plug\-and\-play framework for grounded reasoning in high\-resolution, text\-rich images\. The agent adaptively decides whether to retrieve additional OCR or detection results as clues, and then fuses these clues into a multimodal prompt to support accurate, grounded reasoning with the MLLM\. ThinkFirst \[[66](https://arxiv.org/html/2606.26196#bib.bib97)\] introduces a training\-free reasoning segmentation framework that uses CoT\-guided prompts to analyze images, generating a detailed summary that, combined with the user query, directs the segmentation model\. VTPrompt \[[63](https://arxiv.org/html/2606.26196#bib.bib102)\] leverages GPT to extract key concepts from textual questions, which are then used to guide a detection model\. The detection results are subsequently fed back to GPT for step\-by\-step reasoning, leading to improved zero\-shot object\-oriented perception performance\.

#### 5\.1\.2 Dynamic Expert Selection and Execution

Some methods implement dynamic perception and planning in reasoning by first selecting the appropriate expert or tool for each task and then executing it in a targeted manner\. IoT \[[228](https://arxiv.org/html/2606.26196#bib.bib109)\] uses Image\-of\-Thought prompting to let the MLLM autonomously plan and execute image processing steps, generating visual rationales that are combined with text to form comprehensive multimodal explanations\. VipAct \[[218](https://arxiv.org/html/2606.26196#bib.bib161)\] proposes an agent framework that enhances VLMs by integrating multi\-agent collaboration and vision expert models\. The framework features an orchestrator agent responsible for planning and coordination, a set of specialized agents for handling specific tasks, and vision expert models that deliver high\-precision perceptual information\. Similarly, TACO \[[107](https://arxiv.org/html/2606.26196#bib.bib156)\] extends the CoT paradigm to chains\-of\-thought\-and\-action, leveraging a variety of pre\-designed vision\-centric and vision\-language tools\. During inference, the model executes intermediate steps by invoking these external tools, integrating both reasoning and action outputs to generate coherent responses\.
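The select\-then\-execute pattern described above can be sketched as a small dispatch loop. Everything in it is hypothetical: `run_plan`, the keyword\-based planner, and the stub tools stand in for an MLLM planner and real vision experts, and none of it reproduces the interfaces of IoT, VipAct, or TACO.

```python
from typing import Any, Callable, Dict, List, Tuple

Plan = List[Tuple[str, Dict[str, Any]]]


def run_plan(question: str,
             planner: Callable[[str, List[str]], Plan],
             tools: Dict[str, Callable[..., Any]]) -> List[Tuple[str, Any]]:
    """Ask the planner which tools to call, run them in order, and collect
    (tool_name, observation) pairs for the model to reason over."""
    observations = []
    for name, kwargs in planner(question, sorted(tools)):
        tool = tools.get(name)
        if tool is None:
            observations.append((name, "unknown tool"))
        else:
            observations.append((name, tool(**kwargs)))
    return observations


# Hypothetical stand-ins for vision experts.
def fake_ocr(region: str) -> str:
    return f"text in {region}"


def fake_detector(label: str) -> List[Tuple[int, int, int, int]]:
    return [(0, 0, 10, 10)] if label == "sign" else []


def keyword_planner(question: str, available: List[str]) -> Plan:
    """Toy planner: an MLLM would produce this plan in a real system."""
    plan: Plan = []
    if "read" in question and "ocr" in available:
        plan.append(("ocr", {"region": "full image"}))
    if "sign" in question and "detect" in available:
        plan.append(("detect", {"label": "sign"}))
    return plan


tools = {"ocr": fake_ocr, "detect": fake_detector}
print(run_plan("read the sign", keyword_planner, tools))
# [('ocr', 'text in full image'), ('detect', [(0, 0, 10, 10)])]
```

The difference from fixed step scheduling is that the plan depends on the question: a query that needs no tool yields an empty plan and no tool calls.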

### 5\.2 LLM\-Centric Routing

The previous paradigm focused on scheduling, where the LLM served as a coordinator: whether following fixed\-step pipelines or planning before selecting experts, the model did not actively engage in perception\. In contrast, this paradigm emphasizes LLM\-driven active routing of perception, where the LLM autonomously selects and routes perceptual processes based on context, enabling more proactive and dynamic control over multimodal information flow\. We elaborate in three parts: Step\-by\-step visual reasoning \(§[5\.2\.1](https://arxiv.org/html/2606.26196#S5.SS2.SSS1)\), Multi\-round iterative visual search \(§[5\.2\.2](https://arxiv.org/html/2606.26196#S5.SS2.SSS2)\), and Detail enhancement via re\-encoding \(§[5\.2\.3](https://arxiv.org/html/2606.26196#S5.SS2.SSS3)\)\.

#### 5\.2\.1 Step\-by\-step visual reasoning

Some studies focus on step\-by\-step visual understanding by employing chain\-of\-thought\-like reasoning\.

Focus\-and\-zoom reasoning on a single image progressively narrows attention: SEAL \[[175](https://arxiv.org/html/2606.26196#bib.bib86)\] integrates an MLLM with a localization module composed of an image backbone, a target localization decoder, and a search cue localization decoder\. The MLLM generates additional ⟨loc⟩ tokens, which are fed to the respective decoders to produce target coordinates and search cue heatmaps\. The system iteratively performs visual search until the desired criteria are met\. DualFocus \[[11](https://arxiv.org/html/2606.26196#bib.bib85)\] decomposes the visual reasoning process into macro and micro perspectives\. The model first examines the image from a macro perspective to answer the question and then identifies relevant sub\-regions for focused micro\-level analysis\. CoS \[[101](https://arxiv.org/html/2606.26196#bib.bib84)\] introduces Chain\-of\-Spot, which enhances feature extraction by interactively focusing on key regions of interest within the image according to the given instructions\. CoReS \[[3](https://arxiv.org/html/2606.26196#bib.bib53)\] introduces the Chains of Reasoning and Segmenting through a dual\-chain structure that generates multi\-modal, chain\-like outputs to aid the segmentation process\. TextCoT \[[102](https://arxiv.org/html/2606.26196#bib.bib148)\] decomposes text\-rich image understanding into three stages: Image Overview, Coarse Localization, and Fine\-grained Observation\. The first two stages generate a global context description and identify an answer region, which is then cropped and sent back to the LLM for further analysis, ultimately enabling a more accurate final response\.
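The overview, localize, crop, and re\-read flow that TextCoT exemplifies can be sketched as follows. The three callables `describe`, `locate`, and `answer` are hypothetical placeholders for MLLM calls, the image is a plain nested list, and the fixed pixel `margin` is an assumption; the sketch only shows how the crop connects the coarse and fine stages.

```python
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def crop(image: Sequence[Sequence[Any]], box: Box, margin: int = 0) -> List[List[Any]]:
    """Crop `box` from a row-major image, padded by `margin` pixels and
    clamped to the image bounds."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0 - margin), max(0, y0 - margin)
    x1, y1 = min(width, x1 + margin), min(height, y1 + margin)
    return [list(row[x0:x1]) for row in image[y0:y1]]


def three_stage_answer(image: Sequence[Sequence[Any]],
                       question: str,
                       describe: Callable[..., str],
                       locate: Callable[..., Box],
                       answer: Callable[..., Any],
                       margin: int = 1) -> Any:
    overview = describe(image)                    # stage 1: global context
    box = locate(image, question, overview)       # stage 2: coarse answer region
    region = crop(image, box, margin)             # zoom into that region
    return answer(region, question, overview)     # stage 3: fine-grained reading


# Toy image whose pixel value encodes its position (10 * y + x).
image = [[10 * y + x for x in range(4)] for y in range(4)]
result = three_stage_answer(
    image, "what is in the bottom-right corner?",
    describe=lambda img: "a 4x4 grid",
    locate=lambda img, q, overview: (2, 2, 4, 4),
    answer=lambda region, q, overview: region,
)
print(result)  # [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
```

Padding the predicted box slightly is a common way to tolerate imprecise coarse localization, at the cost of a somewhat larger crop.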

Action\-centric chains drive progress through explicit visual operations or object\-level structure: CogCoM \[[119](https://arxiv.org/html/2606.26196#bib.bib90)\] proposes Chain of Manipulations, which enables VLMs to solve problems step\-by\-step by actively manipulating visual inputs as evidence without relying on external tools\. VolCano \[[82](https://arxiv.org/html/2606.26196#bib.bib225)\] proposes a multi\-step, visually\-grounded, object\-centric chain\-of\-thought reasoning framework\. This approach constructs object\-centric reasoning paths based on shared object\-level information across modalities, while also providing visually grounded representations of object concepts through interleaved and aligned multimodal features\.

Beyond single images, CMMCoT \[[209](https://arxiv.org/html/2606.26196#bib.bib158)\] targets multi\-image understanding and mimics human\-like slow thinking by integrating coordinate\-guided visual token extraction and a Retrieval\-based Image Feature Reasoning Enhancement Module\. It mitigates error accumulation and enhances cross\-image visual concept tracking\.

#### 5\.2\.2 Multi\-round iterative visual search

Other works emphasize decomposing perception into multiple rounds of visual search, progressively refining the analysis through iterative exploration\. ZoomEye\[[136](https://arxiv.org/html/2606.26196#bib.bib87)\]introduces a tree search algorithm designed to navigate the hierarchical and visual nature of images to capture relevant information, enabling MLLMs to simulate human zooming actions by searching along the image tree\. Similarly, DyFo\[[75](https://arxiv.org/html/2606.26196#bib.bib83)\]introduces a training\-free dynamic visual search approach based on Monte Carlo Tree Search, enabling the model to navigate visual spaces using focus nodes that amplify critical information and filter out irrelevant input\. Zoom\-Refine\[[198](https://arxiv.org/html/2606.26196#bib.bib181)\]targets high\-resolution image understanding by combining MLLM\-guided localized zoom—focusing on relevant details—with explicit self\-refinement, where the model critically re\-evaluates its initial assessment using the high\-resolution crop as evidence against the broader context\.
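Searching along an image tree can be illustrated with a best\-first descent over a quadtree of image regions. This is a simplified sketch under stated assumptions: the quadrant split, the stopping rule (`stop_at`, `min_size`), and the `score` callable are hypothetical, and in the cited systems the relevance signal would come from the MLLM rather than from a hand\-written function. It is not the ZoomEye or DyFo algorithm.

```python
import heapq
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def split(box: Box) -> List[Box]:
    """Split a region into four quadrants (the children in the image tree)."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]


def zoom_search(root: Box,
                score: Callable[[Box], float],
                min_size: int = 2,
                stop_at: float = 0.9) -> Tuple[Box, List[Box]]:
    """Repeatedly zoom into the highest-scoring unexplored region until the
    score is high enough or the region cannot be split further."""
    frontier = [(-score(root), root)]  # max-heap via negated scores
    visited = []
    while frontier:
        neg_score, box = heapq.heappop(frontier)
        visited.append(box)
        x0, y0, x1, y1 = box
        if -neg_score >= stop_at or min(x1 - x0, y1 - y0) <= min_size:
            return box, visited
        for child in split(box):
            heapq.heappush(frontier, (-score(child), child))
    return root, visited


TARGET = (6, 1)  # hypothetical location of the queried object


def toy_score(box: Box) -> float:
    """Stand-in for model confidence: higher for smaller boxes containing TARGET."""
    x0, y0, x1, y1 = box
    if not (x0 <= TARGET[0] < x1 and y0 <= TARGET[1] < y1):
        return 0.0
    return 1.0 / max(x1 - x0, y1 - y0)


found, path = zoom_search((0, 0, 8, 8), toy_score)
print(found)  # (6, 0, 8, 2)
print(path)   # [(0, 0, 8, 8), (4, 0, 8, 4), (6, 0, 8, 2)]
```

Because the frontier is a priority queue rather than a single path, the search can back off to a sibling region when a promising branch turns out to score poorly.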

#### 5\.2\.3 Detail enhancement via re\-encoding

A third line of research aims to extract fine\-grained details within a single dialogue turn by leveraging re\-encoding or re\-aligning mechanisms for enhanced visual comprehension\. For re\-encoding strategies, VPT \[[196](https://arxiv.org/html/2606.26196#bib.bib82)\] enables MLLMs to autonomously control visual perception by introducing special ⟨ctr⟩ tokens, which guide the re\-encoding of visual information for adaptive and task\-driven processing\. Argus \[[108](https://arxiv.org/html/2606.26196#bib.bib96)\] incorporates a grounding\-driven visual attention re\-engagement mechanism that leverages object\-centric grounding as visual chain\-of\-thought signals\. By conditioning on multimodal input instructions, the model dynamically grounds the most relevant regions of interest, enabling more effective goal\-directed visual attention during complex multimodal reasoning tasks\. COVLM \[[76](https://arxiv.org/html/2606.26196#bib.bib79)\] introduces a set of communication tokens that enable the LLM to dynamically interact with both the vision encoder and detection network during decoding\. By generating communication tokens alongside visual entities and relations, the LLM can request relevant regions from the detection network, which are then fed back for more context\-aware language generation\. LLaVA\-Aurora \[[4](https://arxiv.org/html/2606.26196#bib.bib234)\] advances this paradigm by distilling fine\-grained visual perception into MLLMs: the model learns to predict discrete Perception Token sequences \(generated by a VQVAE\) directly from images as intermediate reasoning steps, which are then incorporated into multi\-task training\.

As for re\-aligning methods, ClawMachine\[[105](https://arxiv.org/html/2606.26196#bib.bib95)\]unifies diverse fine\-grained referential tasks by directly annotating visual entities with corresponding token collectives, thus removing the need for additional syntactic structures and seamlessly integrating visual and language tokens within a unified auto\-regressive sequence\. LIRA\[[83](https://arxiv.org/html/2606.26196#bib.bib231)\]equips a comprehension\-based LMM with segmentation ability by employing a Semantic\-Enhanced Feature Extractor, which performs dynamic cropping to capture fine\-grained image features, and by aligning these representations with textual descriptions via an interleaved local visual coupling module\. VGR\[[155](https://arxiv.org/html/2606.26196#bib.bib159)\]introduces a selective feature replay module, which enables the MLLM to selectively attend to informative regions and retrieve image tokens from a memory pool, thereby enriching visual clues for reasoning\.

### 5\.3Code\-as\-Policy for Perception Tasks

A small number of works use programs generated by language models, executing the resulting code to accomplish perception tasks\.

VisProg\[[43](https://arxiv.org/html/2606.26196#bib.bib106)\]uses the in\-context learning ability of large language models to generate Python\-like modular programs\. Each line of the generated program invokes one of several off\-the\-shelf computer vision models, image processing subroutines, or Python functions to produce intermediate outputs that may be consumed by subsequent parts of the program\. Similarly, ViperGPT\[[143](https://arxiv.org/html/2606.26196#bib.bib107)\]introduces a simple framework for solving complex visual queries by integrating code\-generation models into vision with an API and the Python interpreter\. VisualSketchpad\[[48](https://arxiv.org/html/2606.26196#bib.bib232)\]introduces a framework that equips MLLMs with the ability to generate intermediate sketches via Python code, enabling collaboration with vision specialists for visual reasoning tasks\. More recently, PyVision\[[221](https://arxiv.org/html/2606.26196#bib.bib270)\]takes advantage of the advanced coding and multimodal understanding capabilities of modern MLLMs\[[1](https://arxiv.org/html/2606.26196#bib.bib271)\], generating Python code to build and run complex tools, which enables more general and flexible reasoning\.
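
To make the code\-as\-policy pattern concrete, here is a minimal sketch of executing a model\-generated program against a small tool registry\. Everything in it is hypothetical: the tool names \(`detect`, `count`, `leftmost`\), the toy annotated image, and the hard\-coded program string standing in for LLM output\. None of it reproduces the actual APIs of VisProg, ViperGPT, or the other systems above\.

```python
from typing import Any, Dict, List, Tuple

Box = Tuple[int, int, int, int]


def detect(image: Dict[str, Any], label: str) -> List[Box]:
    """Stub detector: looks up pre-annotated boxes instead of running a model."""
    return [obj["box"] for obj in image["objects"] if obj["label"] == label]


def count(boxes: List[Box]) -> int:
    return len(boxes)


def leftmost(boxes: List[Box]) -> Box:
    return min(boxes, key=lambda box: box[0])


TOOLS = {"detect": detect, "count": count, "leftmost": leftmost}


def run_program(program: str, image: Dict[str, Any]) -> Any:
    """Execute generated code with only the registered tools in scope.

    Emptying __builtins__ merely narrows the namespace; it is NOT a security
    sandbox. Untrusted code should run in an isolated process or container.
    """
    namespace = {"__builtins__": {}, "image": image, **TOOLS}
    exec(program, namespace)
    return namespace["answer"]


# Stand-in for what a language model might emit for "How many mugs are there?"
GENERATED_PROGRAM = """
mugs = detect(image, "mug")
answer = count(mugs)
"""

toy_image = {"objects": [
    {"label": "mug", "box": (40, 60, 90, 120)},
    {"label": "laptop", "box": (150, 30, 400, 220)},
    {"label": "mug", "box": (420, 80, 470, 140)},
]}

result = run_program(GENERATED_PROGRAM, toy_image)
```

Each line of the program consumes intermediate outputs of earlier lines, which is the property that makes the reasoning trace inspectable step by step\.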

For task\-specific methods, LogiCode\[[216](https://arxiv.org/html/2606.26196#bib.bib112)\]targets industrial logical anomaly detection by integrating LLMs for code generation and logical reasoning\. ReFocus\[[38](https://arxiv.org/html/2606.26196#bib.bib157)\]targets structured image understanding by enhancing MLLMs with visual reasoning as an intermediate step and providing an interface for generating visual artifacts using Python\-based image editing tools\.

## 6Stage IVA: Architecture\-Free Strategies for Perception Enhancement

![Refer to caption](https://arxiv.org/html/2606.26196v1/instruction_rl.jpg)Figure 6: Architecture\-Free Strategies for Perception Enhancement\. \(a\) Instruction\-Based Strategies: the LLM is guided by task\-specific instructions to directly generate structured outputs \(e\.g\., bounding box coordinates\), without relying on explicit architectural modules\. \(b\) RL\-Based Strategies: the LLM produces multiple reasoning chains and candidate answers, which are optimized via reward signals that reflect task correctness and output format; a KL penalty against a reference policy helps maintain response quality during learning\.

After Stage III, the perceptual capabilities of MLLMs have been greatly enhanced\. Subsequent developments diverge into two main branches\. One branch focuses on further improving performance for specific tasks without modifying the model architecture, primarily through advanced training strategies, most notably the adoption of reinforcement learning methods \(§[6](https://arxiv.org/html/2606.26196#S6)\)\. The other branch moves beyond task\-specific optimization, aiming instead to develop robust and versatile MLLMs with comprehensive perceptual abilities by integrating multiple approaches in a unified framework \(§[7](https://arxiv.org/html/2606.26196#S7)\)\.

In this stage, we focus on approaches that enhance perception capabilities without modifying the model architecture\. As illustrated in Fig\.[6](https://arxiv.org/html/2606.26196#S6.F6), we categorize these works into Instruction\-Based Strategies for Architecture\-Free Perception \(§[6\.1](https://arxiv.org/html/2606.26196#S6.SS1)\) and RL\-Based Strategies for Architecture\-Free Perception \(§[6\.2](https://arxiv.org/html/2606.26196#S6.SS2)\)\.

### 6\.1Instruction\-Based Strategies for Architecture\-Free Perception

In this subsection, we review early instruction\-based methods that enhance perception capabilities through instruction tuning, enabling models to achieve finer\-grained perceptual performance without modifying their underlying architecture\.

A number of early representative works laid the foundation for instruction\-based, architecture\-free perception in MLLMs\. LLaVA\[[92](https://arxiv.org/html/2606.26196#bib.bib113)\]serves as a pioneering work, utilizing language\-only GPT\-4 to generate multimodal instruction\-following data\. This enables the subsequent training of LLaVA, a multimodal model adept at following human intent for diverse visual tasks\. Kosmos\-2\[[116](https://arxiv.org/html/2606.26196#bib.bib118)\]further enhances multimodal grounding by introducing a novel training data format: it encodes bounding box coordinates as location tokens, which are appended to the corresponding text spans\. This design acts as a hyperlink in the training data, seamlessly linking image regions to their textual descriptions and facilitating more effective vision\-language alignment during model training\. Similarly, Shikra\[[19](https://arxiv.org/html/2606.26196#bib.bib115)\]and Pink\[[182](https://arxiv.org/html/2606.26196#bib.bib218)\]support spatial coordinate input and output by directly incorporating bounding box information into the question\-answering process, enabling natural language interaction with spatially grounded content\. Apart from that, VPD\[[49](https://arxiv.org/html/2606.26196#bib.bib233)\]synthesizes training data for VLMs by generating programs that leverage external specialist models and tools\. These LLM\-generated programs are then used within an instruction tuning framework to distill cross\-modal reasoning abilities and expert skills into multimodal models\.
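
The location\-token idea can be sketched as follows: box corners are quantized onto a coarse grid, and the resulting discrete tokens are attached to the text span they ground\. This is only an illustration under stated assumptions; the grid size of 32, the `<loc_k>` token names, and the `<p>`/`<box>` wrappers are choices made for the example and should not be read as the exact vocabulary of Kosmos\-2, Shikra, or Pink\.

```python
from typing import Tuple


def box_to_location_tokens(box: Tuple[float, float, float, float],
                           image_size: Tuple[int, int], bins: int = 32) -> str:
    """Quantize the top-left and bottom-right corners into grid-cell tokens."""
    x0, y0, x1, y1 = box
    width, height = image_size

    def cell_index(x: float, y: float) -> int:
        # Clamp so that coordinates on the right/bottom edge stay in range.
        col = min(int(x / width * bins), bins - 1)
        row = min(int(y / height * bins), bins - 1)
        return row * bins + col

    return f"<loc_{cell_index(x0, y0)}><loc_{cell_index(x1, y1)}>"


def link_span_to_box(text: str, span: str,
                     box: Tuple[float, float, float, float],
                     image_size: Tuple[int, int], bins: int = 32) -> str:
    """Append location tokens to the first occurrence of `span` in `text`."""
    if span not in text:
        raise ValueError(f"span {span!r} not found in text")
    tokens = box_to_location_tokens(box, image_size, bins)
    return text.replace(span, f"<p>{span}</p><box>{tokens}</box>", 1)


example = link_span_to_box("a dog on the grass", "a dog",
                           box=(0, 0, 320, 240), image_size=(640, 480))
```

Because the tokens live in the ordinary vocabulary, the same next\-token objective covers both the caption and its grounding, which is why no architectural change is needed\.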

Several works further improve task\-specific performance through tailored instruction tuning strategies\. VTimeLLM\[[51](https://arxiv.org/html/2606.26196#bib.bib152)\]targets fine\-grained video moment understanding and reasoning by introducing a three\-stage, temporally aware training framework\. GroundingGPT\[[84](https://arxiv.org/html/2606.26196#bib.bib197)\]addresses multi\-modal grounding with a three\-stage, coarse\-to\-fine training strategy, supported by stage\-specific datasets to effectively enhance grounding capabilities\. LocVLM\[[125](https://arxiv.org/html/2606.26196#bib.bib114)\]focuses on spatial reasoning by encoding image coordinates into language and proposing three instruction fine\-tuning objectives, thereby equipping MLLMs with the ability to reason about spatial composition through embedded image coordinates in text\. Migician\[[80](https://arxiv.org/html/2606.26196#bib.bib266)\]focuses on multi\-image grounding by constructing a large\-scale dataset for training across six representative tasks, and then fine\-tuning the model on free\-form instruction data\.

### 6\.2RL\-Based Strategies for Architecture\-Free Perception

Recently, following the introduction of OpenAI\-O1\[[56](https://arxiv.org/html/2606.26196#bib.bib236)\]and DeepSeek\-R1\[[40](https://arxiv.org/html/2606.26196#bib.bib235)\], a series of studies have explored the application of reinforcement learning \(RL\) in the vision\-language domain, leading to notable progress in architecture\-free perception enhancement\. We elaborate in R1\-Style Reinforcement Learning \(§[6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1)\) and Other Reinforcement Learning Methods \(§[6\.2\.2](https://arxiv.org/html/2606.26196#S6.SS2.SSS2)\)\.

#### 6\.2\.1R1\-Style Reinforcement Learning

Following the success of DeepSeek’s work\[[133](https://arxiv.org/html/2606.26196#bib.bib237),[40](https://arxiv.org/html/2606.26196#bib.bib235)\], several studies have introduced Group Relative Policy Optimization \(GRPO\)\[[133](https://arxiv.org/html/2606.26196#bib.bib237)\]into visual tasks, employing task\-specific, rule\-based reward design to provide stable and interpretable reward signals\. This approach has notably improved the performance of MLLMs on specific visual tasks\.

Among these, VLM\-R1\[[135](https://arxiv.org/html/2606.26196#bib.bib120)\]and Seg\-Zero\[[98](https://arxiv.org/html/2606.26196#bib.bib128)\]stand out as representative works, focusing respectively on detection and segmentation tasks in computer vision\. VLM\-R1\[[135](https://arxiv.org/html/2606.26196#bib.bib120)\]introduces R1\-style reinforcement learning into the domains of referring expression comprehension and open\-vocabulary object detection, building on open\-r1\[[34](https://arxiv.org/html/2606.26196#bib.bib238)\]\. By incorporating dedicated accuracy and format rewards, VLM\-R1 demonstrates substantial gains in both task performance and out\-of\-domain generalization\. Seg\-Zero\[[98](https://arxiv.org/html/2606.26196#bib.bib128)\]targets reasoning segmentation and adopts a decoupled architecture consisting of a reasoning model and a segmentation model\. By employing a purely reinforcement learning algorithm, Seg\-Zero exhibits emergent reasoning abilities\.
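
The combination of rule\-based rewards and group\-relative normalization can be sketched in a few lines\. The sketch below is a simplified illustration, not the VLM\-R1 or Seg\-Zero implementation: the `<think>`/`<answer>` response template, the IoU threshold of 0\.5, the equal weighting of the two rewards, and the use of the population standard deviation are assumptions, and the policy\-gradient update with its KL penalty is omitted\.

```python
import re
import statistics
from typing import List, Sequence

RESPONSE_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>\s*\[(.*?)\]\s*</answer>$", re.DOTALL)


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def rule_based_reward(response: str, gt_box: Sequence[float]) -> float:
    """Format reward (template followed) plus accuracy reward (IoU >= 0.5)."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0
    reward = 1.0  # format reward
    try:
        box = [float(v) for v in match.group(1).split(",")]
    except ValueError:
        return reward
    if len(box) == 4 and iou(box, gt_box) >= 0.5:
        reward += 1.0  # accuracy reward
    return reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: each reward standardized within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


gt = (10, 10, 50, 50)
group = [
    "<think>The cat is on the left.</think><answer>[10, 10, 50, 50]</answer>",
    "<think>Maybe on the right.</think><answer>[60, 60, 90, 90]</answer>",
    "The box is [10, 10, 50, 50]",  # correct box, but template not followed
]
rewards = [rule_based_reward(r, gt) for r in group]
advantages = group_relative_advantages(rewards)
```

Since the advantage is computed relative to sibling samples for the same prompt, no learned value model is required, which is part of what makes this recipe attractive for perception tasks with checkable outputs\.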

Building on these seeds, a line of work extends verifiable\-reward RL to a wider set of static vision\-language tasks\. R1\-V\[[20](https://arxiv.org/html/2606.26196#bib.bib239)\]addresses object\-counting using GRPO, while Visual\-RFT\[[100](https://arxiv.org/html/2606.26196#bib.bib122)\]and DIP\-R1\[[115](https://arxiv.org/html/2606.26196#bib.bib136)\]apply GRPO\-based reinforcement learning to enhance visual perception and grounding capabilities\. UniVG\-R1\[[2](https://arxiv.org/html/2606.26196#bib.bib129)\]generalizes this approach to universal visual grounding while Perception\-R1\[[193](https://arxiv.org/html/2606.26196#bib.bib123)\]extends it to several representative downstream perception tasks\. Vision\-R1\[[205](https://arxiv.org/html/2606.26196#bib.bib121)\]proposes a vision criterion\-driven reward and progressive rule refinement for improved object localization\. Ground\-R1\[[9](https://arxiv.org/html/2606.26196#bib.bib131)\]targets grounded visual reasoning and decouples evidence region generation from answer synthesis, enabling interpretable reasoning via format\-constrained grounding and reward\-driven response generation\. Rex\-Thinker\[[60](https://arxiv.org/html/2606.26196#bib.bib140)\]reformulates grounded object referring as a planning–action–summarization problem to achieve grounded and interpretable predictions\. The model first detects candidate objects and then performs step\-by\-step verification against the referring expression through a structured planning, action, and summarization reasoning process\. Zhang et al\.\[[207](https://arxiv.org/html/2606.26196#bib.bib172)\]further extend this reasoning ability to multi\-image grounding tasks, while Curr\-ReFT\[[29](https://arxiv.org/html/2606.26196#bib.bib127)\]extends both reasoning and OOD generalization capabilities to small\-scale MLLMs\.
GRIT\[[36](https://arxiv.org/html/2606.26196#bib.bib154)\]enables models to generate visually grounded reasoning chains by interleaving natural language with explicit bounding box coordinates that reference relevant image regions, utilizing a specially designed GRPO algorithm\. Similarly, SATORI\[[134](https://arxiv.org/html/2606.26196#bib.bib164)\]decomposes VQA into three verifiable stages—global image captioning, region localization, and answer prediction—to provide step\-by\-step explicit reward signals\. Griffon\-R\[[204](https://arxiv.org/html/2606.26196#bib.bib144)\]also adopts a staged approach, decomposing the reasoning process into understanding, thinking, and answering\. Likewise, TreeVGR\[[154](https://arxiv.org/html/2606.26196#bib.bib258)\]introduces explicit bounding boxes into the reasoning chain and assigns a Traceable Evidence Reward based on box IoU throughout the process\. DeepPerception\[[106](https://arxiv.org/html/2606.26196#bib.bib126)\]demonstrates that integrating cognitive visual perception—enhanced by external knowledge and reasoning—significantly boosts fine\-grained perception in MLLMs\. Relation\-R1\[[77](https://arxiv.org/html/2606.26196#bib.bib263)\]extends these verifiable\-reward GRPO methods to N\-ary relation detection within a unified relation comprehension framework\.

Another line of work focuses on segmentation\-centric advances and evidence\-traceable rewards\. On the segmentation front, PixelThink\[[160](https://arxiv.org/html/2606.26196#bib.bib133)\]introduces an efficiency\-aware reasoning scheme built on Seg\-Zero\[[98](https://arxiv.org/html/2606.26196#bib.bib128)\]that dynamically regulates reasoning length based on task difficulty and model uncertainty\. Seg\-R1\[[192](https://arxiv.org/html/2606.26196#bib.bib170)\]and SAM\-R1\[[52](https://arxiv.org/html/2606.26196#bib.bib135)\]incorporate GRPO into generating point and bounding box prompts in the next\-token fashion, which are then used to guide SAM2\[[127](https://arxiv.org/html/2606.26196#bib.bib262)\]in producing segmentation masks\. VisionReasoner\[[99](https://arxiv.org/html/2606.26196#bib.bib138)\]introduces a unified framework for reasoning across diverse visual perception tasks, leveraging tailored reward functions and training strategies to enable comprehensive multi\-task visual reasoning within a single model\. ALToLLM\[[158](https://arxiv.org/html/2606.26196#bib.bib173)\]incorporates an adaptive\-length mask tokenizer into the MLLM, allowing adaptive mask token generation for object segmentation tasks, with GRPO applied during training\.

For anomaly detection, Anomaly\-R1\[[14](https://arxiv.org/html/2606.26196#bib.bib125)\]and OmniAD\[[220](https://arxiv.org/html/2606.26196#bib.bib139)\]leverage GRPO to enhance industrial anomaly detection performance, while LAD\-Reasoner\[[78](https://arxiv.org/html/2606.26196#bib.bib143)\]extends conventional anomaly detection by incorporating logical reasoning with a human\-interpretable reasoning process\.

In the video domain and multi\-image reasoning, DeepVideo\-R1\[[114](https://arxiv.org/html/2606.26196#bib.bib147)\]employs regressive GRPO and difficulty\-aware data augmentation to address complex video reasoning challenges\. STAR\-R1\[[85](https://arxiv.org/html/2606.26196#bib.bib153)\]adopts a single\-stage pure RL paradigm for transformation\-driven reasoning, MiCo\[[22](https://arxiv.org/html/2606.26196#bib.bib171)\]leverages inherent image constraints to incentivize multi\-image reasoning in MLLMs, and VideoCap\-R1\[[110](https://arxiv.org/html/2606.26196#bib.bib196)\]utilizes structured thinking and dual reward mechanisms for nuanced video captioning\. VersaVid\-R1\[[23](https://arxiv.org/html/2606.26196#bib.bib141)\]further extends the capabilities of MLLMs to multiple\-choice and open\-ended question answering tasks within the Reason\-Then\-Respond paradigm\. AvatarShield\[[181](https://arxiv.org/html/2606.26196#bib.bib243)\]targets Human\-Centric Video Forgery Detection and integrates GRPO into the training\. VAU\-R1\[[232](https://arxiv.org/html/2606.26196#bib.bib137)\]further demonstrates the effectiveness of R1\-style reinforcement learning\.

#### 6\.2\.2Other Reinforcement Learning Methods

Beyond R1\-style frameworks, other works explore alternative reinforcement learning approaches to optimize model performance on specific vision\-language tasks\. POPEN\[[231](https://arxiv.org/html/2606.26196#bib.bib134)\]targets reasoning segmentation by introducing a novel preference\-based optimization approach and leveraging ensemble methods with task\-specific designs\. PerPO\[[236](https://arxiv.org/html/2606.26196#bib.bib119)\]introduces Perceptual Preference Optimization to align with the human perception process for discriminative tasks\. VideoChat\-TPO\[[184](https://arxiv.org/html/2606.26196#bib.bib145)\]proposes Task Preference Optimization and enhances fine\-grained spatio\-temporal perception in videos by introducing task\-specific heads\. VisReP\[[68](https://arxiv.org/html/2606.26196#bib.bib240)\]treats the LLM as a policy and applies reinforced self\-training, utilizing feedback from visual program execution to progressively improve the model’s visual program synthesis abilities\. Insight\-V\[[32](https://arxiv.org/html/2606.26196#bib.bib242)\]builds a multi\-agent system that decomposes visual reasoning tasks into reasoning and summarization with a two\-stage training pipeline\. VisRL\[[25](https://arxiv.org/html/2606.26196#bib.bib124)\]applies reinforcement learning to intention\-driven visual perception tasks by leveraging self\-generated data and self\-assigned rewards\. The model is iteratively updated using step\-level Direct Preference Optimization \(DPO\)\[[123](https://arxiv.org/html/2606.26196#bib.bib241)\], establishing a learning process that is much closer to human\-like visual understanding\. SegAgent\[[233](https://arxiv.org/html/2606.26196#bib.bib174)\]reformulates segmentation as a multi\-step decision process, imitating human annotation trajectories with agent\-based models and fine\-tuning MLLMs on these paths\.
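
Several of the methods above build on preference optimization\. For reference, the standard DPO objective for a single preference pair can be written as a small function\. This is a sketch of the generic loss only; the step\-level variant used by VisRL and the perceptual or task\-specific variants of PerPO and VideoChat\-TPO are not reproduced here, and the log\-probabilities and `beta` value below are made\-up inputs\.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin), where the margin compares how much more the
    policy prefers the chosen response over the rejected one than the frozen
    reference model does."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = -beta * margin
    # Numerically stable softplus(x) = log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raised the chosen response and lowered the rejected one.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
# Policy moved in the wrong direction.
degraded = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

The loss needs only pairs of preferred and dispreferred outputs, so it can be driven by self\-generated data as long as some signal ranks the candidates\.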

## 7Stage IVB: Towards Unified Perception — A Perspective on the Convergence of Instruction, Adaptivity, and RL

![Refer to caption](https://arxiv.org/html/2606.26196v1/case_study.jpg)Figure 7: Case study of OpenAI o3’s thinking with images, reaching the correct answer after 42 seconds of reasoning\. The question is from V\* Bench\.

Stage IVB builds on all previous stages and takes a further step by integrating more complex capabilities and control mechanisms\. Rather than adhering to the external tool scheduling view of Stage III, it instead aims for a more unified yet flexible perception–reasoning paradigm, in which diverse perceptual operations and tool calls can be composed and reused across tasks\.

We organize this chapter as follows: we first provide a brief review and analysis of the evolution of overall structural paradigms, from architectural modifications to collaborative paradigm shifts, while also clarifying the limitations inherent in each development stage \(see §[7\.1](https://arxiv.org/html/2606.26196#S7.SS1)\)\. Next, we examine recent research directions that build upon these foundations by integrating instruction tuning, adaptivity, and reinforcement learning, which have driven notable progress in the field \(see §[7\.2](https://arxiv.org/html/2606.26196#S7.SS2)\)\. Finally, we look ahead to the future, discussing both the outstanding challenges and the promising opportunities for the next generation of multimodal perception agents \(see §[7\.3](https://arxiv.org/html/2606.26196#S7.SS3)\)\.

### 7\.1Why Unification Matters: The Need for a Converged Paradigm

In this subsection, we begin with a brief recap of the key advances and limitations at each stage \(§[7\.1\.1](https://arxiv.org/html/2606.26196#S7.SS1.SSS1)\), and then empirically propose three essential criteria that are critical for achieving true unification in vision\-language perception \(§[7\.1\.2](https://arxiv.org/html/2606.26196#S7.SS1.SSS2)\)\.

#### 7\.1\.1A Brief Retrospect

A retrospective analysis of Stage I \(encoder\-centric\) and Stage II \(decoder\-centric\) reveals a clear progression in perceptual granularity, advancing from image\-level to region\-level and ultimately to pixel\-level understanding\. However, these advances are still fundamentally limited by their reliance on one\-shot, static encoding of visual information\. Stage III introduces a shift toward dynamic and adaptive perception, yet the underlying scheduling strategies are largely heuristic or reliant on external tools, lacking a true closed loop with the model’s internal representations and optimization processes\. Stage IVA further pushes the boundaries by leveraging reinforcement learning and verifiable rewards to improve performance without altering model architecture\. Nevertheless, most of these approaches are still confined to isolated or fixed task types—such as detection, segmentation, or counting—and fall short of achieving a unified perception–reasoning cycle in open\-ended scenarios\.

#### 7\.1\.2The Imperative for Unification

Given these limitations, we argue that the next phase must move beyond structural modifications and collaborative designs toward a genuinely unified paradigm for perception\. Such a paradigm should at least satisfy the following criteria:

- 1\.Unified intermediate representation: Capable of naturally bridging text, regions, and pixels;
- 2\.Unified task interface: Replacing task\-specific heads with structured outputs, programmatic actions, or verifiable rewards that reinterpret each task as an instance of a shared perception–reasoning cycle;
- 3\.Unified resource\-awareness: Incorporating cost\-aware perception–reasoning scheduling\.

As a natural next step, we turn to the emergent signs of convergence already present in the literature, where instruction tuning, adaptive perception, and reinforcement learning are being integrated into closed\-loop frameworks\. These early explorations point the way toward truly unified paradigms for vision\-language perception\.

### 7\.2Signs of Convergence: Early Explorations in Unification

Recently, a handful of pioneering studies have emerged, with the OpenAI o3\[[113](https://arxiv.org/html/2606.26196#bib.bib251)\]model representing one of the most notable advances\. Although the technical details and training pipeline of o3 have not been released, both technical blog analyses and our empirical observations suggest that o3 tackles complex perception\-driven tasks by combining planning, dynamic tool selection and usage \(e\.g\., zoom\-in, crop, contrast adjustment\), Python code execution, and internet search capabilities\. As shown in Fig\.[7](https://arxiv.org/html/2606.26196#S7.F7), o3 provides a compelling case of unified vision\-language perception integrating reasoning, external tool orchestration, and adaptive processing\.

#### 7\.2\.1Zoom\-In and Attention\-Focused Strategies

Several studies have attempted to replicate aspects of o3, with a subset specifically centering on zoom\-in and adaptive attention mechanisms\. Notably, Active\-o3\[[234](https://arxiv.org/html/2606.26196#bib.bib130)\]adopts a GRPO\-based reinforcement learning framework that empowers MLLMs with active perception\. Through a dual\-policy design for sensing and action, and by optimizing with task\-aware and exploratory rewards, Active\-o3 enables MLLMs to reason about where to attend and how to interact with visual input more effectively, closely mirroring o3’s zoom\-in search strategy\. Similarly, Kumar et al\.\[[70](https://arxiv.org/html/2606.26196#bib.bib146)\]extend this approach to resource\-constrained settings, focusing on small\-scale MLLMs with limited parameter counts\. DeepEyes\[[226](https://arxiv.org/html/2606.26196#bib.bib88)\]empowers MLLMs with native “thinking with images” capabilities through end\-to\-end reinforcement learning, seamlessly blending visual inputs with textual reasoning and forming iMCoT with a built\-in zoom\-in tool—all without requiring cold\-start supervised fine\-tuning or relying on separate specialized models as external tools\. CoF\[[214](https://arxiv.org/html/2606.26196#bib.bib160)\]utilizes a Chain\-of\-Focus method, enabling MLLMs to adaptively focus and zoom in on key image regions based on visual cues and given questions, through a two\-stage training pipeline\. PixelReasoner\[[141](https://arxiv.org/html/2606.26196#bib.bib182)\]employs a two\-stage approach—using instruction tuning to learn visual operations, then reinforcement learning with a curiosity\-driven reward to balance pixel\-space and textual reasoning\. 
In contrast to the above methods that explicitly inject visual cues, Look\-Back\[[187](https://arxiv.org/html/2606.26196#bib.bib265)\]enables models to autonomously decide when and how to refocus on visual inputs by emitting special ⟨back⟩ tokens, allowing them to re\-evaluate and correct their reasoning against the image\.
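
The interleaved loop shared by these zoom\-in methods, in which the model alternates between emitting text, requesting a crop, and receiving the cropped view as new input, can be sketched as below\. The `<tool_call>` and `<answer>` tags, the JSON argument format, the toy grid image, and the scripted stand\-in for the policy are all assumptions made for illustration; real systems use an MLLM as the policy and train it with the rewards described above\.

```python
import json
import re
from typing import Callable, List, Optional, Sequence, Tuple

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

Image = List[List[int]]          # toy image: rows of pixel values
Context = List[Tuple[str, object]]


def crop(image: Image, box: Sequence[int]) -> Image:
    """Zoom-in tool: return the sub-image for (x0, y0, x1, y1), clamped to bounds."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [row[x0:x1] for row in image[y0:y1]]


def run_episode(model: Callable[[Context], str], image: Image,
                question: str, max_turns: int = 4) -> Optional[str]:
    """Alternate model turns and tool observations until an answer appears."""
    context: Context = [("image", image), ("text", question)]
    for _ in range(max_turns):
        output = model(context)
        context.append(("text", output))
        answer = ANSWER.search(output)
        if answer:
            return answer.group(1).strip()
        call = TOOL_CALL.search(output)
        if call:
            box = json.loads(call.group(1))["bbox"]
            context.append(("image", crop(image, box)))  # observation fed back
    return None  # turn budget exhausted


def scripted_model(context: Context) -> str:
    """Hypothetical policy: zoom once, then read the value from the crop."""
    images = [item for kind, item in context if kind == "image"]
    if len(images) == 1:
        return 'Too small to read. <tool_call>{"bbox": [2, 1, 3, 2]}</tool_call>'
    return f"<answer>{images[-1][0][0]}</answer>"


toy_image = [[0, 0, 0, 0],
             [0, 0, 9, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
result = run_episode(scripted_model, toy_image, "Which digit is hidden in the image?")
```

The `max_turns` budget is where cost\-awareness enters: each extra turn adds another set of visual tokens to the context\.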

#### 7\.2\.2Adaptive Tool\-Use Frameworks

In addition to zoom\-in strategies inspired by o3, another group of works extends the o3 paradigm by focusing on adaptive tool selection and orchestration\. OpenThinkIMG\[[142](https://arxiv.org/html/2606.26196#bib.bib163)\]introduces an end\-to\-end tool\-augmented MLLM framework, anchored by V\-ToolRL—a reinforcement learning approach that enables adaptive and efficient tool selection via direct interaction and reward feedback, surpassing static trajectory imitation\. VisTA\[[53](https://arxiv.org/html/2606.26196#bib.bib179)\]introduces a reinforcement learning framework that enables visual agents to autonomously select effective external tools, learning adaptive tool\-selection strategies without explicit supervision\. VRAG\-RL\[[159](https://arxiv.org/html/2606.26196#bib.bib162)\]introduces a framework for training MLLMs to reason about, retrieve, and understand visually rich information via a visual perception action space—including selection, cropping, and scaling of regions of interest—and a comprehensive reward mechanism\.

#### 7\.2\.3Advanced Reasoning and Refinement Techniques

Beyond zoom\-in and tool\-use frameworks, several studies introduce additional mechanisms and innovations\. VisionThink\[[185](https://arxiv.org/html/2606.26196#bib.bib40)\]leverages the LLM\-as\-Judge strategy and a tailored reward function for General VQA that enhances efficiency and performance, by initially processing a downsampled image and using reinforcement learning to selectively upscale to higher resolution when needed\. GThinker\[[203](https://arxiv.org/html/2606.26196#bib.bib253)\]proposes a cue\-guided rethinking mechanism, enabling the model to move beyond rigid templates\. This design supports flexible, question\-driven reasoning and allows for robust handling of imperfect visual cues through reflective and knowledge\-grounded thinking\. Open\-Vision\-Reasoner\[[172](https://arxiv.org/html/2606.26196#bib.bib255)\]extends prior linguistic cognitive behaviors into the visual domain, introducing visual reflection, divide\-and\-conquer, visual verification, and goal\-driven visual tracing\. The framework adopts a lightweight Proximal Policy Optimization \(PPO\)\[[132](https://arxiv.org/html/2606.26196#bib.bib256)\]algorithm with Generalized Advantage Estimation \(GAE\)\[[131](https://arxiv.org/html/2606.26196#bib.bib257)\]to enable effective and efficient visual reasoning\.
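
For readers less familiar with the advantage estimator mentioned above, Generalized Advantage Estimation can be computed with a single backward pass over a trajectory\. The sketch below shows the standard recursion only; the reward and value numbers are made up, and nothing here is specific to how Open\-Vision\-Reasoner configures PPO\.

```python
from typing import List, Sequence


def generalized_advantage_estimation(rewards: Sequence[float],
                                     values: Sequence[float],
                                     gamma: float = 0.99,
                                     lam: float = 0.95) -> List[float]:
    """Compute GAE advantages.

    `values` must contain one more entry than `rewards`: the value estimate of
    the state after the final step (0.0 if the episode terminated).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Sparse terminal reward, as in verifiable-reward settings.
advantages = generalized_advantage_estimation(
    rewards=[0.0, 0.0, 1.0], values=[0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
```

With `lam=1.0` and zero value estimates the result reduces to discounted returns, while smaller `lam` leans more heavily on the value function and lowers variance\.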

### 7\.3Towards a New Generation of Perception\-Centric Agents: Opportunities and Challenges

Despite significant advancements in the multimodal perception capabilities of MLLMs, the field is simultaneously confronting unprecedented challenges\. In this section, we first address three critical scientific issues that have emerged\.

The Limitations of Conventional Evaluation\. Traditional datasets are increasingly struggling to meet current research demands\. On the one hand, proxy tasks such as static detection and segmentation are often *limited by the precision of human labeling*, making it difficult to accurately measure the nuanced performance of MLLMs\[[8](https://arxiv.org/html/2606.26196#bib.bib25)\]\. On the other hand, these datasets frequently *remain decoupled from practical downstream applications*\[[121](https://arxiv.org/html/2606.26196#bib.bib21),[59](https://arxiv.org/html/2606.26196#bib.bib22)\]\. Furthermore, the performance of large models on general\-purpose benchmarks has begun to *reach a point of saturation*\[[74](https://arxiv.org/html/2606.26196#bib.bib20)\]\. This naturally leads to our first inquiry: *How can we precisely define and evaluate the performance of MLLMs? Are traditional proxy tasks still valid, and to what extent do they truly represent real\-world utility?*

The “Generalization vs\. Fine\-tuning” Dilemma\. With the rapid iteration of foundational multimodal models, whose major versions are updated approximately every year, the benefits of fine\-tuning on medium\-scale computer vision datasets \(such as COCO\[[88](https://arxiv.org/html/2606.26196#bib.bib202)\]\) have become remarkably limited\. In many cases, fine\-tuning can *severely compromise the generalization capabilities inherited from pre\-training*\[[174](https://arxiv.org/html/2606.26196#bib.bib23),[55](https://arxiv.org/html/2606.26196#bib.bib24),[202](https://arxiv.org/html/2606.26196#bib.bib26)\]\. Frequently, researchers find that substantial *performance gains can be achieved simply by waiting for the next iteration of a foundation model rather than through task\-specific optimization*\. As foundational capabilities improve, many legacy problems are resolved “automatically,” posing a significant challenge for researchers in identifying meaningful long\-term research directions\.

Redefining Multimodality and Representation\. From a broader perspective, the very definition of “Multimodal” remains a point of contention\. Current research predominantly categorizes data by modality \(vision, language, audio, etc\.\), which fosters a fragmented perspective\. As emphasized in the preceding chapters of this survey, capabilities acquired solely from language\-modality pre\-training do not necessarily transfer seamlessly to other modalities\[[197](https://arxiv.org/html/2606.26196#bib.bib28)\]\. While current models achieve impressive results by scaling data \(primarily through language pre\-training supplemented by large\-scale multimodal alignment\), this scaling paradigm will inevitably encounter a bottleneck\. Whether a truly unified representation space exists across all modalities remains a subject requiring deeper exploration\[[164](https://arxiv.org/html/2606.26196#bib.bib27),[54](https://arxiv.org/html/2606.26196#bib.bib29)\]\.

Furthermore, in the short term, MLLMs continue to face specific technical bottlenecks, which we will discuss in the following subsections\.

#### 7\.3\.1Dependence on High\-Quality Data Curation

Most current efforts have focused on narrow, task\-specific settings, and models remain far from achieving truly “general” perception\. In practice, endowing models with more complex, general\-purpose perceptual abilities requires integrating heterogeneous data sources\[[226](https://arxiv.org/html/2606.26196#bib.bib88),[203](https://arxiv.org/html/2606.26196#bib.bib253),[172](https://arxiv.org/html/2606.26196#bib.bib255)\]\. Moreover, when models incorporate additional operations—such as active zooming or multi\-step interaction—these data demands become even more stringent\[[234](https://arxiv.org/html/2606.26196#bib.bib130),[214](https://arxiv.org/html/2606.26196#bib.bib160),[141](https://arxiv.org/html/2606.26196#bib.bib182)\], often rendering the process prohibitively expensive and resource\-intensive\. Future work may therefore explore more cost\-effective strategies for scaling datasets\[[31](https://arxiv.org/html/2606.26196#bib.bib274)\]or investigate self\-play–style paradigms\[[140](https://arxiv.org/html/2606.26196#bib.bib273),[179](https://arxiv.org/html/2606.26196#bib.bib272)\]to reduce reliance on large\-scale manual annotation\.

#### 7\.3\.2Lack of General and Fine\-Grained Rewards

As reinforcement‐learning–based approaches—particularly GRPO—have become more prevalent, model performance increasingly depends on carefully crafted reward functions\. To date, most studies employ task‐specific reward schemes that, while well‐suited to their individual objectives, hinder MLLMs from offering a unified solution across diverse visual tasks\. Moreover, GRPO itself suffers from intrinsic drawbacks, including the vanishing advantage problem in long‐horizon reasoning and stability issues\[[195](https://arxiv.org/html/2606.26196#bib.bib259),[224](https://arxiv.org/html/2606.26196#bib.bib275)\]\. Besides, in contrast to the binary correctness metrics commonly used in language reasoning, perception tasks demand dense, fine‐grained supervision\. Looking ahead, we advocate for the development of reward frameworks that are \(1\) task‐agnostic\[[167](https://arxiv.org/html/2606.26196#bib.bib277)\], enabling broad applicability; and \(2\) hierarchically fine‐grained\[[162](https://arxiv.org/html/2606.26196#bib.bib276)\], providing supervision at multiple levels of abstraction\. Such designs will be critical for advancing robust, unified vision–language perception in open‐world settings\.
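
One commonly described way in which the group\-relative signal can vanish is easy to reproduce: when every sampled response in a group receives the same reward, the standardized advantages are all zero and the group contributes no gradient\. The toy numbers below are invented for illustration, and this sketch covers only that one failure mode, not the broader stability issues discussed in the cited works\. It also hints at why denser perception rewards such as IoU can help\.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hard localization prompt: all four samples miss the IoU >= 0.5 threshold,
# so a binary correctness reward is identical across the group.
binary_rewards = [0.0, 0.0, 0.0, 0.0]
binary_advantages = group_relative_advantages(binary_rewards)

# The same four samples scored by raw IoU still differ from one another,
# so the best attempt is distinguishable from the worst.
iou_rewards = [0.12, 0.31, 0.05, 0.44]
dense_advantages = group_relative_advantages(iou_rewards)
best_sample = dense_advantages.index(max(dense_advantages))
```

Under these assumptions the binary scheme yields no learning signal for the prompt, whereas the dense scheme still ranks the samples\.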

#### 7\.3\.3 High Computation Costs

Recent zoom\-in and tool\-scheduling approaches have delivered impressive accuracy gains, but at the cost of considerable compute\. For example, DeepEyes\[[226](https://arxiv.org/html/2606.26196#bib.bib88)\] and GThinker\[[203](https://arxiv.org/html/2606.26196#bib.bib253)\] both train 7B\-parameter models on 4 nodes × 8 GPUs \(32 GPUs in total\), while still conceding limitations: the authors of DeepEyes note that they “only utilized Qwen2\.5\-VL\-7B, whose fundamental capability is constrained by its small size,” and that their current pipeline “supports only the `crop` operation, whereas real scenarios demand a richer toolset such as web search or drawing auxiliary lines\.” Similar resource hurdles, along with the inability to integrate a broader repertoire of visual tools, are echoed across the literature\. Closing this efficiency gap will require the development of more cost\-effective training methods and more efficient model architectures\[[224](https://arxiv.org/html/2606.26196#bib.bib275),[15](https://arxiv.org/html/2606.26196#bib.bib278)\]\.

## 8 Conclusion

In this survey, we present the first systematic review of unified vision\-language perception in multimodal large language models\. By defining perception as an intrinsically integrated vision\-language capability, we systematically trace the evolution of perception paradigms in MLLMs across five key stages, from encoder\-centric and decoder\-centric optimizations to dynamic, adaptive processing and architecture\-free strategies\. We further discuss the convergence of instruction tuning, adaptivity, and reinforcement learning, highlighting their emerging role in shaping the next generation of perception\-centric agents\. Despite rapid progress, significant challenges remain, including \(i\) heavy reliance on curated multi\-task data, \(ii\) absence of general, fine\-grained rewards, and \(iii\) prohibitive computational costs without effective cost\-aware scheduling\. We hope that our taxonomy and analysis will provide the community with a clearer roadmap, inspire new directions, and accelerate the development of MLLMs toward truly general, unified multimodal intelligence\.

## CRediT authorship contribution statement

Haoxiang Sun: Conceptualization, Methodology, Formal analysis, Investigation, Writing \- Original Draft, Visualization; Tao Wang: Conceptualization, Methodology, Validation, Formal analysis, Writing \- Review & Editing, Funding acquisition, Project administration, Supervision; Li Yuan: Writing \- Review & Editing, Supervision; Jian Zhao: Writing \- Review & Editing, Supervision; Jiancheng Lv: Resources, Writing \- Review & Editing, Funding acquisition\.

## Statement on AI Writing Assistance

During the preparation of this work, the authors used ChatGPT to improve the clarity and correct grammatical errors in the manuscript\. After using this tool, the authors carefully reviewed and edited the content as needed and take full responsibility for the content of the publication\. Additionally, ChatGPT\-4o was employed to generate visualizations for demonstration purposes\.

## Declaration of competing interest

The authors declare that there are no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper\.

## Data availability

This article did not generate new datasets or code\. All datasets and tools discussed are publicly available; additional information can be obtained from the corresponding author upon reasonable request\.

## Acknowledgment

This work is supported by the National Science Foundation of China under Grant 62506249, the National Major Scientific Instruments and Equipments Development Project of National Natural Science Foundation of China under Grant 62427820, the Natural Science Foundation of Sichuan under Grant 2024NSFSC1462, and the Fundamental Research Funds for the Central Universities under Grant YJ202342\.

## References

- \[1\] Anthropic \(2025\) Introducing claude 4\. Cited by: [§5\.3](https://arxiv.org/html/2606.26196#S5.SS3.p2.1)\.
- \[2\] S\. Bai, M\. Li, Y\. Liu, J\. Tang, H\. Zhang, L\. Sun, X\. Chu, and Y\. Tang \(2025\) Univg\-r1: reasoning guided universal visual grounding with reinforcement learning\. arXiv preprint arXiv:2505\.14231\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[3\] X\. Bao, S\. Sun, S\. Ma, K\. Zheng, Y\. Guo, G\. Zhao, Y\. Zheng, and X\. Wang \(2024\) Cores: orchestrating the dance of reasoning and segmentation\. In European Conference on Computer Vision, pp\. 187–204\. Cited by: [§5\.2\.1](https://arxiv.org/html/2606.26196#S5.SS2.SSS1.p2.1)\.
- \[4\] M\. Bigverdi, Z\. Luo, C\. Hsieh, E\. Shen, D\. Chen, L\. G\. Shapiro, and R\. Krishna \(2025\) Perception tokens enhance visual reasoning in multimodal language models\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 3836–3845\. Cited by: [§5\.2\.3](https://arxiv.org/html/2606.26196#S5.SS2.SSS3.p1.1)\.
- \[5\] D\. Caffagni, F\. Cocchi, L\. Barsellotti, N\. Moratelli, S\. Sarto, L\. Baraldi, M\. Cornia, and R\. Cucchiara \(2024\) The revolution of multimodal large language models: a survey\. arXiv preprint arXiv:2402\.12451\. Cited by: [item 1](https://arxiv.org/html/2606.26196#S2.I1.i1.p1.1)\.
- \[6\] D\. Cai, X\. Yang, Y\. Liu, D\. Wang, S\. Feng, Y\. Zhang, and S\. Poria \(2025\) Pixel\-level reasoning segmentation via multi\-turn conversations\. arXiv preprint arXiv:2502\.09447\. Cited by: [§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p4.1)\.
- \[7\] M\. Cai, H\. Liu, S\. K\. Mustikovela, G\. P\. Meyer, Y\. Chai, D\. Park, and Y\. J\. Lee \(2024\) Vip\-llava: making large multimodal models understand arbitrary visual prompts\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 12914–12923\. Cited by: [§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p4.1)\.
- \[8\] L\. Cao, V\. Buchner, Z\. Senane, and F\. Yang \(2024\) Introducing genception for multimodal llm benchmarking: you may bypass annotations\. In Proceedings of the 4th workshop on trustworthy natural language processing \(trustNLP 2024\), pp\. 196–201\. Cited by: [§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p2.1)\.
- \[9\] M\. Cao, H\. Zhao, C\. Zhang, X\. Chang, I\. Reid, and X\. Liang \(2025\) Ground\-r1: incentivizing grounded visual reasoning via reinforcement learning\. arXiv preprint arXiv:2505\.20272\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[10\] S\. Cao, Z\. Wei, J\. Kuen, K\. Liu, L\. Zhang, J\. Gu, H\. Jung, L\. Gui, and Y\. Wang \(2025\) Refer to anything with vision\-language prompts\. arXiv preprint arXiv:2506\.05342\. Cited by: [§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p2.1)\.
- \[11\] Y\. Cao, P\. Zhang, X\. Dong, D\. Lin, and J\. Wang \(2024\) Dualfocus: integrating macro and micro perspectives in multi\-modal large language models\. arXiv preprint arXiv:2402\.14767\. Cited by: [§5\.2\.1](https://arxiv.org/html/2606.26196#S5.SS2.SSS1.p2.1)\.
- \[12\] J\. Cha, W\. Kang, J\. Mun, and B\. Roh \(2024\) Honeybee: locality\-enhanced projector for multimodal llm\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 13817–13827\. Cited by: [§3\.1\.3](https://arxiv.org/html/2606.26196#S3.SS1.SSS3.p1.1)\.
- \[13\] W\. Chai, E\. Song, Y\. Du, C\. Meng, V\. Madhavan, O\. Bar\-Tal, J\. Hwang, S\. Xie, and C\. D\. Manning \(2024\) Auroracap: efficient, performant video detailed captioning and a new benchmark\. arXiv preprint arXiv:2410\.03051\. Cited by: [§1](https://arxiv.org/html/2606.26196#S1.p1.1)\.
- \[14\] Y\. Chao, J\. Liu, J\. Tang, and G\. Wu \(2025\) Anomalyr1: a grpo\-based end\-to\-end mllm for industrial anomaly detection\. arXiv preprint arXiv:2504\.11914\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p5.1)\.
- \[15\] A\. Chen, A\. Li, B\. Gong, B\. Jiang, B\. Fei, B\. Yang, B\. Shan, C\. Yu, C\. Wang, C\. Zhu, et al\. \(2025\) MiniMax\-m1: scaling test\-time compute efficiently with lightning attention\. arXiv preprint arXiv:2506\.13585\. Cited by: [§7\.3\.3](https://arxiv.org/html/2606.26196#S7.SS3.SSS3.p1.1)\.
- \[16\] G\. Chen, L\. Shen, R\. Shao, X\. Deng, and L\. Nie \(2024\) Lion: empowering multimodal large language model with dual\-level visual knowledge\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 26540–26550\. Cited by: [§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p5.1)\.
- \[17\] J\. Chen, T\. Liang, S\. Siu, Z\. Wang, K\. Wang, Y\. Wang, Y\. Ni, W\. Zhu, Z\. Jiang, B\. Lyu, et al\. \(2024\) Mega\-bench: scaling multimodal evaluation to over 500 real\-world tasks\. arXiv preprint arXiv:2410\.10563\. Cited by: [§1](https://arxiv.org/html/2606.26196#S1.p2.1)\.
- \[18\] J\. Chen, Y\. Liu, D\. Li, X\. An, W\. Deng, Z\. Feng, Y\. Zhao, and Y\. Xie \(2024\) Plug\-and\-play grounding of reasoning in multimodal large language models\. arXiv preprint arXiv:2403\.19322\. Cited by: [§5\.1\.1](https://arxiv.org/html/2606.26196#S5.SS1.SSS1.p2.1)\.
- \[19\] K\. Chen, Z\. Zhang, W\. Zeng, R\. Zhang, F\. Zhu, and R\. Zhao \(2023\) Shikra: unleashing multimodal llm’s referential dialogue magic\. arXiv preprint arXiv:2306\.15195\. Cited by: [§6\.1](https://arxiv.org/html/2606.26196#S6.SS1.p2.1)\.
- \[20\] L\. Chen, L\. Li, H\. Zhao, Y\. Song, and Vinci \(2025\) R1\-v: reinforcing super generalization ability in vision\-language models with less than $3\. Note: Accessed: 2025\-02\-02\. [https://github\.com/Deep\-Agent/R1\-V](https://github.com/Deep-Agent/R1-V)\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[21\] Q\. Chen, L\. Qin, J\. Liu, D\. Peng, J\. Guan, P\. Wang, M\. Hu, Y\. Zhou, T\. Gao, and W\. Che \(2025\) Towards reasoning era: a survey of long chain\-of\-thought for reasoning large language models\. arXiv preprint arXiv:2503\.09567\. Cited by: [§1](https://arxiv.org/html/2606.26196#S1.p3.1), [item 3](https://arxiv.org/html/2606.26196#S2.I1.i3.p1.1)\.
- \[22\] X\. Chen, M\. Zhu, S\. Liu, X\. Wu, X\. Xu, Y\. Liu, X\. Bai, and H\. Zhao \(2025\) MiCo: multi\-image contrast for reinforcement visual reasoning\. arXiv preprint arXiv:2506\.22434\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p6.1)\.
- \[23\] X\. Chen, Y\. Zhang, Y\. Guan, B\. Zeng, Y\. Shi, S\. Yang, P\. Wan, Q\. Liu, L\. Wang, and T\. Tan \(2025\) VersaVid\-r1: a versatile video understanding and reasoning model from question answering to captioning tasks\. arXiv preprint arXiv:2506\.09079\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p6.1)\.
- \[24\] Y\. Chen, W\. Li, C\. Sun, Y\. F\. Wang, and C\. Chen \(2024\) SAM4MLLM: enhance multi\-modal large language model for referring expression segmentation\. In European Conference on Computer Vision, pp\. 323–340\. Cited by: [§4\.3\.2](https://arxiv.org/html/2606.26196#S4.SS3.SSS2.p1.2)\.
- \[25\] Z\. Chen, X\. Luo, and D\. Li \(2025\) Visrl: intention\-driven visual perception via reinforced reasoning\. arXiv preprint arXiv:2503\.07523\. Cited by: [§6\.2\.2](https://arxiv.org/html/2606.26196#S6.SS2.SSS2.p1.1)\.
- \[26\] A\. Cheng, H\. Yin, Y\. Fu, Q\. Guo, R\. Yang, J\. Kautz, X\. Wang, and S\. Liu \(2024\) SpatialRGPT: grounded spatial reasoning in vision language models\. arXiv preprint arXiv:2406\.01584\. Cited by: [§3\.2\.2](https://arxiv.org/html/2606.26196#S3.SS2.SSS2.p1.1)\.
- \[27\] K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, et al\. \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\. Cited by: [§1\.1](https://arxiv.org/html/2606.26196#S1.SS1.p3.1)\.
- \[28\] A\. Deng, T\. Chen, S\. Yu, T\. Yang, L\. Spencer, Y\. Tian, A\. S\. Mian, M\. Bansal, and C\. Chen \(2025\) Motion\-grounded video reasoning: understanding and perceiving motion at pixel level\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 8625–8636\. Cited by: [§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p4.3)\.
- \[29\] H\. Deng, D\. Zou, R\. Ma, H\. Luo, Y\. Cao, and Y\. Kang \(2025\) Boosting the generalization and reasoning of vision language models with curriculum reinforcement learning\. arXiv preprint arXiv:2503\.07065\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[30\] J\. Deng, T\. Weng, T\. Yang, W\. Luo, Z\. Li, and W\. Jiang \(2025\) LlamaSeg: image segmentation via autoregressive mask generation\. arXiv preprint arXiv:2505\.19422\. Cited by: [§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p2.1)\.
- \[31\] H\. Dong, Z\. Kang, W\. Yin, X\. Liang, C\. Feng, and J\. Ran \(2025\) Scalable vision language model training via high quality data curation\. arXiv preprint arXiv:2501\.05952\. Cited by: [§7\.3\.1](https://arxiv.org/html/2606.26196#S7.SS3.SSS1.p1.1)\.
- \[32\] Y\. Dong, Z\. Liu, H\. Sun, J\. Yang, W\. Hu, Y\. Rao, and Z\. Liu \(2025\) Insight\-v: exploring long\-chain visual reasoning with multimodal large language models\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 9062–9072\. Cited by: [§6\.2\.2](https://arxiv.org/html/2606.26196#S6.SS2.SSS2.p1.1)\.
- \[33\] D\. Dwibedi, V\. Jain, J\. J\. Tompson, A\. Zisserman, and Y\. Aytar \(2024\) Flexcap: describe anything in images in controllable detail\. Advances in Neural Information Processing Systems 37, pp\. 111172–111198\. Cited by: [§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p5.1)\.
- \[34\] Hugging Face \(2025\) Open r1: a fully open reproduction of deepseek\-r1\. January 2025\. URL https://github\.com/huggingface/open\-r1\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p2.1)\.
- \[35\] X\. Fan, T\. Ji, C\. Jiang, S\. Li, S\. Jin, S\. Song, J\. Wang, B\. Hong, L\. Chen, G\. Zheng, et al\. \(2024\) Mousi: poly\-visual\-expert vision\-language models\. arXiv preprint arXiv:2401\.17221\. Cited by: [§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p1.1)\.
- \[36\] Y\. Fan, X\. He, D\. Yang, K\. Zheng, C\. Kuo, Y\. Zheng, S\. J\. Narayanaraju, X\. Guan, and X\. E\. Wang \(2025\) GRIT: teaching mllms to think with images\. arXiv preprint arXiv:2505\.15879\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[37\] H\. Fei, S\. Wu, H\. Zhang, T\. Chua, and S\. Yan \(2024\) Vitron: a unified pixel\-level vision llm for understanding, generating, segmenting, editing\. arXiv preprint arXiv:2412\.19806\. Cited by: [§4\.2](https://arxiv.org/html/2606.26196#S4.SS2.p3.1)\.
- \[38\] X\. Fu, M\. Liu, Z\. Yang, J\. Corring, Y\. Lu, J\. Yang, D\. Roth, D\. Florencio, and C\. Zhang \(2025\) ReFocus: visual editing as a chain of thought for structured image understanding\. arXiv preprint arXiv:2501\.05452\. Cited by: [§5\.3](https://arxiv.org/html/2606.26196#S5.SS3.p3.1)\.
- \[39\] S\. Gong, Y\. Zhuge, L\. Zhang, Z\. Yang, P\. Zhang, and H\. Lu \(2025\) The devil is in temporal token: high quality video reasoning segmentation\. arXiv preprint arXiv:2501\.08549\. Cited by: [§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p2.2)\.
- \[40\] D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi, et al\. \(2025\) Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\. arXiv preprint arXiv:2501\.12948\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p1.1), [§6\.2](https://arxiv.org/html/2606.26196#S6.SS2.p1.1)\.
- \[41\] Q\. Guo, S\. De Mello, H\. Yin, W\. Byeon, K\. C\. Cheung, Y\. Yu, P\. Luo, and S\. Liu \(2024\) Regiongpt: towards region understanding vision language model\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 13796–13806\. Cited by: [§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p4.1)\.
- \[42\] Z\. Guo, R\. Xu, Y\. Yao, J\. Cui, Z\. Ni, C\. Ge, T\. Chua, Z\. Liu, and G\. Huang \(2024\) Llava\-uhd: an lmm perceiving any aspect ratio and high\-resolution images\. In European Conference on Computer Vision, pp\. 390–406\. Cited by: [§1\.1](https://arxiv.org/html/2606.26196#S1.SS1.p3.1)\.
- \[43\] T\. Gupta and A\. Kembhavi \(2023\) Visual programming: compositional visual reasoning without training\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 14953–14962\. Cited by: [§5\.3](https://arxiv.org/html/2606.26196#S5.SS3.p2.1)\.
- \[44\] K\. Han, Y\. Hu, M\. Qu, H\. Shi, Y\. Zhao, and Y\. Wei \(2024\) ROSE: revolutionizing open\-set dense segmentation with patch\-wise perceptual large multimodal model\. arXiv preprint arXiv:2412\.00153\. Cited by: [§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p5.4)\.
- \[45\] J\. He, Y\. Wang, L\. Wang, H\. Lu, J\. He, J\. Lan, B\. Luo, and X\. Xie \(2024\) Multi\-modal instruction tuned llms with fine\-grained visual perception\. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pp\. 13980–13990\. Cited by: [§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p2.1)\.
- \[46\] K\. He, G\. Gkioxari, P\. Dollár, and R\. Girshick \(2017\) Mask r\-cnn\. In Proceedings of the IEEE international conference on computer vision, pp\. 2961–2969\. Cited by: [§3\.1\.1](https://arxiv.org/html/2606.26196#S3.SS1.SSS1.p1.1)\.
- \[47\] X\. He, L\. Wei, L\. Xie, and Q\. Tian \(2024\) Incorporating visual experts to resolve the information loss in multimodal large language models\. arXiv preprint arXiv:2401\.03105\. Cited by: [§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p3.1)\.
- \[48\] Y\. Hu, W\. Shi, X\. Fu, D\. Roth, M\. Ostendorf, L\. Zettlemoyer, N\. A\. Smith, and R\. Krishna \(2024\) Visual sketchpad: sketching as a visual chain of thought for multimodal language models\. Advances in Neural Information Processing Systems 37, pp\. 139348–139379\. Cited by: [§5\.3](https://arxiv.org/html/2606.26196#S5.SS3.p2.1)\.
- \[49\] Y\. Hu, O\. Stretcu, C\. Lu, K\. Viswanathan, K\. Hata, E\. Luo, R\. Krishna, and A\. Fuxman \(2024\) Visual program distillation: distilling tools and programmatic reasoning into vision\-language models\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 9590–9601\. Cited by: [§6\.1](https://arxiv.org/html/2606.26196#S6.SS1.p2.1)\.
- \[50\] H\. Hua, Q\. Liu, L\. Zhang, J\. Shi, S\. Y\. Kim, Z\. Zhang, Y\. Wang, J\. Zhang, Z\. Lin, and J\. Luo \(2025\) Finecaption: compositional image captioning focusing on wherever you want at any granularity\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 24763–24773\. Cited by: [§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p2.1)\.
- \[51\] B\. Huang, X\. Wang, H\. Chen, Z\. Song, and W\. Zhu \(2024\) Vtimellm: empower llm to grasp video moments\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 14271–14280\. Cited by: [§6\.1](https://arxiv.org/html/2606.26196#S6.SS1.p3.1)\.
- \[52\] J\. Huang, Z\. Xu, J\. Zhou, T\. Liu, Y\. Xiao, M\. Ou, B\. Ji, X\. Li, and K\. Yuan \(2025\) SAM\-r1: leveraging sam for reward feedback in multimodal segmentation via reinforcement learning\. arXiv preprint arXiv:2505\.22596\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p4.1)\.
- \[53\] Z\. Huang, Y\. Ji, A\. S\. Rajan, Z\. Cai, W\. Xiao, J\. Hu, and Y\. J\. Lee \(2025\) VisualToolAgent \(vista\): a reinforcement learning framework for visual tool selection\. arXiv preprint arXiv:2505\.20289\. Cited by: [§7\.2\.2](https://arxiv.org/html/2606.26196#S7.SS2.SSS2.p1.1)\.
- \[54\] M\. Huh, B\. Cheung, T\. Wang, and P\. Isola \(2024\) The platonic representation hypothesis\. arXiv preprint arXiv:2405\.07987\. Cited by: [§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p4.1)\.
- \[55\] H\. Hwang, N\. D\. Son, and D\. Kim \(2026\) Model\-dowser: data\-free importance probing to mitigate catastrophic forgetting in multimodal large language models\. arXiv preprint arXiv:2602\.04509\. Cited by: [§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p3.1)\.
- \[56\] A\. Jaech, A\. Kalai, A\. Lerer, A\. Richardson, A\. El\-Kishky, A\. Low, A\. Helyar, A\. Madry, A\. Beutel, A\. Carney, et al\. \(2024\) Openai o1 system card\. arXiv preprint arXiv:2412\.16720\. Cited by: [§6\.2](https://arxiv.org/html/2606.26196#S6.SS2.p1.1)\.
- \[57\] J\. Jain, J\. Yang, and H\. Shi \(2024\) Vcoder: versatile vision encoders for multimodal large language models\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 27992–28002\. Cited by: [§3\.2\.2](https://arxiv.org/html/2606.26196#S3.SS2.SSS2.p1.1)\.
- \[58\] D\. Jang, Y\. Cho, S\. Lee, T\. Kim, and D\. Kim \(2025\) Mmr: a large\-scale benchmark dataset for multi\-target and multi\-granularity reasoning segmentation\. arXiv preprint arXiv:2503\.13881\. Cited by: [§4\.1\.3](https://arxiv.org/html/2606.26196#S4.SS1.SSS3.p1.3)\.
- \[59\] M\. Jiang, J\. Gao, J\. Zhan, and D\. Wang \(2025\) Mac: a live benchmark for multimodal large language models in scientific understanding\. arXiv preprint arXiv:2508\.15802\. Cited by: [§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p2.1)\.
- \[60\] Q\. Jiang, X\. Chen, Z\. Zeng, J\. Yu, and L\. Zhang \(2025\) Rex\-thinker: grounded object referring via chain\-of\-thought reasoning\. arXiv preprint arXiv:2506\.04034\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[61\] Q\. Jiang, G\. Luo, Y\. Yang, Y\. Xiong, Y\. Chen, Z\. Zeng, T\. Ren, and L\. Zhang \(2024\) Chatrex: taming multimodal llm for joint perception and understanding\. arXiv preprint arXiv:2411\.18363\. Cited by: [§3\.1\.1](https://arxiv.org/html/2606.26196#S3.SS1.SSS1.p1.1), [§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p1.1)\.
- \[62\] Q\. Jiang, L\. Wu, Z\. Zeng, T\. Ren, Y\. Xiong, Y\. Chen, Q\. Liu, and L\. Zhang \(2025\) Referring to any person\. arXiv preprint arXiv:2503\.08507\. Cited by: [§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p1.1)\.
- \[63\] S\. Jiang, Y\. Zhang, C\. Zhou, Y\. Jin, Y\. Feng, J\. Wu, and Z\. Liu \(2024\) Joint visual and text prompting for improved object\-centric perception with multimodal large language models\. arXiv preprint arXiv:2404\.04514\. Cited by: [§5\.1\.1](https://arxiv.org/html/2606.26196#S5.SS1.SSS1.p2.1)\.
- \[64\] Y\. Jiao, S\. Chen, Z\. Jie, J\. Chen, L\. Ma, and Y\. Jiang \(2024\) Lumen: unleashing versatile vision\-centric capabilities of large multimodal models\. arXiv preprint arXiv:2403\.07304\. Cited by: [§4\.2](https://arxiv.org/html/2606.26196#S4.SS2.p3.1)\.
- \[65\] W\. Jin, S\. Kim, and S\. Kim \(2025\) InterRVOS: interaction\-aware referring video object segmentation\. arXiv preprint arXiv:2506\.02356\. Cited by: [§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p4.3)\.
- \[66\] S\. Kao, Y\. Tai, and C\. Tang \(2025\) Think before you segment: high\-quality reasoning segmentation with gpt chain of thoughts\. arXiv preprint arXiv:2503\.07503\. Cited by: [§5\.1\.1](https://arxiv.org/html/2606.26196#S5.SS1.SSS1.p2.1)\.
- \[67\] S\. Kao, Y\. Tai, and C\. Tang \(2025\) ThinkVideo: high\-quality reasoning video segmentation with chain of thoughts\. arXiv preprint arXiv:2505\.18561\. Cited by: [§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p2.2)\.
- \[68\] Z\. Khan, V\. K\. BG, S\. Schulter, Y\. Fu, and M\. Chandraker \(2024\) Self\-training large language models for improved visual program synthesis with visual reinforcement\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 14344–14353\. Cited by: [§6\.2\.2](https://arxiv.org/html/2606.26196#S6.SS2.SSS2.p1.1)\.
- \[69\] A\. Kirillov, E\. Mintun, N\. Ravi, H\. Mao, C\. Rolland, L\. Gustafson, T\. Xiao, S\. Whitehead, A\. C\. Berg, W\. Lo, et al\. \(2023\) Segment anything\. In Proceedings of the IEEE/CVF international conference on computer vision, pp\. 4015–4026\. Cited by: [§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p1.1)\.
- \[70\] S\. Kumar, B\. Zhao, L\. Dirac, and P\. Varshavskaya \(2025\) Reinforcing vlms to use tools for detailed visual reasoning under resource constraints\. arXiv preprint arXiv:2506\.14821\. Cited by: [§7\.2\.1](https://arxiv.org/html/2606.26196#S7.SS2.SSS1.p1.1)\.
- \[71\] X\. Lai, Z\. Tian, Y\. Chen, Y\. Li, Y\. Yuan, S\. Liu, and J\. Jia \(2024\) Lisa: reasoning segmentation via large language model\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 9579–9589\. Cited by: [§1](https://arxiv.org/html/2606.26196#S1.p1.1), [§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p1.1), [§4\.3\.2](https://arxiv.org/html/2606.26196#S4.SS3.SSS2.p1.2)\.
- \[72\] M\. Lan, C\. Chen, Y\. Zhou, J\. Xu, Y\. Ke, X\. Wang, L\. Feng, and W\. Zhang \(2024\) Text4seg: reimagining image segmentation as text generation\. arXiv preprint arXiv:2410\.09855\. Cited by: [§4\.3\.2](https://arxiv.org/html/2606.26196#S4.SS3.SSS2.p1.2)\.
- \[73\] B\. Lee, B\. Park, C\. Won Kim, and Y\. Man Ro \(2024\) Moai: mixture of all intelligence for large language and vision models\. In European Conference on Computer Vision, pp\. 273–302\. Cited by: [§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p2.1)\.
- \[74\] C\. Li, X\. Li, Z\. Zhang, Y\. Tian, Z\. Jia, X\. Liu, X\. Min, J\. Wang, H\. Duan, K\. Chen, et al\. \(2025\) Information density principle for mllm benchmarks\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 4167–4177\. Cited by: [§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p2.1)\.
- \[75\] G\. Li, J\. Xu, Y\. Zhao, and Y\. Peng \(2025\) Dyfo: a training\-free dynamic focus visual search for enhancing lmms in fine\-grained visual understanding\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 9098–9108\. Cited by: [§5\.2\.2](https://arxiv.org/html/2606.26196#S5.SS2.SSS2.p1.1)\.
- \[76\] J\. Li, D\. Chen, Y\. Hong, Z\. Chen, P\. Chen, Y\. Shen, and C\. Gan \(2023\) Covlm: composing visual entities and relationships in large language models via communicative decoding\. arXiv preprint arXiv:2311\.03354\. Cited by: [§5\.2\.3](https://arxiv.org/html/2606.26196#S5.SS2.SSS3.p1.1)\.
- \[77\] L\. Li, W\. Chen, J\. Li, and L\. Chen \(2025\) Relation\-r1: cognitive chain\-of\-thought guided reinforcement learning for unified relational comprehension\. arXiv preprint arXiv:2504\.14642\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[78\] W\. Li, G\. Chu, J\. Chen, G\. Xie, C\. Shan, and F\. Zhao \(2025\) Lad\-reasoner: tiny multimodal models are good reasoners for logical anomaly detection\. arXiv preprint arXiv:2504\.12749\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p5.1)\.
- \[79\] X\. Li, H\. Yuan, W\. Li, H\. Ding, S\. Wu, W\. Zhang, Y\. Li, K\. Chen, and C\. C\. Loy \(2024\) Omg\-seg: is one model good enough for all segmentation?\. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp\. 27948–27959\. Cited by: [§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p2.1)\.
- \[80\] Y\. Li, H\. Huang, C\. Chen, K\. Huang, C\. Huang, Z\. Guo, Z\. Liu, J\. Xu, Y\. Li, R\. Li, et al\. \(2025\) Migician: revealing the magic of free\-form multi\-image grounding in multimodal large language models\. arXiv preprint arXiv:2501\.05767\. Cited by: [§6\.1](https://arxiv.org/html/2606.26196#S6.SS1.p3.1)\.
- \[81\] Y\. Li, H\. Wang, S\. Yuan, M\. Liu, D\. Zhao, Y\. Guo, C\. Xu, G\. Shi, and W\. Zuo \(2023\) Myriad: large multimodal model by applying vision experts for industrial anomaly detection\. arXiv preprint arXiv:2310\.19070\. Cited by: [§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p2.1)\.
- \[82\] Z\. Li, R\. Luo, J\. Zhang, M\. Qiu, X\. Huang, and Z\. Wei \(2024\) Vocot: unleashing visually grounded multi\-step reasoning in large multi\-modal models\. arXiv preprint arXiv:2405\.16919\. Cited by: [§5\.2\.1](https://arxiv.org/html/2606.26196#S5.SS2.SSS1.p3.1)\.
- \[83\] Z\. Li, B\. Yang, Q\. Liu, S\. Zhang, Z\. Ma, L\. Yin, L\. Deng, Y\. Sun, Y\. Liu, and X\. Bai \(2025\) LIRA: inferring segmentation in large multi\-modal models with local interleaved region assistance\. arXiv preprint arXiv:2507\.06272\. Cited by: [§5\.2\.3](https://arxiv.org/html/2606.26196#S5.SS2.SSS3.p2.1)\.
- \[84\] Z\. Li, Q\. Xu, D\. Zhang, H\. Song, Y\. Cai, Q\. Qi, R\. Zhou, J\. Pan, Z\. Li, V\. T\. Vu, et al\. \(2024\) Groundinggpt: language enhanced multi\-modal grounding model\. arXiv preprint arXiv:2401\.06071\. Cited by: [§6\.1](https://arxiv.org/html/2606.26196#S6.SS1.p3.1)\.
- \[85\] Z\. Li, Z\. Ma, M\. Li, S\. Li, Y\. Rong, T\. Xu, Z\. Zhang, D\. Zhao, and W\. Huang \(2025\) STAR\-r1: spacial transformation reasoning by reinforcing multimodal llms\. arXiv preprint arXiv:2505\.15804\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p6.1)\.
- \[86\] C\. X\. Liang, P\. Tian, C\. H\. Yin, Y\. Yua, W\. An\-Hou, L\. Ming, T\. Wang, Z\. Bi, and M\. Liu \(2024\) A comprehensive survey and guide to multimodal large language models in vision\-language tasks\. arXiv preprint arXiv:2411\.06284\. Cited by: [item 1](https://arxiv.org/html/2606.26196#S2.I1.i1.p1.1)\.
- \[87\] L\. Lin, X\. Yu, Z\. Pang, and Y\. Wang \(2025\) Glus: global\-local reasoning unified into a single large language model for video segmentation\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 8658–8667\. Cited by: [§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p4.3)\.
- \[88\] T\. Lin, M\. Maire, S\. Belongie, J\. Hays, P\. Perona, D\. Ramanan, P\. Dollár, and C\. L\. Zitnick \(2014\) Microsoft coco: common objects in context\. In European conference on computer vision, pp\. 740–755\. Cited by: [§1](https://arxiv.org/html/2606.26196#S1.p2.1), [§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p3.1)\.
- \[89\] W\. Lin, X\. Wei, R\. An, P\. Gao, B\. Zou, Y\. Luo, S\. Huang, S\. Zhang, and H\. Li \(2024\) Draw\-and\-understand: leveraging visual prompts to enable mllms to comprehend what you want\. arXiv preprint arXiv:2403\.20271\. Cited by: [§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p2.1)\.
- \[90\] W\. Lin, Y\. Ma, X\. Sun, S\. He, J\. Ji, L\. Cao, and R\. Ji \(2025\) HRSeg: high\-resolution visual perception and enhancement for reasoning segmentation\. arXiv preprint arXiv:2507\.12883\. Cited by: [§4\.3\.2](https://arxiv.org/html/2606.26196#S4.SS3.SSS2.p1.2)\.
- \[91\] D\. Liu, R\. Zhang, L\. Qiu, S\. Huang, W\. Lin, S\. Zhao, S\. Geng, Z\. Lin, P\. Jin, K\. Zhang, et al\. \(2024\) Sphinx\-x: scaling data and parameters for a family of multi\-modal large language models\. arXiv preprint arXiv:2402\.05935\. Cited by: [§1\.1](https://arxiv.org/html/2606.26196#S1.SS1.p3.1)\.
- \[92\] H\. Liu, C\. Li, Q\. Wu, and Y\. J\. Lee \(2023\) Visual instruction tuning\. Advances in neural information processing systems 36, pp\. 34892–34916\. Cited by: [§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p6.1), [§5\.1\.1](https://arxiv.org/html/2606.26196#S5.SS1.SSS1.p1.1), [§6\.1](https://arxiv.org/html/2606.26196#S6.SS1.p2.1)\.
- \[93\] J\. Liu, W\. Wang, Y\. Zhang, Y\. Tang, X\. He, L\. Guo, T\. Yue, and X\. Wang \(2025\) Towards unified referring expression segmentation across omni\-level visual target granularities\. arXiv preprint arXiv:2504\.01954\. Cited by: [§4\.1\.3](https://arxiv.org/html/2606.26196#S4.SS1.SSS3.p1.3)\.
- \[94\] S\. Liu, H\. Cheng, H\. Liu, H\. Zhang, F\. Li, T\. Ren, X\. Zou, J\. Yang, H\. Su, J\. Zhu, et al\. \(2024\) Llava\-plus: learning to use tools for creating multimodal agents\. In European Conference on Computer Vision, pp\. 126–142\. Cited by: [§5\.1\.1](https://arxiv.org/html/2606.26196#S5.SS1.SSS1.p1.1)\.
- \[95\] S\. Liu, Z\. Zeng, T\. Ren, F\. Li, H\. Zhang, J\. Yang, Q\. Jiang, C\. Li, J\. Yang, H\. Su, et al\. \(2024\) Grounding dino: marrying dino with grounded pre\-training for open\-set object detection\. In European conference on computer vision, pp\. 38–55\. Cited by: [§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p1.1)\.
- \[96\] Y\. Liu, H\. Duan, Y\. Zhang, B\. Li, S\. Zhang, W\. Zhao, Y\. Yuan, J\. Wang, C\. He, Z\. Liu, et al\. \(2024\) Mmbench: is your multi\-modal model an all\-around player?\. In European conference on computer vision, pp\. 216–233\. Cited by: [§1](https://arxiv.org/html/2606.26196#S1.p1.1)\.
- \[97\] Y\. Liu, Z\. Zhao, Z\. Zhuang, L\. Tian, X\. Zhou, and J\. Zhou \(2024\) Points: improving your vision\-language model with affordable strategies\. arXiv preprint arXiv:2409\.04828\. Cited by: [§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p1.1)\.
- \[98\] Y\. Liu, B\. Peng, Z\. Zhong, Z\. Yue, F\. Lu, B\. Yu, and J\. Jia \(2025\) Seg\-zero: reasoning\-chain guided segmentation via cognitive reinforcement\. arXiv preprint arXiv:2503\.06520\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p2.1), [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p4.1)\.
- \[99\] Y\. Liu, T\. Qu, Z\. Zhong, B\. Peng, S\. Liu, B\. Yu, and J\. Jia \(2025\) VisionReasoner: unified visual perception and reasoning via reinforcement learning\. arXiv preprint arXiv:2505\.12081\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p4.1)\.
- \[100\] Z\. Liu, Z\. Sun, Y\. Zang, X\. Dong, Y\. Cao, H\. Duan, D\. Lin, and J\. Wang \(2025\) Visual\-rft: visual reinforcement fine\-tuning\. arXiv preprint arXiv:2503\.01785\. Cited by: [§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[101\] Z\. Liu, Y\. Dong, Y\. Rao, J\. Zhou, and J\. Lu \(2024\) Chain\-of\-spot: interactive reasoning improves large vision\-language models\. arXiv preprint arXiv:2403\.12966\. Cited by: [§5\.2\.1](https://arxiv.org/html/2606.26196#S5.SS2.SSS1.p2.1)\.
- \[102\] B\. Luan, H\. Feng, H\. Chen, Y\. Wang, W\. Zhou, and H\. Li \(2024\) Textcot: zoom in for enhanced multimodal text\-rich image understanding\. arXiv preprint arXiv:2404\.09797\. Cited by: [§5\.2\.1](https://arxiv.org/html/2606.26196#S5.SS2.SSS1.p2.1)\.
- \[103\] C\. Ma, Y\. Jiang, J\. Wu, Z\. Yuan, and X\. Qi \(2024\) Groma: localized visual tokenization for grounding multimodal large language models\. In European Conference on Computer Vision, pp\. 417–435\. Cited by: [§3\.1\.1](https://arxiv.org/html/2606.26196#S3.SS1.SSS1.p1.1)\.
- \[104\] J\. Ma, J\. Wang, J\. Luo, P\. Yu, and G\. Zhou \(2025\) Sherlock: towards multi\-scene video abnormal event extraction and localization via a global\-local spatial\-sensitive llm\. In Proceedings of the ACM on Web Conference 2025, pp\. 4004–4013\. Cited by: [§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p1.1)\.
- \[105\] T\. Ma, L\. Xie, Y\. Tian, B\. Yang, and Q\. Ye \(2024\) ClawMachine: learning to fetch visual tokens for referential comprehension\. arXiv preprint arXiv:2406\.11327\. Cited by: [§5\.2\.3](https://arxiv.org/html/2606.26196#S5.SS2.SSS3.p2.1)\.
- \[106\]X\. Ma, Z\. Ding, Z\. Luo, C\. Chen, Z\. Guo, D\. F\. Wong, X\. Feng, and M\. Sun\(2025\)Deepperception: advancing r1\-like cognitive visual perception in mllms for knowledge\-intensive visual grounding\.arXiv preprint arXiv:2503\.12797\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[107\]Z\. Ma, J\. Zhang, Z\. Liu, J\. Zhang, J\. Tan, M\. Shu, J\. C\. Niebles, S\. Heinecke, H\. Wang, C\. Xiong,et al\.\(2024\)TACO: learning multi\-modal action models with synthetic chains\-of\-thought\-and\-action\.arXiv preprint arXiv:2412\.05479\.Cited by:[§5\.1\.2](https://arxiv.org/html/2606.26196#S5.SS1.SSS2.p1.1)\.
- \[108\]Y\. Man, D\. Huang, G\. Liu, S\. Sheng, S\. Liu, L\. Gui, J\. Kautz, Y\. Wang, and Z\. Yu\(2025\)Argus: vision\-centric reasoning with grounded chain\-of\-thought\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 14268–14280\.Cited by:[§5\.2\.3](https://arxiv.org/html/2606.26196#S5.SS2.SSS3.p1.1)\.
- \[109\]A\. Masry, D\. X\. Long, J\. Q\. Tan, S\. Joty, and E\. Hoque\(2022\)Chartqa: a benchmark for question answering about charts with visual and logical reasoning\.arXiv preprint arXiv:2203\.10244\.Cited by:[§1](https://arxiv.org/html/2606.26196#S1.p1.1)\.
- \[110\]D\. Meng, R\. Huang, Z\. Dai, X\. Li, Y\. Xu, J\. Zhang, Z\. Huang, M\. Zhang, L\. Zhang, Y\. Liu,et al\.\(2025\)VideoCap\-r1: enhancing mllms for video captioning via structured thinking\.arXiv preprint arXiv:2506\.01725\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p6.1)\.
- \[111\]M\. Minderer, A\. Gritsenko, and N\. Houlsby\(2023\)Scaling open\-vocabulary object detection\.Advances in Neural Information Processing Systems36,pp\. 72983–73007\.Cited by:[§3\.1\.1](https://arxiv.org/html/2606.26196#S3.SS1.SSS1.p1.1)\.
- \[112\]K\. A\. Nguyen, A\. Juvekar, T\. Yu, M\. Wahed, and I\. Lourentzou\(2025\)CALICO: part\-focused semantic co\-segmentation with large vision\-language models\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 4550–4561\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p5.1)\.
- \[113\]OpenAI\(2025\-04\-16\)Thinking with images\.Research ReleaseOpenAI\.External Links:[Link](https://openai.com/index/thinking-with-images/)Cited by:[§7\.2](https://arxiv.org/html/2606.26196#S7.SS2.p1.1)\.
- \[114\]J\. Park, J\. Na, J\. Kim, and H\. J\. Kim\(2025\)DeepVideo\-r1: video reinforcement fine\-tuning via difficulty\-aware regressive grpo\.arXiv preprint arXiv:2506\.07464\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p6.1)\.
- \[115\]S\. Park, H\. Kim, J\. Kim, S\. Kim, and Y\. M\. Ro\(2025\)DIP\-r1: deep inspection and perception with rl looking through and understanding complex scenes\.arXiv preprint arXiv:2505\.23179\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[116\]Z\. Peng, W\. Wang, L\. Dong, Y\. Hao, S\. Huang, S\. Ma, and F\. Wei\(2023\)Kosmos\-2: grounding multimodal large language models to the world\.arXiv preprint arXiv:2306\.14824\.Cited by:[§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p6.1),[§6\.1](https://arxiv.org/html/2606.26196#S6.SS1.p2.1)\.
- \[117\]R\. Pi, L\. Yao, J\. Gao, J\. Zhang, and T\. Zhang\(2024\)Perceptiongpt: effectively fusing visual perception into llm\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 27124–27133\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p2.1)\.
- \[118\]S\. Pramanick, G\. Han, R\. Hou, S\. Nag, S\. Lim, N\. Ballas, Q\. Wang, R\. Chellappa, and A\. Almahairi\(2024\)Jack of all tasks master of many: designing general\-purpose coarse\-to\-fine vision\-language model\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 14076–14088\.Cited by:[§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p2.1)\.
- \[119\]J\. Qi, M\. Ding, W\. Wang, Y\. Bai, Q\. Lv, W\. Hong, B\. Xu, L\. Hou, J\. Li, Y\. Dong,et al\.\(2025\)Cogcom: a visual language model with chain\-of\-manipulations reasoning\.InICLR,Cited by:[§5\.2\.1](https://arxiv.org/html/2606.26196#S5.SS2.SSS1.p3.1)\.
- \[120\]R\. Qian, X\. Yin, and D\. Dou\(2025\)Reasoning to attend: try to understand how \<seg\> token works\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 24722–24731\.Cited by:[§4\.3\.2](https://arxiv.org/html/2606.26196#S4.SS3.SSS2.p1.2)\.
- \[121\]C\. Qiang, Z\. Wei, X\. Han, Z\. Wang, S\. Li, X\. Lan, J\. Jiao, and Z\. Han\(2025\)VER\-bench: evaluating mllms on reasoning with fine\-grained visual evidence\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 12698–12705\.Cited by:[§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p2.1)\.
- \[122\]J\. Qiu, Y\. Zhang, X\. Tang, L\. Xie, T\. Ma, P\. Yan, D\. Doermann, Q\. Ye, and Y\. Tian\(2024\)Artemis: towards referential understanding in complex videos\.Advances in Neural Information Processing Systems37,pp\. 114321–114347\.Cited by:[§3\.1\.1](https://arxiv.org/html/2606.26196#S3.SS1.SSS1.p1.1)\.
- \[123\]R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn\(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[§6\.2\.2](https://arxiv.org/html/2606.26196#S6.SS2.SSS2.p1.1)\.
- \[124\]M\. A\. L\. Ralph, E\. Jefferies, K\. Patterson, and T\. T\. Rogers\(2017\)The neural and computational bases of semantic cognition\.Nature reviews neuroscience18\(1\),pp\. 42–55\.Cited by:[§1](https://arxiv.org/html/2606.26196#S1.p2.1)\.
- \[125\]K\. Ranasinghe, S\. N\. Shukla, O\. Poursaeed, M\. S\. Ryoo, and T\. Lin\(2024\)Learning to localize objects improves spatial reasoning in visual\-llms\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 12977–12987\.Cited by:[§6\.1](https://arxiv.org/html/2606.26196#S6.SS1.p3.1)\.
- \[126\]H\. Rasheed, M\. Maaz, S\. Shaji, A\. Shaker, S\. Khan, H\. Cholakkal, R\. M\. Anwer, E\. Xing, M\. Yang, and F\. S\. Khan\(2024\)Glamm: pixel grounding large multimodal model\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13009–13018\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p2.1)\.
- \[127\]N\. Ravi, V\. Gabeur, Y\. Hu, R\. Hu, C\. Ryali, T\. Ma, H\. Khedr, R\. Rädle, C\. Rolland, L\. Gustafson,et al\.\(2024\)Sam 2: segment anything in images and videos\.arXiv preprint arXiv:2408\.00714\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p2.2),[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p4.1)\.
- \[128\]S\. Ren, L\. Yao, S\. Li, X\. Sun, and L\. Hou\(2024\)Timechat: a time\-sensitive multimodal large language model for long video understanding\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 14313–14323\.Cited by:[§3\.1\.3](https://arxiv.org/html/2606.26196#S3.SS1.SSS3.p2.1)\.
- \[129\]Z\. Ren, Z\. Huang, Y\. Wei, Y\. Zhao, D\. Fu, J\. Feng, and X\. Jin\(2024\)Pixellm: pixel reasoning with large multimodal model\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 26374–26383\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p5.4)\.
- \[130\]R\. Sapkota and M\. Karkee\(2025\)Object detection with multimodal large vision\-language models: an in\-depth review\.Available at SSRN 5233953\.Cited by:[§1](https://arxiv.org/html/2606.26196#S1.p3.1),[item 2](https://arxiv.org/html/2606.26196#S2.I1.i2.p1.1)\.
- \[131\]J\. Schulman, P\. Moritz, S\. Levine, M\. Jordan, and P\. Abbeel\(2015\)High\-dimensional continuous control using generalized advantage estimation\.arXiv preprint arXiv:1506\.02438\.Cited by:[§7\.2\.3](https://arxiv.org/html/2606.26196#S7.SS2.SSS3.p1.1)\.
- \[132\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§7\.2\.3](https://arxiv.org/html/2606.26196#S7.SS2.SSS3.p1.1)\.
- \[133\]Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p1.1)\.
- \[134\]C\. Shen, W\. Wei, X\. Qu, and Y\. Cheng\(2025\)Satori\-r1: incentivizing multimodal reasoning with spatial grounding and verifiable rewards\.arXiv preprint arXiv:2505\.19094\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[135\]H\. Shen, P\. Liu, J\. Li, C\. Fang, Y\. Ma, J\. Liao, Q\. Shen, Z\. Zhang, K\. Zhao, Q\. Zhang,et al\.\(2025\)Vlm\-r1: a stable and generalizable r1\-style large vision\-language model\.arXiv preprint arXiv:2504\.07615\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p2.1)\.
- \[136\]H\. Shen, K\. Zhao, T\. Zhao, R\. Xu, Z\. Zhang, M\. Zhu, and J\. Yin\(2024\)ZoomEye: enhancing multimodal llms with human\-like zooming capabilities through tree\-based image exploration\.arXiv preprint arXiv:2411\.16044\.Cited by:[§5\.2\.2](https://arxiv.org/html/2606.26196#S5.SS2.SSS2.p1.1)\.
- \[137\]L\. Shen, G\. Chen, R\. Shao, W\. Guan, and L\. Nie\(2024\)Mome: mixture of multimodal experts for generalist multimodal large language models\.Advances in neural information processing systems37,pp\. 42048–42070\.Cited by:[§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p1.1)\.
- \[138\]Y\. Shen, C\. Li, F\. Xiong, J\. Jeong, T\. Wang, M\. Latman, and M\. Unberath\(2025\)Reasoning segmentation for images and videos: a survey\.arXiv preprint arXiv:2505\.18816\.Cited by:[§1](https://arxiv.org/html/2606.26196#S1.p3.1),[item 2](https://arxiv.org/html/2606.26196#S2.I1.i2.p1.1)\.
- \[139\]M\. Shi, F\. Liu, S\. Wang, S\. Liao, S\. Radhakrishnan, Y\. Zhao, D\. Huang, H\. Yin, K\. Sapra, Y\. Yacoob,et al\.\(2024\)Eagle: exploring the design space for multimodal llms with mixture of encoders\.arXiv preprint arXiv:2408\.15998\.Cited by:[§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p1.1)\.
- \[140\]D\. Silver, T\. Hubert, J\. Schrittwieser, I\. Antonoglou, M\. Lai, A\. Guez, M\. Lanctot, L\. Sifre, D\. Kumaran, T\. Graepel,et al\.\(2017\)Mastering chess and shogi by self\-play with a general reinforcement learning algorithm\.arXiv preprint arXiv:1712\.01815\.Cited by:[§7\.3\.1](https://arxiv.org/html/2606.26196#S7.SS3.SSS1.p1.1)\.
- \[141\]A\. Su, H\. Wang, W\. Ren, F\. Lin, and W\. Chen\(2025\)Pixel reasoner: incentivizing pixel\-space reasoning with curiosity\-driven reinforcement learning\.arXiv preprint arXiv:2505\.15966\.Cited by:[§7\.2\.1](https://arxiv.org/html/2606.26196#S7.SS2.SSS1.p1.1),[§7\.3\.1](https://arxiv.org/html/2606.26196#S7.SS3.SSS1.p1.1)\.
- \[142\]Z\. Su, L\. Li, M\. Song, Y\. Hao, Z\. Yang, J\. Zhang, G\. Chen, J\. Gu, J\. Li, X\. Qu,et al\.\(2025\)Openthinkimg: learning to think with images via visual tool reinforcement learning\.arXiv preprint arXiv:2505\.08617\.Cited by:[§7\.2\.2](https://arxiv.org/html/2606.26196#S7.SS2.SSS2.p1.1)\.
- \[143\]D\. Surís, S\. Menon, and C\. Vondrick\(2023\)Vipergpt: visual inference via python execution for reasoning\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 11888–11898\.Cited by:[§5\.3](https://arxiv.org/html/2606.26196#S5.SS3.p2.1)\.
- \[144\]Y\. Tai, L\. Zhu, Z\. Chen, Y\. Ding, Y\. Dong, X\. Liu, and G\. Guo\(2025\)REF\-vlm: triplet\-based referring paradigm for unified visual decoding\.arXiv preprint arXiv:2503\.07413\.Cited by:[§4\.2](https://arxiv.org/html/2606.26196#S4.SS2.p3.1)\.
- \[145\]H\. Tang, C\. Xie, H\. Wang, X\. Bao, T\. Weng, P\. Li, Y\. Zheng, and L\. Wang\(2025\)Ufo: a unified approach to fine\-grained visual perception via open\-ended language interface\.arXiv preprint arXiv:2503\.01342\.Cited by:[§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p2.1)\.
- \[146\]W\. Tang, Y\. Sun, Q\. Gu, and Z\. Li\(2025\)Visual position prompt for mllm based visual grounding\.arXiv preprint arXiv:2503\.15426\.Cited by:[§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p3.1)\.
- \[147\]Y\. Tian, T\. Ma, L\. Xie, J\. Qiu, X\. Tang, Y\. Zhang, J\. Jiao, Q\. Tian, and Q\. Ye\(2024\)Chatterbox: multi\-round multimodal referring and grounding\.arXiv preprint arXiv:2401\.13307\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p3.1)\.
- \[148\]H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023\)Llama: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p2.1)\.
- \[149\]M\. Wahed, K\. A\. Nguyen, A\. S\. Juvekar, X\. Li, X\. Zhou, V\. Shah, T\. Yu, P\. Yanardag, and I\. Lourentzou\(2024\)PRIMA: multi\-image vision\-language models for reasoning segmentation\.arXiv preprint arXiv:2412\.15209\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p5.1)\.
- \[150\]H\. R\. Walke, K\. Black, T\. Z\. Zhao, Q\. Vuong, C\. Zheng, P\. Hansen\-Estruch, A\. W\. He, V\. Myers, M\. J\. Kim, M\. Du,et al\.\(2023\)Bridgedata v2: a dataset for robot learning at scale\.InConference on Robot Learning,pp\. 1723–1736\.Cited by:[§1\.1](https://arxiv.org/html/2606.26196#S1.SS1.p3.1)\.
- \[151\]A\. Wang, B\. Shan, W\. Shi, K\. Lin, X\. Fei, G\. Tang, L\. Liao, J\. Tang, C\. Huang, and W\. Zheng\(2025\)Pargo: bridging vision\-language with partial and global views\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 7491–7499\.Cited by:[§3\.1\.3](https://arxiv.org/html/2606.26196#S3.SS1.SSS3.p1.1)\.
- \[152\]H\. Wang, Z\. Xu, Y\. Cheng, S\. Diao, Y\. Zhou, Y\. Cao, Q\. Wang, W\. Ge, and L\. Huang\(2024\)Grounded\-videollm: sharpening fine\-grained temporal grounding in video large language models\.arXiv preprint arXiv:2410\.03290\.Cited by:[§4\.3\.2](https://arxiv.org/html/2606.26196#S4.SS3.SSS2.p1.2)\.
- \[153\]H\. Wang, Y\. Ye, Y\. Wang, Y\. Nie, and C\. Huang\(2024\)Elysium: exploring object\-level perception in videos via mllm\.InEuropean Conference on Computer Vision,pp\. 166–185\.Cited by:[§3\.1\.3](https://arxiv.org/html/2606.26196#S3.SS1.SSS3.p2.1)\.
- \[154\]H\. Wang, X\. Li, Z\. Huang, A\. Wang, J\. Wang, T\. Zhang, J\. Zheng, S\. Bai, Z\. Kang, J\. Feng,et al\.\(2025\)Traceable evidence enhanced visual grounded reasoning: evaluation and methodology\.arXiv preprint arXiv:2507\.07999\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[155\]J\. Wang, Z\. Kang, H\. Wang, H\. Jiang, J\. Li, B\. Wu, Y\. Wang, J\. Ran, X\. Liang, C\. Feng,et al\.\(2025\)VGR: visual grounded reasoning\.arXiv preprint arXiv:2506\.11991\.Cited by:[§5\.2\.3](https://arxiv.org/html/2606.26196#S5.SS2.SSS3.p2.1)\.
- \[156\]J\. Wang, L\. Yuan, Y\. Zhang, and H\. Sun\(2024\)Tarsier: recipes for training and evaluating large video description models\.arXiv preprint arXiv:2407\.00634\.Cited by:[§1](https://arxiv.org/html/2606.26196#S1.p1.1)\.
- \[157\]J\. Wang and L\. Ke\(2024\)Llm\-seg: bridging image segmentation and large language model reasoning\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 1765–1774\.Cited by:[§4\.3\.2](https://arxiv.org/html/2606.26196#S4.SS3.SSS2.p1.2)\.
- \[158\]L\. Wang, H\. Lin, S\. Chen, T\. Wang, C\. Cheng, Y\. Zhong, D\. Zheng, and W\. Zhao\(2025\)ALTo: adaptive\-length tokenizer for autoregressive mask generation\.arXiv preprint arXiv:2505\.16495\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p4.1)\.
- \[159\]Q\. Wang, R\. Ding, Y\. Zeng, Z\. Chen, L\. Chen, S\. Wang, P\. Xie, F\. Huang, and F\. Zhao\(2025\)VRAG\-rl: empower vision\-perception\-based rag for visually rich information understanding via iterative reasoning with reinforcement learning\.arXiv preprint arXiv:2505\.22019\.Cited by:[§7\.2\.2](https://arxiv.org/html/2606.26196#S7.SS2.SSS2.p1.1)\.
- \[160\]S\. Wang, G\. Fang, L\. Kong, X\. Li, J\. Xu, S\. Yang, Q\. Li, J\. Zhu, and X\. Wang\(2025\)PixelThink: towards efficient chain\-of\-pixel reasoning\.arXiv preprint arXiv:2505\.23727\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p4.1)\.
- \[161\]T\. Wang, C\. Cheng, L\. Wang, S\. Chen, and W\. Zhao\(2025\)Himtok: learning hierarchical mask tokens for image segmentation with large multimodal model\.arXiv preprint arXiv:2503\.13026\.Cited by:[§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p2.1)\.
- \[162\]T\. Wang, Z\. Jiang, Z\. He, S\. Tong, W\. Yang, Y\. Zheng, Z\. Li, Z\. He, and H\. Gong\(2025\)Towards hierarchical multi\-step reward models for enhanced reasoning in large language models\.arXiv preprint arXiv:2503\.13551\.Cited by:[item 3](https://arxiv.org/html/2606.26196#S2.I1.i3.p1.1),[§7\.3\.2](https://arxiv.org/html/2606.26196#S7.SS3.SSS2.p1.1)\.
- \[163\]W\. Wang, Y\. Ren, H\. Luo, T\. Li, C\. Yan, Z\. Chen, W\. Wang, Q\. Li, L\. Lu, X\. Zhu,et al\.\(2024\)The all\-seeing project v2: towards general relation comprehension of the open world\.InEuropean Conference on Computer Vision,pp\. 471–490\.Cited by:[§3\.1\.1](https://arxiv.org/html/2606.26196#S3.SS1.SSS1.p1.1),[§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p1.1)\.
- \[164\]X\. Wang, X\. Zhang, Z\. Luo, Q\. Sun, Y\. Cui, J\. Wang, F\. Zhang, Y\. Wang, Z\. Li, Q\. Yu,et al\.\(2024\)Emu3: next\-token prediction is all you need\.arXiv preprint arXiv:2409\.18869\.Cited by:[§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p4.1)\.
- \[165\]X\. Wang, S\. Zhang, S\. Li, K\. Kallidromitis, K\. Li, Y\. Kato, K\. Kozuka, and T\. Darrell\(2024\)SegLLM: multi\-round reasoning segmentation\.arXiv preprint arXiv:2410\.18923\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p4.1)\.
- \[166\]Y\. Wang, S\. Wu, Y\. Zhang, S\. Yan, Z\. Liu, J\. Luo, and H\. Fei\(2025\)Multimodal chain\-of\-thought reasoning: a comprehensive survey\.arXiv preprint arXiv:2503\.12605\.Cited by:[§1](https://arxiv.org/html/2606.26196#S1.p3.1)\.
- \[167\]Y\. Wang, Y\. Zang, H\. Li, C\. Jin, and J\. Wang\(2025\)Unified reward model for multimodal understanding and generation\.arXiv preprint arXiv:2503\.05236\.Cited by:[§7\.3\.2](https://arxiv.org/html/2606.26196#S7.SS3.SSS2.p1.1)\.
- \[168\]C\. Wei, H\. Tan, Y\. Zhong, Y\. Yang, and L\. Ma\(2024\)Lasagna: language\-based segmentation assistant for complex queries\.arXiv preprint arXiv:2404\.08506\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p5.4)\.
- \[169\]C\. Wei, Y\. Zhong, H\. Tan, Y\. Liu, J\. Hu, D\. Li, Z\. Zhao, and Y\. Yang\(2025\)HyperSeg: hybrid segmentation assistant with fine\-grained visual perceiver\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 8931–8941\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p3.1)\.
- \[170\]C\. Wei, Y\. Zhong, H\. Tan, Y\. Zeng, Y\. Liu, Z\. Zhao, and Y\. Yang\(2024\)InstructSeg: unifying instructed visual segmentation with multi\-modal large language models\.arXiv preprint arXiv:2412\.14006\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p2.2)\.
- \[171\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§5\.1\.1](https://arxiv.org/html/2606.26196#S5.SS1.SSS1.p1.1)\.
- \[172\]Y\. Wei, L\. Zhao, J\. Sun, K\. Lin, J\. Yin, J\. Hu, Y\. Zhang, E\. Yu, H\. Lv, Z\. Weng,et al\.\(2025\)Open vision reasoner: transferring linguistic cognitive behavior for visual reasoning\.arXiv preprint arXiv:2507\.05255\.Cited by:[§7\.2\.3](https://arxiv.org/html/2606.26196#S7.SS2.SSS3.p1.1),[§7\.3\.1](https://arxiv.org/html/2606.26196#S7.SS3.SSS1.p1.1)\.
- \[173\]J\. Wu, M\. Zhong, S\. Xing, Z\. Lai, Z\. Liu, Z\. Chen, W\. Wang, X\. Zhu, L\. Lu, T\. Lu,et al\.\(2024\)Visionllm v2: an end\-to\-end generalist multimodal large language model for hundreds of vision\-language tasks\.Advances in Neural Information Processing Systems37,pp\. 69925–69975\.Cited by:[§4\.2](https://arxiv.org/html/2606.26196#S4.SS2.p3.1)\.
- \[174\]J\. Wu, Y\. Xiong, X\. Li, Y\. Xia, R\. Wang, Y\. Wang, T\. Yu, S\. Kim, R\. A\. Rossi, L\. Yao,et al\.\(2025\)Mitigating visual knowledge forgetting in mllm instruction\-tuning via modality\-decoupled gradient descent\.arXiv preprint arXiv:2502\.11740\.Cited by:[§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p3.1)\.
- \[175\]P\. Wu and S\. Xie\(2024\)V\*: guided visual search as a core mechanism in multimodal llms\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13084–13094\.Cited by:[§1](https://arxiv.org/html/2606.26196#S1.p2.1),[§5\.2\.1](https://arxiv.org/html/2606.26196#S5.SS2.SSS1.p2.1)\.
- \[176\]T\. Wu, G\. Biamby, D\. Chan, L\. Dunlap, R\. Gupta, X\. Wang, J\. E\. Gonzalez, and T\. Darrell\(2024\)See say and segment: teaching lmms to overcome false premises\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 13459–13469\.Cited by:[§4\.3\.2](https://arxiv.org/html/2606.26196#S4.SS3.SSS2.p1.2)\.
- \[177\]Y\. Wu, Y\. Wang, S\. Tang, W\. Wu, T\. He, W\. Ouyang, P\. Torr, and J\. Wu\(2024\)Dettoolchain: a new prompting paradigm to unleash detection ability of mllm\.InEuropean Conference on Computer Vision,pp\. 164–182\.Cited by:[§5\.1\.1](https://arxiv.org/html/2606.26196#S5.SS1.SSS1.p2.1)\.
- \[178\]Z\. Xia, D\. Han, Y\. Han, X\. Pan, S\. Song, and G\. Huang\(2024\)Gsva: generalized segmentation via multimodal large language models\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 3858–3869\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p5.4)\.
- \[179\]T\. Xiong, X\. Wang, D\. Guo, Q\. Ye, H\. Fan, Q\. Gu, H\. Huang, and C\. Li\(2025\)Llava\-critic: learning to evaluate multimodal models\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 13618–13628\.Cited by:[§7\.3\.1](https://arxiv.org/html/2606.26196#S7.SS3.SSS1.p1.1)\.
- \[180\]J\. Xu, L\. Xu, Y\. Yang, X\. Li, F\. Wang, Y\. Xie, Y\. Huang, and Y\. Li\(2024\)U\-llava: unifying multi\-modal tasks via large language model\.InECAI 2024,pp\. 618–625\.Cited by:[§4\.2](https://arxiv.org/html/2606.26196#S4.SS2.p2.1)\.
- \[181\]Z\. Xu, X\. Zhang, X\. Zhou, and J\. Zhang\(2025\)AvatarShield: visual reinforcement learning for human\-centric video forgery detection\.arXiv preprint arXiv:2505\.15173\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p6.1)\.
- \[182\]S\. Xuan, Q\. Guo, M\. Yang, and S\. Zhang\(2024\)Pink: unveiling the power of referential comprehension for multi\-modal llms\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 13838–13848\.Cited by:[§6\.1](https://arxiv.org/html/2606.26196#S6.SS1.p2.1)\.
- \[183\]C\. Yan, H\. Wang, S\. Yan, X\. Jiang, Y\. Hu, G\. Kang, W\. Xie, and E\. Gavves\(2024\)Visa: reasoning video object segmentation via large language models\.InEuropean Conference on Computer Vision,pp\. 98–115\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p2.2)\.
- \[184\]Z\. Yan, Z\. Li, Y\. He, C\. Wang, K\. Li, X\. Li, X\. Zeng, Z\. Wang, Y\. Wang, Y\. Qiao,et al\.\(2025\)Task preference optimization: improving multimodal large language models with vision task alignment\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 29880–29892\.Cited by:[§6\.2\.2](https://arxiv.org/html/2606.26196#S6.SS2.SSS2.p1.1)\.
- \[185\]S\. Yang, J\. Li, X\. Lai, B\. Yu, H\. Zhao, and J\. Jia\(2025\)VisionThink: smart and efficient vision language model via reinforcement learning\.arXiv preprint arXiv:2507\.13348\.Cited by:[§7\.2\.3](https://arxiv.org/html/2606.26196#S7.SS2.SSS3.p1.1)\.
- \[186\]S\. Yang, T\. Qu, X\. Lai, Z\. Tian, B\. Peng, S\. Liu, and J\. Jia\(2023\)LISA\+\+: an improved baseline for reasoning segmentation with large language model\.arXiv preprint arXiv:2312\.17240\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p4.1)\.
- \[187\]S\. Yang, Y\. Niu, Y\. Liu, Y\. Ye, B\. Lin, and L\. Yuan\(2025\)Look\-back: implicit visual re\-focusing in mllm reasoning\.arXiv preprint arXiv:2507\.03019\.Cited by:[§7\.2\.1](https://arxiv.org/html/2606.26196#S7.SS2.SSS1.p1.1)\.
- \[188\]Y\. Yang, P\. Jiang, J\. Wang, H\. Zhang, K\. Zhao, J\. Chen, and B\. Li\(2024\)Empowering segmentation ability to multi\-modal large language models\.arXiv preprint arXiv:2403\.14141\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p4.1)\.
- \[189\]Z\. Yao, X\. Cheng, Z\. Huang, and L\. Li\(2025\)Countllm: towards generalizable repetitive action counting via large language model\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 19143–19153\.Cited by:[§3\.1\.3](https://arxiv.org/html/2606.26196#S3.SS1.SSS3.p2.1)\.
- \[190\]H\. Yin, Y\. Ren, K\. Yan, S\. Ding, and Y\. Hao\(2025\)ROD\-mllm: towards more reliable object detection in multimodal large language models\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 14358–14368\.Cited by:[§3\.2\.1](https://arxiv.org/html/2606.26196#S3.SS2.SSS1.p2.1)\.
- \[191\]H\. You, H\. Zhang, Z\. Gan, X\. Du, B\. Zhang, Z\. Wang, L\. Cao, S\. Chang, and Y\. Yang\(2023\)Ferret: refer and ground anything anywhere at any granularity\.arXiv preprint arXiv:2310\.07704\.Cited by:[§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p3.1)\.
- \[192\]Z\. You and Z\. Wu\(2025\)Seg\-r1: segmentation can be surprisingly simple with reinforcement learning\.arXiv preprint arXiv:2506\.22624\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p4.1)\.
- \[193\]E\. Yu, K\. Lin, L\. Zhao, J\. Yin, Y\. Wei, Y\. Peng, H\. Wei, J\. Sun, C\. Han, Z\. Ge,et al\.\(2025\)Perception\-r1: pioneering perception policy with reinforcement learning\.arXiv preprint arXiv:2504\.07954\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[194\]L\. Yu, P\. Poirson, S\. Yang, A\. C\. Berg, and T\. L\. Berg\(2016\)Modeling context in referring expressions\.InEuropean conference on computer vision,pp\. 69–85\.Cited by:[§1](https://arxiv.org/html/2606.26196#S1.p1.1),[§1](https://arxiv.org/html/2606.26196#S1.p2.1)\.
- \[195\]Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu,et al\.\(2025\)Dapo: an open\-source llm reinforcement learning system at scale\.arXiv preprint arXiv:2503\.14476\.Cited by:[§7\.3\.2](https://arxiv.org/html/2606.26196#S7.SS3.SSS2.p1.1)\.
- \[196\]R\. Yu, X\. Ma, and X\. Wang\(2025\)Introducing visual perception token into multimodal large language model\.arXiv preprint arXiv:2502\.17425\.Cited by:[§5\.2\.3](https://arxiv.org/html/2606.26196#S5.SS2.SSS3.p1.1)\.
- \[197\]X\. Yu, Y\. Xin, W\. Zhang, C\. Liu, H\. Zhao, X\. Hu, X\. Yu, Z\. Qiao, H\. Tang, X\. Yang,et al\.\(2026\)Modality gap\-driven subspace alignment training paradigm for multimodal large language models\.arXiv preprint arXiv:2602\.07026\.Cited by:[§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p4.1)\.
- \[198\]X\. Yu, D\. Guan, M\. Y\. Yang, and Y\. Gu\(2025\)Zoom\-refine: boosting high\-resolution multimodal understanding via localized zoom and self\-refinement\.arXiv preprint arXiv:2506\.01663\.Cited by:[§5\.2\.2](https://arxiv.org/html/2606.26196#S5.SS2.SSS2.p1.1)\.
- \[199\]H\. Yuan, X\. Li, T\. Zhang, Z\. Huang, S\. Xu, S\. Ji, Y\. Tong, L\. Qi, J\. Feng, and M\. Yang\(2025\)Sa2VA: marrying sam2 with llava for dense grounded understanding of images and videos\.arXiv preprint arXiv:2501\.04001\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p3.1)\.
- \[200\]X\. Yuan, L\. Zhou, Z\. Sun, Z\. Zhou, and J\. Lan\(2025\)Instruction\-guided multi\-granularity segmentation and captioning with large multimodal model\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 9725–9733\.Cited by:[§4\.1\.3](https://arxiv.org/html/2606.26196#S4.SS1.SSS3.p1.3)\.
- \[201\]Y\. Yuan, W\. Li, J\. Liu, D\. Tang, X\. Luo, C\. Qin, L\. Zhang, and J\. Zhu\(2024\)Osprey: pixel understanding with visual instruction tuning\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 28202–28211\.Cited by:[§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p2.1)\.
- \[202\]Y\. Zhai, S\. Tong, X\. Li, M\. Cai, Q\. Qu, Y\. J\. Lee, and Y\. Ma\(2023\)Investigating the catastrophic forgetting in multimodal large language models\.arXiv preprint arXiv:2309\.10313\.Cited by:[§7\.3](https://arxiv.org/html/2606.26196#S7.SS3.p3.1)\.
- \[203\]Y\. Zhan, Z\. Wu, Y\. Zhu, R\. Xue, R\. Luo, Z\. Chen, C\. Zhang, Y\. Li, Z\. He, Z\. Yang,et al\.\(2025\)GThinker: towards general multimodal reasoning via cue\-guided rethinking\.arXiv preprint arXiv:2506\.01078\.Cited by:[§7\.2\.3](https://arxiv.org/html/2606.26196#S7.SS2.SSS3.p1.1),[§7\.3\.1](https://arxiv.org/html/2606.26196#S7.SS3.SSS1.p1.1),[§7\.3\.3](https://arxiv.org/html/2606.26196#S7.SS3.SSS3.p1.1)\.
- \[204\]Y\. Zhan, H\. Zhao, Y\. Zhu, S\. Zheng, F\. Yang, M\. Tang, and J\. Wang\(2025\)Understand, think, and answer: advancing visual reasoning with large multimodal models\.arXiv preprint arXiv:2505\.20753\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[205\]Y\. Zhan, Y\. Zhu, S\. Zheng, H\. Zhao, F\. Yang, M\. Tang, and J\. Wang\(2025\)Vision\-r1: evolving human\-free alignment in large vision\-language models via vision\-guided reinforcement learning\.arXiv preprint arXiv:2503\.18013\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[206\]A\. Zhang, Y\. Yao, W\. Ji, Z\. Liu, and T\. Chua\(2023\)Next\-chat: an lmm for chat, detection and segmentation\.arXiv preprint arXiv:2311\.04498\.Cited by:[§4\.2](https://arxiv.org/html/2606.26196#S4.SS2.p2.1)\.
- \[207\]B\. Zhang, H\. Li, T\. Zhang, C\. Yan, J\. Cai, X\. Jiang, and Y\. Hao\(2025\)Improving the reasoning of multi\-image grounding in mllms via reinforcement learning\.arXiv preprint arXiv:2507\.00748\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p3.1)\.
- \[208\]D\. Zhang, Y\. Yu, J\. Dong, C\. Li, D\. Su, C\. Chu, and D\. Yu\(2024\)Mm\-llms: recent advances in multimodal large language models\.arXiv preprint arXiv:2401\.13601\.Cited by:[item 1](https://arxiv.org/html/2606.26196#S2.I1.i1.p1.1)\.
- \[209\]G\. Zhang, T\. Zhong, Y\. Xia, Z\. Yu, H\. Li, W\. He, F\. Shu, M\. Liu, D\. She, Y\. Wang,et al\.\(2025\)Cmmcot: enhancing complex multi\-image comprehension via multi\-modal chain\-of\-thought and memory augmentation\.arXiv preprint arXiv:2503\.05255\.Cited by:[§5\.2\.1](https://arxiv.org/html/2606.26196#S5.SS2.SSS1.p4.1)\.
- \[210\]H\. Zhang, F\. Li, S\. Liu, L\. Zhang, H\. Su, J\. Zhu, L\. M\. Ni, and H\. Shum\(2022\)Dino: detr with improved denoising anchor boxes for end\-to\-end object detection\.arXiv preprint arXiv:2203\.03605\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p3.1)\.
- \[211\]H\. Zhang, H\. Li, F\. Li, T\. Ren, X\. Zou, S\. Liu, S\. Huang, J\. Gao, L\. Zhang, C\. Li,et al\.\(2024\)Llava\-grounding: grounded visual chat with large multimodal models\.InEuropean Conference on Computer Vision,pp\. 19–35\.Cited by:[§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p1.1)\.
- \[212\]S\. Zhang, P\. Sun, S\. Chen, M\. Xiao, W\. Shao, W\. Zhang, Y\. Liu, K\. Chen, and P\. Luo\(2025\)Gpt4roi: instruction tuning large language model on region\-of\-interest\.InEuropean Conference on Computer Vision,pp\. 52–70\.Cited by:[§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p4.1)\.
- \[213\]T\. Zhang, X\. Li, H\. Fei, H\. Yuan, S\. Wu, S\. Ji, C\. C\. Loy, and S\. Yan\(2024\)Omg\-llava: bridging image\-level, object\-level, pixel\-level reasoning and understanding\.Advances in Neural Information Processing Systems37,pp\. 71737–71767\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p2.1)\.
- \[214\]X\. Zhang, Z\. Gao, B\. Zhang, P\. Li, X\. Zhang, Y\. Liu, T\. Yuan, Y\. Wu, Y\. Jia, S\. Zhu,et al\.\(2025\)Chain\-of\-focus: adaptive visual search and zooming for multimodal reasoning via rl\.arXiv preprint arXiv:2505\.15436\.Cited by:[§7\.2\.1](https://arxiv.org/html/2606.26196#S7.SS2.SSS1.p1.1),[§7\.3\.1](https://arxiv.org/html/2606.26196#S7.SS3.SSS1.p1.1)\.
- \[215\]Y\. Zhang, Z\. Ma, X\. Gao, S\. Shakiah, Q\. Gao, and J\. Chai\(2024\)Groundhog: grounding large language models to holistic segmentation\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 14227–14238\.Cited by:[§3\.1\.3](https://arxiv.org/html/2606.26196#S3.SS1.SSS3.p2.1)\.
- \[216\]Y\. Zhang, Y\. Cao, X\. Xu, and W\. Shen\(2024\)Logicode: an llm\-driven framework for logical anomaly detection\.IEEE Transactions on Automation Science and Engineering\.Cited by:[§5\.3](https://arxiv.org/html/2606.26196#S5.SS3.p3.1)\.
- \[217\]Y\. Zhang, X\. Huang, J\. Ma, Z\. Li, Z\. Luo, Y\. Xie, Y\. Qin, T\. Luo, Y\. Li, S\. Liu,et al\.\(2024\)Recognize anything: a strong image tagging model\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 1724–1732\.Cited by:[§3\.1\.1](https://arxiv.org/html/2606.26196#S3.SS1.SSS1.p1.1),[§3\.1\.2](https://arxiv.org/html/2606.26196#S3.SS1.SSS2.p5.1),[§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p1.1)\.
- \[218\]Z\. Zhang, R\. Rossi, T\. Yu, F\. Dernoncourt, R\. Zhang, J\. Gu, S\. Kim, X\. Chen, Z\. Wang, and N\. Lipka\(2024\)VipAct: visual\-perception enhancement via specialized vlm agent collaboration and tool\-use\.arXiv preprint arXiv:2410\.16400\.Cited by:[§5\.1\.2](https://arxiv.org/html/2606.26196#S5.SS1.SSS2.p1.1)\.
- \[219\]Z\. Zhang, Y\. Ma, E\. Zhang, and X\. Bai\(2024\)Psalm: pixelwise segmentation with large multi\-modal model\.InEuropean Conference on Computer Vision,pp\. 74–91\.Cited by:[§4\.1\.1](https://arxiv.org/html/2606.26196#S4.SS1.SSS1.p3.1)\.
- \[220\]S\. Zhao, Y\. Lin, L\. Han, Y\. Zhao, and Y\. Wei\(2025\)OmniAD: detect and understand industrial anomaly via multimodal reasoning\.arXiv preprint arXiv:2505\.22039\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p5.1)\.
- \[221\]S\. Zhao, H\. Zhang, S\. Lin, M\. Li, Q\. Wu, K\. Zhang, and C\. Wei\(2025\)PyVision: agentic vision with dynamic tooling\.arXiv preprint arXiv:2507\.07998\.Cited by:[§5\.3](https://arxiv.org/html/2606.26196#S5.SS3.p2.1)\.
- \[222\]X\. Zhao, X\. Li, H\. Duan, H\. Huang, Y\. Li, K\. Chen, and H\. Yang\(2024\)Mg\-llava: towards multi\-granularity visual instruction tuning\.arXiv preprint arXiv:2406\.17770\.Cited by:[§3\.1\.1](https://arxiv.org/html/2606.26196#S3.SS1.SSS1.p1.1)\.
- \[223\]Y\. Zhao, Z\. Lin, D\. Zhou, Z\. Huang, J\. Feng, and B\. Kang\(2023\)Bubogpt: enabling visual grounding in multi\-modal llms\.arXiv preprint arXiv:2307\.08581\.Cited by:[§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p1.1)\.
- \[224\]C\. Zheng, S\. Liu, M\. Li, X\. Chen, B\. Yu, C\. Gao, K\. Dang, Y\. Liu, R\. Men, A\. Yang,et al\.\(2025\)Group sequence policy optimization\.arXiv preprint arXiv:2507\.18071\.Cited by:[§7\.3\.2](https://arxiv.org/html/2606.26196#S7.SS3.SSS2.p1.1),[§7\.3\.3](https://arxiv.org/html/2606.26196#S7.SS3.SSS3.p1.1)\.
- \[225\]R\. Zheng, L\. Qi, X\. Chen, Y\. Wang, K\. Wang, Y\. Qiao, and H\. Zhao\(2024\)ViLLa: video reasoning segmentation with large language model\.arXiv preprint arXiv:2407\.14500\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p2.2)\.
- \[226\]Z\. Zheng, M\. Yang, J\. Hong, C\. Zhao, G\. Xu, L\. Yang, C\. Shen, and X\. Yu\(2025\)DeepEyes: incentivizing “thinking with images” via reinforcement learning\.arXiv preprint arXiv:2505\.14362\.Cited by:[§7\.2\.1](https://arxiv.org/html/2606.26196#S7.SS2.SSS1.p1.1),[§7\.3\.1](https://arxiv.org/html/2606.26196#S7.SS3.SSS1.p1.1),[§7\.3\.3](https://arxiv.org/html/2606.26196#S7.SS3.SSS3.p1.1)\.
- \[227\]C\. Zhou, M\. Wang, Y\. Ma, C\. Wu, W\. Chen, Z\. Qian, X\. Liu, Y\. Zhang, J\. Wang, H\. Xu,et al\.\(2025\)From perception to cognition: a survey of vision\-language interactive reasoning in multimodal large language models\.arXiv preprint arXiv:2509\.25373\.Cited by:[item 3](https://arxiv.org/html/2606.26196#S2.I1.i3.p1.1)\.
- \[228\]Q\. Zhou, R\. Zhou, Z\. Hu, P\. Lu, S\. Gao, and Y\. Zhang\(2024\)Image\-of\-thought prompting for visual reasoning refinement in multimodal large language models\.arXiv preprint arXiv:2405\.13872\.Cited by:[§5\.1\.2](https://arxiv.org/html/2606.26196#S5.SS1.SSS2.p1.1)\.
- \[229\]J\. Zhu, Z\. Cheng, J\. He, C\. Li, B\. Luo, H\. Lu, Y\. Geng, and X\. Xie\(2023\)Tracking with human\-intent reasoning\.arXiv preprint arXiv:2312\.17448\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.26196#S4.SS1.SSS2.p4.3)\.
- \[230\]L\. Zhu, T\. Chen, D\. Ji, J\. Ye, and J\. Liu\(2024\)Llafs: when large language models meet few\-shot segmentation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 3065–3075\.Cited by:[§4\.3\.1](https://arxiv.org/html/2606.26196#S4.SS3.SSS1.p2.1)\.
- \[231\]L\. Zhu, T\. Chen, Q\. Xu, X\. Liu, D\. Ji, H\. Wu, D\. W\. Soh, and J\. Liu\(2025\)Popen: preference\-based optimization and ensemble for lvlm\-based reasoning segmentation\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 30231–30240\.Cited by:[§6\.2\.2](https://arxiv.org/html/2606.26196#S6.SS2.SSS2.p1.1)\.
- \[232\]L\. Zhu, Q\. Chen, X\. Shen, and X\. Cun\(2025\)VAU\-r1: advancing video anomaly understanding via reinforcement fine\-tuning\.arXiv preprint arXiv:2505\.23504\.Cited by:[§6\.2\.1](https://arxiv.org/html/2606.26196#S6.SS2.SSS1.p6.1)\.
- \[233\]M\. Zhu, Y\. Tian, H\. Chen, C\. Zhou, Q\. Guo, Y\. Liu, M\. Yang, and C\. Shen\(2025\)Segagent: exploring pixel understanding capabilities in mllms by imitating human annotator trajectories\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 3686–3696\.Cited by:[§6\.2\.2](https://arxiv.org/html/2606.26196#S6.SS2.SSS2.p1.1)\.
- \[234\]M\. Zhu, H\. Zhong, C\. Zhao, Z\. Du, Z\. Huang, M\. Liu, H\. Chen, C\. Zou, J\. Chen, M\. Yang,et al\.\(2025\)Active\-o3: empowering multimodal large language models with active perception via grpo\.arXiv preprint arXiv:2505\.21457\.Cited by:[§7\.2\.1](https://arxiv.org/html/2606.26196#S7.SS2.SSS1.p1.1),[§7\.3\.1](https://arxiv.org/html/2606.26196#S7.SS3.SSS1.p1.1)\.
- \[235\]X\. Zhu, W\. Su, L\. Lu, B\. Li, X\. Wang, and J\. Dai\(2020\)Deformable detr: deformable transformers for end\-to\-end object detection\.arXiv preprint arXiv:2010\.04159\.Cited by:[§3\.1\.1](https://arxiv.org/html/2606.26196#S3.SS1.SSS1.p1.1)\.
- \[236\]Z\. Zhu, L\. Zhao, K\. Lin, J\. Yang, E\. Yu, C\. Liu, H\. Wei, J\. Sun, Z\. Ge, and X\. Zhang\(2025\)PerPO: perceptual preference optimization via discriminative rewarding\.arXiv preprint arXiv:2502\.04371\.Cited by:[§6\.2\.2](https://arxiv.org/html/2606.26196#S6.SS2.SSS2.p1.1)\.