CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection

arXiv cs.AI Papers

Summary

Proposes the CORE framework that endows multimodal large language models with explicit conflict-capturing capability for generalizable manipulation detection, adapting to unseen manipulation types with few or zero samples.

arXiv:2606.03066v1 Announce Type: new Abstract: The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust and social stability. Existing detection methods rely heavily on manipulation-specific models and large-scale labeled data, resulting in poor generalization to emerging manipulation types. We observed that the essence of manipulated misinformation lies in its intrinsic conflicts, i.e., semantic or physical inconsistencies either across modalities or with common world knowledge. Inspired by this observation, we propose the Conflict-Oriented REasoning (CORE) framework, an effective paradigm that learns to endow multimodal large language models (MLLMs) with explicit conflict-capturing capability. To this end, CORE first constructs the Conflict Attribution Corpus (CAC) with fine-grained annotations of conflict factors and sources, providing essential data support for subsequent conflict perception training. By performing conflict-oriented representation enhancement and reasoning based on CAC, CORE achieves robust and generalizable conflict detection, effectively and rapidly adapting to unseen manipulation types with a few samples or even in zero-shot settings. Extensive experiments demonstrate that CORE surpasses state-of-the-art models. The dataset and code are publicly available at https://github.com/shen8424/CORE.

Cached at: 06/03/26, 09:42 AM

# Conflict-Oriented Reasoning for General Multimodal Manipulation Detection
Source: [https://arxiv.org/html/2606.03066](https://arxiv.org/html/2606.03066)
Yaxiong Wang, Yujiao Wu, Lechao Cheng, Tianrui Hui, Nan Pu, Zhihui Li, Zhun Zhong

###### Abstract

The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust and social stability\. Existing detection methods rely heavily on manipulation\-specific models and large\-scale labeled data, resulting in poor generalization to emerging manipulation types\. We observed that the essence of manipulated misinformation lies in its intrinsic conflicts, i\.e\., semantic or physical inconsistencies either across modalities or with common world knowledge\. Inspired by this observation, we propose the Conflict\-Oriented REasoning \(CORE\) framework, an effective paradigm that learns to endow multimodal large language models \(MLLMs\) with explicit conflict\-capturing capability\. To this end, CORE first constructs the Conflict Attribution Corpus \(CAC\) with fine\-grained annotations of conflict factors and sources, providing essential data support for subsequent conflict perception training\. By performing conflict\-oriented representation enhancement and reasoning based on CAC, CORE achieves robust and generalizable conflict detection, effectively and rapidly adapting to unseen manipulation types with a few samples or even in zero\-shot settings\. Extensive experiments demonstrate that CORE surpasses state\-of\-the\-art models\. The dataset and code are publicly available at [https://github\.com/shen8424/CORE](https://github.com/shen8424/CORE)\.

Forgery Detection, Multimodal, Conflict Reasoning, MLLM, Manipulation Detection


## 1 Introduction

The rapid advancement of generative artificial intelligence is profoundly impacting multiple domains \(Haydarov et al\., [2024b](https://arxiv.org/html/2606.03066#bib.bib7); Li et al\., [2024](https://arxiv.org/html/2606.03066#bib.bib8); Abdelnabi et al\., [2022](https://arxiv.org/html/2606.03066#bib.bib9); Jiang et al\., [2020a](https://arxiv.org/html/2606.03066#bib.bib21)\), deeply blurring the boundary between reality and fiction\. On social networks, malicious actors can now create highly convincing multimodal fake news, combining manipulated images with deceptive text, at an unprecedented scale and speed \(Yu et al\., [2024](https://arxiv.org/html/2606.03066#bib.bib11); Haydarov et al\., [2024a](https://arxiv.org/html/2606.03066#bib.bib12); Jiang et al\., [2020b](https://arxiv.org/html/2606.03066#bib.bib13); Li et al\., [2020a](https://arxiv.org/html/2606.03066#bib.bib14); Lu et al\., [2023](https://arxiv.org/html/2606.03066#bib.bib10)\)\. These forgeries, ranging from subtle edits of facial attributes to entirely fabricated scenes, pose a serious threat to public trust and social stability \(Lu et al\., [2023](https://arxiv.org/html/2606.03066#bib.bib10); Li et al\., [2020b](https://arxiv.org/html/2606.03066#bib.bib22); Shao et al\., [2022](https://arxiv.org/html/2606.03066#bib.bib23)\)\. As manual verification becomes increasingly difficult, the development of robust automated detection systems is more critical than ever\.

![Refer to caption](https://arxiv.org/html/2606.03066v1/x1.png)Figure 1: While previous methods require extensive data and specialized designs for specific manipulations, they struggle with new types\. Our CORE addresses the core “conflict” in fake news, enabling generalized detection and excellent performance with minimal data\. “Mani\.” means “Manipulation”\.

In response to these challenges, various manipulation detection methods have been developed \(Shao et al\., [2023](https://arxiv.org/html/2606.03066#bib.bib17); Shen et al\., [2025](https://arxiv.org/html/2606.03066#bib.bib19); Zhang et al\., [2025a](https://arxiv.org/html/2606.03066#bib.bib20); Liu et al\., [2024](https://arxiv.org/html/2606.03066#bib.bib24); Shao et al\., [2024](https://arxiv.org/html/2606.03066#bib.bib26); Bei et al\., [2024](https://arxiv.org/html/2606.03066#bib.bib25)\), successfully alleviating the rampant spread of multimodal fake news\. However, the success of these methods is predicated on designing models and training paradigms tailored to specific manipulation types and on large\-scale, type\-specific training data\. In practice, there is a continuous “arms race” between forgery techniques \(Chen et al\., [2020](https://arxiv.org/html/2606.03066#bib.bib49); Wang et al\., [2022a](https://arxiv.org/html/2606.03066#bib.bib51); Patashnik et al\., [2021](https://arxiv.org/html/2606.03066#bib.bib52); Gao et al\., [2021](https://arxiv.org/html/2606.03066#bib.bib50)\) and detection methods\. The evolutionary pace of new manipulation methods far outpaces both the cycle of data collection, cleaning, and annotation, and the targeted model design required for each novel type\. As a result, the performance of current methods degrades significantly when encountering new manipulation patterns \(Zhang et al\., [2025c](https://arxiv.org/html/2606.03066#bib.bib56), [2026](https://arxiv.org/html/2606.03066#bib.bib57); Lian et al\., [2026](https://arxiv.org/html/2606.03066#bib.bib58)\)\. Therefore, the field urgently requires a new paradigm that can move beyond dependencies on data and specific model designs, enabling models to adapt effectively to novel manipulation types with only a few samples \(Brown et al\., [2020](https://arxiv.org/html/2606.03066#bib.bib53); Wang et al\., [2022b](https://arxiv.org/html/2606.03066#bib.bib54); Madaan et al\., [2022](https://arxiv.org/html/2606.03066#bib.bib55)\)\.

We observed that the essence of manipulated information lies in its intrinsic “conflict”\. This conflict can manifest as a semantic contradiction between the content and world knowledge, such as the common\-sense conflict between Trump’s presidential status and a football award in the news “Donald Trump wins the football award”; or as a conflict at the physical level, such as lighting and shadows, between the manipulated content and the original image/text\. As shown in Figure [1](https://arxiv.org/html/2606.03066#S1.F1), existing methods implicitly capture such conflict with massive training data and specialized model designs, but this over\-reliance leads to overfitting towards specific manipulation patterns, resulting in poor generalization to new types\. In contrast, humans detect deception by activating their knowledge and performing conflict reasoning, enabling robust judgment across diverse manipulation forms\. Motivated by this consideration, we argue that if a detection model is endowed with explicit conflict\-capturing capability, it can emulate human\-like robustness when facing novel manipulation scenarios, thereby alleviating the long\-standing data dependence and design rigidity of current approaches\.

Following the human reasoning process in detecting manipulations in multimodal misinformation, the ability to capture multimodal conflicts largely depends on a model’s understanding of real\-world knowledge\. Multimodal Large Language Models \(MLLMs\) \(Bai et al\., [2025](https://arxiv.org/html/2606.03066#bib.bib27); Team, [2025b](https://arxiv.org/html/2606.03066#bib.bib28), [2024b](https://arxiv.org/html/2606.03066#bib.bib29); Guo et al\., [2025](https://arxiv.org/html/2606.03066#bib.bib30)\), trained on vast multimodal corpora, inherently encode rich world knowledge and thus exhibit strong potential for identifying conflicts in multimodal manipulations\. However, they still fall short in conflict capturing due to a lack of conceptual understanding\. MLLMs often conflate entirely unrelated concepts in the feature space, such as “U\.S\. President” and “Football Award” \(Sec\. [3](https://arxiv.org/html/2606.03066#S3), Figure 2\(a\)\)\. Owing to this weakness, existing MLLMs, despite their rich world knowledge, still struggle to achieve robust and generalizable misinformation detection \(Table [2](https://arxiv.org/html/2606.03066#S4.T2)\)\.

To overcome the aforementioned limitations and establish a foundation model for general multimodal manipulation detection, we propose the Conflict\-Oriented REasoning \(CORE\) framework\. This framework equips MLLMs with explicit conceptual understanding, thereby enabling conflict detection capabilities\. Training this capability requires explicit, fine\-grained conflict supervision, which existing datasets lack\. To provide this necessary data support, we first construct the Conflict Attribution Corpus \(CAC\)\. Each sample in CAC is annotated with both a conflict factor revealing the specific contradictory content within the misinformation and a conflict source, indicating whether the contradiction arises from the text, image, or underlying world knowledge\. With these fine\-grained annotations, we perform a Conflict\-Perception Training \(CPT\) to perceive multimodal conflicts by enhancing the boundaries between conflicting concepts in the feature space, acquiring human\-like conflict comprehension and detection ability\.

With the acquired conflict\-capturing capability from CPT, our CORE framework enables rapid adaptation to emerging manipulation patterns\. Superior detection performance can be achieved with only a few\-sample fine\-tuning of new manipulation types, and even under zero\-shot settings\. In summary, our main contributions are as follows:

\(1\) We introduce an effective learning paradigm for general multimodal manipulation detection, which can rapidly adapt to novel manipulations with limited target samples\.

\(2\) Moving beyond the conventional paradigm of designing models for specific manipulations, we propose CORE, a general framework for multimodal misinformation detection that endows MLLMs with human\-like conflict reasoning and enables fast adaptation to unseen misinformation patterns\.

\(3\) We construct the Conflict Attribution Corpus \(CAC\), a carefully curated dataset containing 14k samples with fine\-grained annotations of conflict factors and sources, providing a solid benchmark for studying conflict reasoning in multimodal manipulation\.

Table 1: World Knowledge Evaluation of Non\-MLLMs and MLLMs\.

\(a\)

| Models | World Knowledge \(ACC %\) |
| --- | --- |
| Non\-MLLMs | 41 |
| MLLMs | 96 |

\(b\)

| Classification Task | ACC \(%\) |
| --- | --- |
| Pres\. vs\. Football Award | 61 |
| Pres\. vs\. UK Prime Minister | 53 |

![Refer to caption](https://arxiv.org/html/2606.03066v1/x2.png)

\(a\)
![Refer to caption](https://arxiv.org/html/2606.03066v1/x3.png)

\(b\)

Figure 2: Multimodal feature visualization of two groups of concepts from Qwen2\.5VL\-3B \(a\) and Qwen2\.5VL\-3B equipped with our CORE \(b\), where the textual and visual features are shown on the left and right, respectively\.

Conflict of Interest Disclosure\. The authors declare no conflicts of interest\.

## 2 Related Works

As deepfake technologies continue to evolve, research in the field of multimodal disinformation detection has also made significant progress\. For instance, models such as HAMMER \(Shao et al\., [2023](https://arxiv.org/html/2606.03066#bib.bib17)\) and ASAP \(Zhang et al\., [2025b](https://arxiv.org/html/2606.03066#bib.bib31)\) have designed specialized contrastive learning and fine\-grained detection modules to address the specific problem of image\-text inconsistency; meanwhile, RamDG \(Shen et al\., [2025](https://arxiv.org/html/2606.03066#bib.bib19)\) focuses on celebrity\-related fake news, employing external knowledge bases for targeted detection\. In recent years, the rise of MLLMs has pushed research to new heights\. SNIFFER \(Qi et al\., [2024](https://arxiv.org/html/2606.03066#bib.bib32)\) designed a specialized two\-stage fine\-tuning process to enhance the ability to judge image\-text consistency, while FKA\-Owl \(Liu et al\., [2024](https://arxiv.org/html/2606.03066#bib.bib24)\) attempts to tackle specific types of common sense fallacies by integrating world knowledge\. To handle more complex forgeries, MMD\-Agent \(Liu et al\., [2025](https://arxiv.org/html/2606.03066#bib.bib18)\) constructs a specific multi\-step reasoning framework, and AMD \(Zhang et al\., [2025a](https://arxiv.org/html/2606.03066#bib.bib20)\) relies on detailed prior information, such as manipulation region coordinates and manipulation types, for detection\.

Despite the considerable progress in multimodal news detection methods, they suffer from two limitations\. First, they heavily rely on large\-scale datasets constructed for specific manipulation types; second, their model designs or training strategies are often specialized for certain forgery traces\. These specialized designs make it difficult to guarantee the models’ generalization ability when faced with out\-of\-distribution, and especially unseen, manipulation types\. Therefore, our work moves away from designing for specific forgery traces and instead focuses on the core flaw of forged information: conflict\. Mastering this fundamental capability allows the model to break its dependence on large\-scale, type\-specific data, thereby demonstrating robust generalization and detection capabilities for unseen manipulation types in few\-sample and even zero\-shot scenarios\.

![Refer to caption](https://arxiv.org/html/2606.03066v1/x4.png)Figure 3: \(a\) The construction process of CAC\. \(b\) An example from CAC\. \(c\) Statistics of CAC, including the distribution of conflict sources and word clouds of the Conflict Factor\.

## 3 The Challenge of Conflict Perception

Humans often do not require extensive training on similar samples to identify novel forged information\. This is largely attributable to their ability to acutely identify conflicts within news based on their world knowledge and understanding\. This capability is built upon two core foundations: 1\) a comprehensive repository of world knowledge, and 2\) a clear and sound understanding of that knowledge to support conflict capturing\. This section investigates, through a series of experiments, whether current mainstream models possess these two key capabilities\.

We first investigate a fundamental question: Do existing models possess the world knowledge required to identify fake news? To this end, we constructed a benchmark of 200 multiple\-choice questions, covering the diverse world knowledge needed to detect fake news \(see Appendix [G](https://arxiv.org/html/2606.03066#A7) for details\)\. Our evaluation includes two representative classes of models: non\-MLLM models, such as CLIP \(Radford et al\., [2021](https://arxiv.org/html/2606.03066#bib.bib33)\) and ALBEF \(Li et al\., [2021](https://arxiv.org/html/2606.03066#bib.bib34)\), and MLLMs, including Qwen2\.5VL\-3B and Gemma3\-4B\. For the non\-MLLM models, we assess their choices by calculating the cosine similarity between the embeddings of the question and the options after they pass through the encoder; the option with the highest similarity is considered the model’s prediction\. For MLLMs, we directly prompt them to output the correct option\. The experimental results, as shown in Table 1\(a\), indicate that MLLMs possess relatively complete knowledge, whereas non\-MLLMs do not\.
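
The scoring rule described above for non-MLLM encoders can be sketched as follows. This is a minimal illustration, assuming the question and option embeddings have already been produced by an encoder; the function names and the toy vectors are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero embeddings
    return dot / (norm_a * norm_b)


def predict_option(question_emb: Sequence[float],
                   option_embs: List[Sequence[float]]) -> int:
    """Index of the option whose embedding is most similar to the question."""
    scores = [cosine_similarity(question_emb, o) for o in option_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of questions answered correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# Toy embeddings (hypothetical values, for illustration only).
question = [1.0, 0.0, 0.5]
options = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.4], [-1.0, 0.0, 0.0]]
chosen = predict_option(question, options)
```

Benchmark accuracy is then the fraction of the 200 questions for which the chosen index matches the labeled answer.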

To investigate whether MLLMs possess clear conceptual boundaries like humans, we systematically analyze their feature representation space\. We select concept pairs with varying semantic differences \(e\.g\., U\.S\. President vs\. football player; U\.S\. President vs\. UK Prime Minister\), collect 100 relevant entities for each concept, and extract their multimodal features \(see Appendix [F](https://arxiv.org/html/2606.03066#A6) for details\)\. We then use t\-SNE \(Van der Maaten and Hinton, [2008](https://arxiv.org/html/2606.03066#bib.bib35)\) for visualization\. As shown in Figure 2\(a\), the results demonstrate that the MLLM’s representation space fails to form clear boundaries: the distributions of even semantically disparate concepts are diffuse and overlapping\. We further train a classifier on these features to quantify their separability; the low classification accuracy in Table 1\(b\) confirms this quantitatively\.
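
One way to quantify separability is a simple probe trained on the extracted features. The sketch below assumes a linear logistic-regression probe trained by gradient descent; the classifier type is not specified in this section, so this choice, the function names, and the toy data are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def train_linear_probe(features: List[Sequence[float]], labels: Sequence[int],
                       lr: float = 0.5, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit logistic regression (weights w, bias b) with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w: Sequence[float], b: float,
                   features: List[Sequence[float]], labels: Sequence[int]) -> float:
    """Share of samples on the correct side of the learned hyperplane."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == bool(y))
    return correct / len(labels)


# Toy, clearly separable features for two concepts (hypothetical values).
feats = [[-2.0, 0.0], [-1.0, 0.5], [1.0, -0.5], [2.0, 0.0]]
labs = [0, 0, 1, 1]
w, b = train_linear_probe(feats, labs)
acc = probe_accuracy(w, b, feats, labs)
```

A probe accuracy near chance (50% for two balanced concepts) indicates overlapping feature distributions, which is how the figures in Table 1(b) can be read. In practice the probe would be evaluated on held-out entities rather than on its training set.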

The experiments show that non\-MLLM models suffer from incomplete knowledge, while MLLMs, though addressing the knowledge repository issue, still lack conceptual clarity\. The key to resolving this dilemma is to enable models to possess knowledge and understand it in a clear, structured way, thereby learning detection based on the core principle of conflict\.

## 4 Methodology

![Refer to caption](https://arxiv.org/html/2606.03066v1/x5.png)Figure 4: The architecture of our proposed CORE\. It first employs MBPT to train cross\-modal alignment, subsequently utilizes CPT to train conflict perception, and finally achieves effective detection of novel manipulations via Rapid Adaptation\.

To overcome the above limitations, we propose the Conflict\-Oriented Reasoning \(CORE\) framework, as illustrated in Figure [4](https://arxiv.org/html/2606.03066#S4.F4)\. Built upon MLLMs, CORE leverages their extensive knowledge base and reshapes the boundaries between conflicting concepts to enhance the model’s conceptual understanding, thereby improving conflict perception capability and endowing MLLMs with human\-like conflict\-capturing ability\. To enable this, we first construct the Conflict Attribution Corpus \(CAC\) with fine\-grained annotations of conflict sources and factors\. Next, Modality Bridging Pre\-Training \(MBPT\) is conducted to train a Cross\-Modal Aligner\. This aligner bridges the modality gap, enabling the full utilization of CAC annotations\. Finally, the Conflict Perception Training \(CPT\) stage explicitly reshapes the model’s conceptual understanding of conflicting elements, thereby refining its ability to perceive and reason over multimodal conflicts\.

### 4\.1 Conflict Attribution Corpus

The CAC provides explicit supervision signals to facilitate conflict perception learning\. As illustrated in Figure[3](https://arxiv.org/html/2606.03066#S2.F3)\(a\), the construction of CAC involves the following steps:

- Source Sample Selection\. Because SAMM \(Shen et al\., [2025](https://arxiv.org/html/2606.03066#bib.bib19)\) provides rich manipulation annotations, including manipulated objects and regions, which offer valuable cues for subsequent conflict attribution generation, we select 100k image\-text pairs from it as our base data\.

- Background Knowledge Collection\. Next, we collect external supplementary materials that are semantically related to each image\-text pair as background knowledge via the Google Search API \(Google, [2025](https://arxiv.org/html/2606.03066#bib.bib36)\), providing reliable and comprehensive support for conflict reasoning\.

- Conflict Rationale Generation\. Subsequently, the integrated information, including the multimodal inputs, the manipulation prior, and the background information, is fed into an MLLM randomly selected from an expert pool of \{GPT\-4o, Gemini2\.5\-Pro, Qwen3\-VL\-Plus\} \(Team, [2024a](https://arxiv.org/html/2606.03066#bib.bib37), [2025a](https://arxiv.org/html/2606.03066#bib.bib39); Bai et al\., [2023](https://arxiv.org/html/2606.03066#bib.bib38)\), which is instructed to generate a detailed reason why the news is false\. The conflict rationale is then cross\-validated for plausibility by the other two MLLMs\.

- Conflict Structuring\. After validation, the sample is once again sent to a randomly selected MLLM to distill the reason into a structured format of Conflict Factor 1, Conflict Factor 2 and their respective Conflict Source 1, Conflict Source 2, where the conflict factor specifies the content of the contradiction, while the conflict source pinpoints its origin, as shown in Figure [3](https://arxiv.org/html/2606.03066#S2.F3)\(b\)\. This result also undergoes a final review by the remaining two MLLMs\.
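
The generate-then-cross-validate pattern shared by the last two steps can be sketched as below. This is only an illustration under stated assumptions: the model calls are stand-in Python callables rather than real API clients, and discarding a sample when any reviewer rejects it is an assumed policy, since the handling of failed validations is deferred to the appendix.

```python
import random
from typing import Callable, Dict, Optional

# Hypothetical interfaces: a generator maps a sample to text,
# a validator judges whether that text is plausible for the sample.
Generator = Callable[[dict], str]
Validator = Callable[[dict, str], bool]


def generate_with_cross_validation(sample: dict,
                                   generators: Dict[str, Generator],
                                   validators: Dict[str, Validator],
                                   rng: random.Random) -> Optional[str]:
    """Pick one expert at random to generate, then let the other experts review."""
    author = rng.choice(sorted(generators))
    output = generators[author](sample)
    reviewers = [name for name in sorted(validators) if name != author]
    # Assumed policy: keep the output only if every reviewer accepts it.
    if all(validators[name](sample, output) for name in reviewers):
        return output
    return None


# Stub experts standing in for the three MLLMs in the pool.
pool = ["expert_a", "expert_b", "expert_c"]
gens = {name: (lambda s, n=name: f"rationale by {n}") for name in pool}
accept_all = {name: (lambda s, out: True) for name in pool}
reject_all = {name: (lambda s, out: False) for name in pool}

kept = generate_with_cross_validation({"id": 0}, gens, accept_all, random.Random(0))
dropped = generate_with_cross_validation({"id": 0}, gens, reject_all, random.Random(0))
```

The same helper could be applied twice per sample, once for the free-form rationale and once for the structured factor/source output.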

Statistics\. As shown in Figure [3](https://arxiv.org/html/2606.03066#S2.F3)\(c\), CAC contains 14k instances\. Its final data structure is <Image $I$, Text $T$, \{Conflict Factor $C_{1}$, Conflict Factor $C_{2}$\}, \{Conflict Source $S_{1}$, Conflict Source $S_{2}$\}\>\. Regarding the distribution of conflict sources, 29\.98% of conflicts originate from the news caption, 36\.86% from the news image, and 33\.16% from world knowledge\. Please refer to Appendix [H](https://arxiv.org/html/2606.03066#A8) for the prompts and validation protocols\.
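
As a reading aid, the record structure above could be represented as follows. The field names, the source label strings, and the example values are illustrative assumptions; the released dataset may use a different schema.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed label vocabulary for the three conflict sources named in the text.
SOURCES = ("text", "image", "world knowledge")


@dataclass(frozen=True)
class CACSample:
    image_path: str                    # I: the news image
    text: str                          # T: the news caption
    conflict_factors: Tuple[str, str]  # (C1, C2): the contradictory contents
    conflict_sources: Tuple[str, str]  # (S1, S2): where each factor originates

    def __post_init__(self) -> None:
        for source in self.conflict_sources:
            if source not in SOURCES:
                raise ValueError(f"unknown conflict source: {source!r}")


# Illustrative record built from the paper's running example (hypothetical file name).
example = CACSample(
    image_path="news_0001.jpg",
    text="Donald Trump wins the football award",
    conflict_factors=("wins the football award", "Donald Trump is a U.S. President"),
    conflict_sources=("text", "world knowledge"),
)
```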

Table 2: Performance comparison \(ACC\) on multiple datasets using 100\-750 \(100\-350\) samples\.

### 4\.2 Modality Bridging Pre\-Training

Although the sources of conflict span both visual and textual modalities, their annotated form in CAC is uniformly text\. Therefore, accurately mapping conflict descriptions that originate from vision but exist in text form back to the visual space becomes a challenge\. To bridge this modality gap, we introduce a concise and efficient Cross\-modal Aligner and forge its cross\-modal alignment capability through a dedicated pre\-training stage\. After the model acquires reliable alignment capabilities, we then commence the second stage of conflict perception training on CAC\.

This stage of training is conducted on 50k samples from the FineHARD \(Xie et al\., [2025](https://arxiv.org/html/2606.03066#bib.bib40)\) dataset, whose samples consist of an image $I$, a positive sample $\mathit{PS}$ that exists in the image, and three hard negative samples $\{\mathit{NS}_{i}\}_{i=1}^{3}$ that are semantically close to $\mathit{PS}$ but do not exist in the image \(see Appendix for examples\)\. For feature extraction, the image $I$ is passed through the vision encoder $\mathcal{E}_{V}$ and a modality connector $\mathcal{P}$ to obtain the visual feature sequence $V=\mathcal{P}(\mathcal{E}_{V}(I))$\. Simultaneously, the positive and negative text samples, each appended with an <EOS\> token, are fed into the LLM\. We then extract the hidden state corresponding to the <EOS\> token from the LLM’s final layer as the global feature\. From this, we get the global text feature for the positive sample, $\mathbf{t}_{p}$, and a set of global text features for the negative samples, $\{\mathbf{t}_{n_{i}}\}$\. Next, using $\mathbf{t}_{p},\{\mathbf{t}_{n_{i}}\}$ as the Query and $V$ as the Key and Value \(Vaswani et al\., [2017](https://arxiv.org/html/2606.03066#bib.bib41)\), we compute text\-guided visual features $\mathbf{v}_{p},\{\mathbf{v}_{n_{i}}\}$ via a Cross\-modal Aligner, which is simply implemented as a cross\-attention layer:

$$\mathbf{v}_{p}=\operatorname{Aligner}\left(\mathbf{t}_{p},V,V\right),\quad\{\mathbf{v}_{n_{i}}\}=\operatorname{Aligner}\left(\{\mathbf{t}_{n_{i}}\},V,V\right). \tag{1}$$

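
To make the Query/Key/Value roles in Eq. (1) concrete, here is a minimal single-head scaled dot-product cross-attention in plain Python. It is a simplification: a real cross-attention layer would also apply learned query, key, value, and output projections, which are omitted here, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aligner(query: Sequence[float],
            keys: List[Sequence[float]],
            values: List[Sequence[float]]) -> List[float]:
    """Text-guided visual feature: attention-weighted sum of visual tokens."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


# One global text feature attending over two visual tokens (toy values).
t_p = [1.0, 0.0]
V = [[1.0, 0.0], [0.0, 1.0]]
v_p = aligner(t_p, V, V)  # leans toward the first visual token
```

Calling the same function once per negative text feature yields the set of text-guided features for the negatives.
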
To achieve a fine\-grained alignment, we adopt the contrastive learning loss proposed by SigLIP \(Zhai et al\., [2023](https://arxiv.org/html/2606.03066#bib.bib43)\), which aims to ensure that the extracted visual feature $\mathbf{v}_{p}$ is semantically highly correlated with the corresponding text feature $\mathbf{t}_{p}$:

$$\mathcal{L}_{cl}=\sum_{(\mathbf{t},\mathbf{v})\in Q}\frac{1}{1+e^{y_{\mathbf{t}}(s_{1}\cdot\langle\mathbf{t},\mathbf{v}\rangle+b_{1})}}, \tag{2}$$

where $Q=\{(\mathbf{t}_{p},\mathbf{v}_{p}),\{(\mathbf{t}_{n_{i}},\mathbf{v}_{n_{i}})\},\{(\mathbf{t}_{n_{i}},\mathbf{v}_{p})\}\}$, $s_{1}$ and $b_{1}$ are learnable scalar parameters, and $\langle\cdot,\cdot\rangle$ represents cosine similarity\. When $(\mathbf{t},\mathbf{v})=(\mathbf{t}_{p},\mathbf{v}_{p})$, $y_{\mathbf{t}}=1$; otherwise, $y_{\mathbf{t}}=-1$\.
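
The loss in Eq. (2) can be evaluated directly once the features are available. The sketch below follows the formula as printed, using the identity 1/(1+e^{x}) = sigmoid(-x); the scalars `s1` and `b1` are passed as plain numbers here, whereas in training they would be learnable, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cl_term(t, v, y: int, s1: float, b1: float) -> float:
    """One summand of Eq. (2): 1 / (1 + exp(y * (s1 * <t, v> + b1)))."""
    return sigmoid(-y * (s1 * cosine(t, v) + b1))


def cl_loss(t_p, v_p, t_negs: List, v_negs: List, s1: float, b1: float) -> float:
    """Sum over Q: the positive pair (y=+1) and both kinds of negative pairs (y=-1)."""
    loss = cl_term(t_p, v_p, +1, s1, b1)
    for t_n, v_n in zip(t_negs, v_negs):
        loss += cl_term(t_n, v_n, -1, s1, b1)  # pair (t_ni, v_ni)
        loss += cl_term(t_n, v_p, -1, s1, b1)  # pair (t_ni, v_p)
    return loss


# Well-aligned toy case: positive pair matches, negative text opposes the visual features.
good = cl_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]], [[1.0, 0.0]], s1=10.0, b1=0.0)
# Misaligned toy case: positive pair opposes, negative text matches.
bad = cl_loss([1.0, 0.0], [-1.0, 0.0], [[-1.0, 0.0]], [[-1.0, 0.0]], s1=10.0, b1=0.0)
```

With three hard negatives per image, as in FineHARD, the sum runs over seven pairs: one positive and six negative.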

To aid fine\-grained multimodal understanding and preserve the model’s inherent language capabilities, we further devise an object\-occurrence\-based visual question answering task\. Specifically, we construct the following question\-answering \(Antol et al\., [2015](https://arxiv.org/html/2606.03066#bib.bib44)\) instruction format:

Question: “Does the image contain $\text{RandSF}(\{\mathit{PS},\mathit{NS}_{1},\mathit{NS}_{2},\mathit{NS}_{3}\})$?”

Answer: “The image contains $\mathit{PS}$ and doesn’t contain $\{\mathit{NS}_{1},\mathit{NS}_{2},\mathit{NS}_{3}\}$\.”

where $\text{RandSF}(\cdot)$ is the random shuffling operation\. We then calculate a language generation loss $\mathcal{L}_{o2vqa}$\. The total loss function for this stage is defined as follows:

$$\mathcal{L}_{mbpt}=\mathcal{L}_{cl}+\mathcal{L}_{o2vqa}. \tag{3}$$
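
A question-answer pair in the format above could be assembled as in the following sketch. How the shuffled candidates are joined into the prompt (comma-separated here) is an assumption, as are the function name and the example phrases.

```python
import random
from typing import List, Tuple


def build_o2vqa_pair(positive: str, negatives: List[str],
                     rng: random.Random) -> Tuple[str, str]:
    """Build the object-occurrence question and its target answer."""
    candidates = [positive, *negatives]
    rng.shuffle(candidates)  # RandSF: hide the position of the positive sample
    question = f"Does the image contain {', '.join(candidates)}?"
    answer = (f"The image contains {positive} and "
              f"doesn't contain {', '.join(negatives)}.")
    return question, answer


# Hypothetical positive and hard-negative phrases for one image.
q, a = build_o2vqa_pair(
    "a red umbrella",
    ["a red raincoat", "a blue umbrella", "a red parasol"],
    random.Random(0),
)
```

Shuffling prevents the model from exploiting a fixed position of the positive sample in the question.
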
![Refer to caption](https://arxiv.org/html/2606.03066v1/x6.png)Figure 5: Performance comparison on multiple datasets using 100\-2\.5k \(100\-350 on MMFakeBench\) samples\.

### 4\.3 Conflict Perception Training

Given a sample $< I,T,\{C_{1},C_{2}\},\{S_{1},S_{2}\} > \in\text{CAC}$, the news image $I$ is passed through $\mathcal{E}_{V}$ and $\mathcal{P}$ to obtain the visual feature sequence $V$\. The two conflict factors $C_{1},C_{2}$, after appending the <EOS\> token, are fed into the LLM to obtain their corresponding global features $\mathbf{t}_{c_{1}},\mathbf{t}_{c_{2}}$\. We process the conflict factor features based on the modality of their sources $\{S_{1},S_{2}\}$\. If the source $S_{i}$ of a conflict factor $C_{i}$ is the visual modality \(i\.e\., $S_{i}$ is image\), we invoke the Cross\-modal Aligner pre\-trained in the Modality Bridging Pre\-Training \(MBPT\) stage to extract the corresponding visual feature\. Otherwise, we keep the feature unchanged\. For ease of notation, we denote the two conflict representations used for comparison as $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$, defined as follows:

$$\mathbf{z}_{i}=\begin{cases}\operatorname{Aligner}\left(\mathbf{t}_{c_{i}},V,V\right)&\text{if }S_{i}\text{ is image,}\\ \mathbf{t}_{c_{i}}&\text{otherwise.}\end{cases} \tag{4}$$

First, we adopt a conflict\-aware contrastive loss $\mathcal{L}_{cacl}$ to help the model establish clear conceptual boundaries by pushing the two conflict factor representations $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$ far apart in the semantic space, which is the key to identifying conflicts in manipulated samples:

$$\mathcal{L}_{cacl}=\frac{1}{1+e^{-(s_{2}\cdot\langle\mathbf{z}_{1},\mathbf{z}_{2}\rangle+b_{2})}}. \tag{5}$$

Minimizing this loss maximizes the distance between the two conflict factor representations\.
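
Eqs. (4) and (5) can be sketched together as below. The aligner is passed in as a callable so the block stays self-contained; the stub used in the example simply averages the visual tokens and is not the pre-trained Cross-modal Aligner. The source label string `"image"`, the scalar values, and the toy vectors are assumptions.

```python
import math
from typing import Callable, List, Sequence

AlignerFn = Callable[[Sequence[float], List[Sequence[float]], List[Sequence[float]]], List[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def conflict_representation(t_c: Sequence[float], source: str,
                            visual_tokens: List[Sequence[float]],
                            aligner: AlignerFn) -> List[float]:
    """Eq. (4): map image-sourced factors into visual space, keep others as text features."""
    if source == "image":
        return aligner(t_c, visual_tokens, visual_tokens)
    return list(t_c)


def cacl_loss(z1: Sequence[float], z2: Sequence[float], s2: float, b2: float) -> float:
    """Eq. (5): sigmoid of the scaled cosine similarity; lower when z1 and z2 are far apart."""
    x = s2 * cosine(z1, z2) + b2
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mean_pool_aligner(query, keys, values):
    """Stub aligner for the example only: ignores the query and averages the values."""
    n = len(values)
    return [sum(v[j] for v in values) / n for j in range(len(values[0]))]


V = [[1.0, 0.0], [0.0, 1.0]]
z_img = conflict_representation([0.2, 0.9], "image", V, mean_pool_aligner)
z_txt = conflict_representation([0.2, 0.9], "world knowledge", V, mean_pool_aligner)

close = cacl_loss([1.0, 0.0], [1.0, 0.0], s2=5.0, b2=0.0)   # identical directions
apart = cacl_loss([1.0, 0.0], [-1.0, 0.0], s2=5.0, b2=0.0)  # opposite directions
```

With a positive scale `s2`, the loss decreases monotonically as the cosine similarity between the two representations decreases, which matches the stated goal of separating conflicting concepts.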

In addition, we design a conflict reasoning loss to enhance conflict capture and preserve the model’s inherent language capabilities, using the following instruction format:

Question: “Does the news Real or Fake? If it’s fake, further give the reason\.”

Answer: “Real\. / Fake\. Because the $C_{1}$ from $S_{1}$ conflicts with $C_{2}$ from $S_{2}$\.”

We then calculate a language modeling loss $\mathcal{L}_{cr}$ on this answer to supervise conflict reasoning\. The total loss function for the CPT stage is defined as follows:

$$\mathcal{L}_{cpt}=\mathcal{L}_{cacl}+\mathcal{L}_{cr}. \tag{6}$$
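
The target answer for the conflict reasoning task can be filled in from a CAC annotation as sketched here. The function name, argument layout, and example strings are hypothetical; only the sentence template comes from the format above.

```python
from typing import Optional, Tuple


def build_cpt_answer(is_fake: bool,
                     factors: Optional[Tuple[str, str]] = None,
                     sources: Optional[Tuple[str, str]] = None) -> str:
    """Fill the answer template with the annotated conflict factors and sources."""
    if not is_fake:
        return "Real."
    if factors is None or sources is None:
        raise ValueError("fake samples need conflict factors and sources")
    (c1, c2), (s1, s2) = factors, sources
    return f"Fake. Because the {c1} from {s1} conflicts with {c2} from {s2}."


# Illustrative annotation (hypothetical wording).
answer = build_cpt_answer(
    True,
    factors=("football award", "U.S. President status"),
    sources=("text", "world knowledge"),
)
```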

### 4\.4 Rapid Adaptation to Novel Manipulation

To verify whether the model improved by the CORE framework has a clear conceptual understanding, we re\-examine the concept pairs “US President vs\. Football Award” and “US President vs\. UK Prime Minister”\. As shown in Figure 2\(b\), the boundaries between the concept embeddings become clear\. MLLMs with rich real\-world knowledge and clear conceptual understanding are well equipped to capture conflicts in manipulated multimodal misinformation\. Therefore, when faced with new manipulation types or news patterns, the model requires only a small amount of data for fine\-tuning to adapt quickly and achieve superior recognition performance\. To ensure the model’s generalization ability, we do not make specialized designs for specific data types\. We only have the model make predictions through a question\-answering instruction, “Is the news real or fake?”, and calculate the corresponding language generation loss $\mathcal{L}_{ra}$ when fine\-tuning\. Similarly, during inference, the model requires only this simple instruction to make predictions\.

## 5 Experiments

Table 3: Performance comparison \(ACC\) on large\-scale data\.

Implementation Details\. To validate the generalization ability of the proposed CORE framework, we select two advanced open\-source MLLMs as our backbones: Qwen2\.5VL\-3B \(Bai et al\., [2025](https://arxiv.org/html/2606.03066#bib.bib27)\) and Gemma3\-4B \(Team, [2025b](https://arxiv.org/html/2606.03066#bib.bib28)\), and apply the CORE framework to train them\. All our training processes utilize the LoRA \(Hu et al\., [2022](https://arxiv.org/html/2606.03066#bib.bib45)\) technique\. Please refer to Appendix [A](https://arxiv.org/html/2606.03066#A1) for more details\.

Datasets\. To comprehensively evaluate the model’s performance, we select four public multimodal datasets with diverse manipulation patterns: $\text{DGM}^{4}$ \(Shao et al\., [2023](https://arxiv.org/html/2606.03066#bib.bib17)\), MDSM \(Zhang et al\., [2025a](https://arxiv.org/html/2606.03066#bib.bib20)\), MMFakeBench \(Liu et al\., [2025](https://arxiv.org/html/2606.03066#bib.bib18)\), and NewsCLIPpings \(Luo et al\., [2021](https://arxiv.org/html/2606.03066#bib.bib46)\)\. To simulate the real\-world scenario where novel manipulated data is scarce, we randomly sample a small number of samples from the aforementioned datasets for training\. It is worth noting that, since MMFakeBench \(Liu et al\., [2025](https://arxiv.org/html/2606.03066#bib.bib18)\) does not include a training set, we use its validation set as training data\. Please refer to Appendix [L](https://arxiv.org/html/2606.03066#A12) for the data overlap analysis between FineHARD, CAC, and the benchmarks\.

Baselines. For a comprehensive comparison, we select various advanced multimodal manipulation detection methods as baselines and compare their performance against $\text{CORE}_{\text{Qwen}}$ and $\text{CORE}_{\text{Gemma}}$ on multiple datasets. The baseline models include non-MLLMs: HAMMER (Shao et al., [2023](https://arxiv.org/html/2606.03066#bib.bib17)), HAMMER++ (Shao et al., [2024](https://arxiv.org/html/2606.03066#bib.bib26)), RamDG (Shen et al., [2025](https://arxiv.org/html/2606.03066#bib.bib19)); as well as MLLMs specialized for detection tasks: FKA-Owl (Liu et al., [2024](https://arxiv.org/html/2606.03066#bib.bib24)), AMD (Zhang et al., [2025a](https://arxiv.org/html/2606.03066#bib.bib20)), Qwen2.5VL-3B (Bai et al., [2025](https://arxiv.org/html/2606.03066#bib.bib27)), and Gemma3-4B (Team, [2025b](https://arxiv.org/html/2606.03066#bib.bib28)). In addition, we evaluate general-purpose MLLMs with larger parameter scales in the zero-shot setting: Qwen3VL-235B (Bai et al., [2023](https://arxiv.org/html/2606.03066#bib.bib38)), Gemma3-27B (Team, [2025b](https://arxiv.org/html/2606.03066#bib.bib28)), LLaMA-3.2-Vision-90B (Team, [2024b](https://arxiv.org/html/2606.03066#bib.bib29)), and SeedVL-1.5 (Guo et al., [2025](https://arxiv.org/html/2606.03066#bib.bib30)).

![Refer to caption](https://arxiv.org/html/2606.03066v1/x7.png)

Table 4: Ablation and discussion experiments: (a) average zero-shot performance comparison on $\text{DGM}^4$ and MDSM, (b) impact of data scale in the MBPT stage, (c) impact of different loss types in the MBPT stage, and (d) impact of data scale in the CPT stage.

### 5.1 Performance Comparison

Rapid Adaptation with a Few Samples. With the conflict-capturing ability provided by the CORE framework, the model can adapt rapidly to novel manipulations with limited data. To verify this, we construct training subsets by randomly sampling 100, 200, 500, 750, 1k, 1.5k, 2k, and 2.5k samples from each dataset. Table [2](https://arxiv.org/html/2606.03066#S4.T2) shows the performance of all methods with training set sizes of 100, 200, 500, and 750. Figure [5](https://arxiv.org/html/2606.03066#S4.F5) shows the performance trend as the training set size increases (100 to 2.5k).
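The subset construction above can be sketched as follows. This is a minimal illustration, assuming each subset is drawn independently with a fixed seed; the text does not say whether the subsets are nested or which seed is used, and `sample_subsets` and the placeholder id pool are hypothetical.

```python
import random

SUBSET_SIZES = [100, 200, 500, 750, 1000, 1500, 2000, 2500]  # sizes listed in the text

def sample_subsets(sample_ids, sizes=SUBSET_SIZES, seed=0):
    """Draw one random training subset (without replacement) per size."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {n: rng.sample(sample_ids, n) for n in sizes}

pool = list(range(10_000))  # placeholder ids standing in for one dataset's training split
subsets = sample_subsets(pool)
```

Each entry of `subsets` would then serve as the training data for one few-shot run on that dataset.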

As shown in Table [2](https://arxiv.org/html/2606.03066#S4.T2) and Figure [5](https://arxiv.org/html/2606.03066#S4.F5), both $\text{CORE}_{\text{Qwen}}$ and $\text{CORE}_{\text{Gemma}}$ achieve SOTA performance across multiple datasets when training data is limited, surpassing the second-best methods by an average of 9.94% and 10.50%, respectively. This demonstrates that the CORE framework can rapidly adapt to novel image manipulation techniques. Furthermore, the strong performance of CORE across two different MLLMs validates the method's generalization ability.

Zero-shot Cross-Manipulation Detection. To evaluate the model's generalization, we use the MDSM and DGM4 datasets, as both feature multiple manipulation categories, each with a sufficient number of training samples. For each target manipulation type within a dataset, we train a model on 1k randomly sampled instances drawn from the other manipulation types and authentic news samples. The model is then evaluated by direct zero-shot inference on a test set constructed exclusively from the held-out target manipulation type and authentic samples. As shown in Table 4(a), our methods significantly outperform the baselines (e.g., $\text{CORE}_{\text{Qwen}}$ and $\text{CORE}_{\text{Gemma}}$ surpass the second-best methods on MDSM by over 14% and 12%, respectively). This superiority validates CORE's design principle of detecting the inherent conflicts essential to fake news. Please refer to Appendix [J](https://arxiv.org/html/2606.03066#A10) for cross-dataset zero-shot results.
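The held-out protocol described above can be sketched as a leave-one-type-out split. This is an illustrative sketch only: the record fields (`type`, `split`), the placeholder type names, and the function name are assumptions, not the datasets' actual schema.

```python
import random

def leave_one_type_out(samples, target_type, n_train=1000, seed=0):
    """Train on other manipulation types plus authentic samples;
    test only on the held-out target type plus authentic samples."""
    rng = random.Random(seed)
    train_pool = [s for s in samples
                  if s["split"] == "train" and s["type"] != target_type]
    test_set = [s for s in samples
                if s["split"] == "test" and s["type"] in (target_type, "authentic")]
    train_set = rng.sample(train_pool, min(n_train, len(train_pool)))
    return train_set, test_set

# Toy records with hypothetical type names.
samples = [
    {"id": f"{split}-{t}-{i}", "type": t, "split": split}
    for split in ("train", "test")
    for t in ("authentic", "type_A", "type_B", "type_C")
    for i in range(20)
]
train_set, test_set = leave_one_type_out(samples, target_type="type_C", n_train=30)
```

Repeating the call once per manipulation type yields one train/test pair per held-out type, so the target type never appears during training.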

Large-scale Training Data. To verify CORE's scalability, we evaluate it on the full MDSM and SAMM datasets, together with 200k randomly sampled NewsCLIPpings examples.

As shown in Table [3](https://arxiv.org/html/2606.03066#S5.T3), CORE remains highly effective under large-scale manipulation data of the same type. Across the three benchmarks, $\text{CORE}_{\text{Qwen}}$ and $\text{CORE}_{\text{Gemma}}$ outperform the second-best method by an average of 3.79% and 4.16%, respectively, demonstrating strong stability and robustness.

Further Discussion. Appendix [I](https://arxiv.org/html/2606.03066#A9) reports experiments on time-sensitive events beyond the model's pre-training scope, while Appendix [K](https://arxiv.org/html/2606.03066#A11) compares prompting strategies with CORE training.

Table 5: Loss component ablation in the MBPT and CPT stages on $\text{CORE}_{\text{Qwen}}$.

### 5.2 Ablation Study

To validate the effectiveness of each component within the CORE framework, we design a series of ablation experiments on $\text{CORE}_{\text{Qwen}}$:

Data Scale in MBPT. We evaluated the impact of MBPT data volume (0–75k) on random subsets (500 (300) samples). As shown in Table 4(b), insufficient data (0k, 25k) leads to an under-trained Cross-modal Aligner, hindering conflict localization and causing a 4.47% average drop (0k vs. 50k). Conversely, the negligible gap between 50k and 75k indicates performance saturation, confirming that 50k samples suffice for robust alignment.

Loss Discussion in MBPT. To validate our loss design, we compared it with an MSE-based alternative in which ROIAlign (He et al., [2017](https://arxiv.org/html/2606.03066#bib.bib47)) aligns positive visual features $PS$ directly with text features $v_{c_i}$. As shown in Table 4(c), this explicit supervision lags behind contrastive learning (He et al., [2020](https://arxiv.org/html/2606.03066#bib.bib48)) by 2.67%. The decline stems from breaking training consistency with CPT and from ROIAlign's limited ability to capture global semantics. For instance, resolving some conflicts (e.g., linking "Ballon d'Or" to a "football field") requires global context that ROIAlign's strict localization severs, thereby degrading performance.

Data Scale in CPT. To explore the impact of data scale in CPT, we vary the CAC volume from 0k to 38k and compare against the standard setting (14k). The results in Table 4(d) reveal that removing the CPT stage (0k) leads to a sharp performance deterioration (avg. 4.58%), underscoring the necessity of CPT. Increasing the volume to 7k improves accuracy but still lags behind 14k (avg. 0.79% gap), suggesting insufficient data for learning semantic conflicts. Conversely, expanding to 38k slightly degrades performance (avg. 0.43%), potentially due to overfitting on manipulation artifacts specific to the SAMM dataset.

Loss Components in MBPT and CPT. To further disentangle the contribution of each objective, we remove one loss term at a time in MBPT and CPT, as shown in Table [5](https://arxiv.org/html/2606.03066#S5.T5). In MBPT, removing $\mathcal{L}_{cl}$ causes a larger average drop than removing $\mathcal{L}_{o2vqa}$ (4.44% vs. 1.37%), indicating that contrastive alignment provides the primary signal for bridging textual concepts and visual evidence, while the VQA-style objective further stabilizes multimodal instruction learning. In CPT, removing $\mathcal{L}_{cacl}$ leads to the most severe degradation (avg. 11.75%), confirming that explicitly separating conflicting factors is central to conflict perception. Removing $\mathcal{L}_{cr}$ also reduces performance (avg. 1.16%), showing that conflict reasoning supervision helps preserve the model's ability to express and use the learned conflict cues during prediction.

## 6 Conclusion

This paper introduces CORE, a conflict-oriented reasoning framework that enhances MLLMs with explicit conflict-capturing capability for robust and generalizable misinformation detection. By leveraging the newly constructed CAC and a conflict-aware training paradigm, CORE performs conflict-perception training and enables rapid adaptation to unseen manipulation types.

## Impact Statement

Positive Impacts: By focusing on fundamental inconsistencies rather than specific manipulation patterns, CORE improves the robustness of information ecosystems against evolving generative AI. Its high-efficiency reasoning reduces the energy-intensive requirement for frequent full-parameter retraining, supporting sustainable AI deployment.

Ethical Considerations & Mitigation: To prevent the potential misuse of our model or the Conflict Attribution Corpus (CAC) for generating more deceptive content, we will implement a strict data access protocol. The CAC will be released exclusively for research purposes under a specialized license that prohibits its use in generative tasks. Furthermore, all data in the corpus are sourced from public domains to safeguard privacy.

## Acknowledgements

This work was funded by the National Natural Science Foundation of China \(No\. 62572166, 62302140, 62502144, 62502142, 62573399\) and the Natural Science Foundation of Anhui Province \(No\. 2508085QF226\)\. The computation is completed on the HPC Platform of Hefei University of Technology\.

## References

- S\. Abdelnabi, R\. Hasan, and M\. Fritz \(2022\)Open\-domain, content\-based, multi\-modal fact\-checking of out\-of\-context images via online resources\.InCVPR,pp\. 14920–14929\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- S\. Antol, A\. Agrawal, J\. Lu, M\. Mitchell, D\. Batra, C\. L\. Zitnick, and D\. Parikh \(2015\)VQA: visual question answering\.InICCV,pp\. 2425–2433\.Cited by:[§4\.2](https://arxiv.org/html/2606.03066#S4.SS2.p4.1)\.
- J\. Bai, S\. Bai, S\. Yang, S\. Wang, S\. Tan, P\. Wang, J\. Lin, C\. Zhou, and J\. Zhou \(2023\)Qwen\-vl: A frontier large vision\-language model with versatile abilities\.CoRR\.Cited by:[§4\.1](https://arxiv.org/html/2606.03066#S4.SS1.p4.1),[§5](https://arxiv.org/html/2606.03066#S5.p3.2)\.
- S\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, S\. Song, K\. Dang, P\. Wang, S\. Wang, J\. Tang, H\. Zhong, Y\. Zhu, M\. Yang, Z\. Li, J\. Wan, P\. Wang, W\. Ding, Z\. Fu, Y\. Xu, J\. Ye, X\. Zhang, T\. Xie, Z\. Cheng, H\. Zhang, Z\. Yang, H\. Xu, and J\. Lin \(2025\)Qwen2\.5\-vl technical report\.CoRR\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p4.1),[§5](https://arxiv.org/html/2606.03066#S5.p1.1),[§5](https://arxiv.org/html/2606.03066#S5.p3.2)\.
- Y\. Bei, H\. Lou, J\. Geng, E\. Liu, L\. Cheng, J\. Song, M\. Song, and Z\. Feng \(2024\)A large\-scale universal evaluation benchmark for face forgery detection\.CoRRabs/2406\.09181\.External Links:[Link](https://doi.org/10.48550/arXiv.2406.09181),[Document](https://dx.doi.org/10.48550/ARXIV.2406.09181),2406\.09181Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.
- T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. M\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\)Language models are few\-shot learners\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.
- R\. Chen, X\. Chen, B\. Ni, and Y\. Ge \(2020\)SimSwap: an efficient framework for high fidelity face swapping\.InACM MM,pp\. 2003–2011\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.
- G\. Gao, H\. Huang, C\. Fu, Z\. Li, and R\. He \(2021\)Information bottleneck disentanglement for identity swapping\.InCVPR,pp\. 3404–3413\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.
- Google \(2025\)Google search API\.External Links:[Link](https://developers.google.com/custom-search/v1/overview)Cited by:[§4\.1](https://arxiv.org/html/2606.03066#S4.SS1.p3.1)\.
- D\. Guo, F\. Wu, F\. Zhu, F\. Leng, G\. Shi, H\. Chen, H\. Fan, J\. Wang, J\. Jiang, J\. Wang, J\. Chen, J\. Huang, K\. Lei, L\. Yuan, L\. Luo, P\. Liu, Q\. Ye, R\. Qian, S\. Yan, S\. Zhao, S\. Peng, S\. Li, S\. Yuan, S\. Wu, T\. Cheng, W\. Liu, W\. Wang, X\. Zeng, X\. Liu, X\. Qin, X\. Ding, X\. Xiao, X\. Zhang, X\. Zhang, X\. Xiong, Y\. Peng, Y\. Chen, Y\. Li, Y\. Hu, Y\. Lin, Y\. Hu, Y\. Zhang, Y\. Wu, Y\. Li, Y\. Liu, Y\. Ling, Y\. Qin, Z\. Wang, Z\. He, A\. Zhang, B\. Yi, B\. Liao, C\. Huang, C\. Zhang, C\. Deng, C\. Deng, C\. Lin, C\. Yuan, C\. Li, C\. Gou, C\. Lou, C\. Wei, C\. Liu, C\. Li, D\. Zhu, D\. Zhong, F\. Li, F\. Zhang, G\. Wu, G\. Li, G\. Xiao, H\. Lin, H\. Yang, H\. Wang, H\. Ji, H\. Hao, H\. Shen, H\. Li, J\. Li, J\. Wu, J\. Zhu, J\. Jiao, J\. Feng, J\. Chen, J\. Duan, J\. Liu, J\. Zeng, J\. Tang, J\. Sun, J\. Chen, J\. Long, J\. Feng, J\. Zhan, J\. Fang, J\. Lu, K\. Hua, K\. Liu, K\. Shen, K\. Zhang, and K\. Shen \(2025\)Seed1\.5\-vl technical report\.CoRR\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p4.1),[§5](https://arxiv.org/html/2606.03066#S5.p3.2)\.
- K\. Haydarov, A\. Muhamed, X\. Shen, J\. Lazarevic, I\. Skorokhodov, C\. J\. Galappaththige, and M\. Elhoseiny \(2024a\)Adversarial text to continuous image generation\.InCVPR,pp\. 6316–6326\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- K\. Haydarov, A\. Muhamed, X\. Shen, J\. Lazarevic, I\. Skorokhodov, C\. J\. Galappaththige, and M\. Elhoseiny \(2024b\)Adversarial text to continuous image generation\.InCVPR,pp\. 6316–6326\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- K\. He, H\. Fan, Y\. Wu, S\. Xie, and R\. B\. Girshick \(2020\)Momentum contrast for unsupervised visual representation learning\.InCVPR,pp\. 9726–9735\.Cited by:[§5\.2](https://arxiv.org/html/2606.03066#S5.SS2.p3.2)\.
- K\. He, G\. Gkioxari, P\. Dollár, and R\. B\. Girshick \(2017\)Mask R\-CNN\.InICCV,Cited by:[§5\.2](https://arxiv.org/html/2606.03066#S5.SS2.p3.2)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InICLR,Cited by:[§5](https://arxiv.org/html/2606.03066#S5.p1.1)\.
- L\. Jiang, R\. Li, W\. Wu, C\. Qian, and C\. C\. Loy \(2020a\)DeeperForensics\-1\.0: A large\-scale dataset for real\-world face forgery detection\.InCVPR,pp\. 2886–2895\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- L\. Jiang, R\. Li, W\. Wu, C\. Qian, and C\. C\. Loy \(2020b\)DeeperForensics\-1\.0: a large\-scale dataset for real\-world face forgery detection\.InCVPR,pp\. 2889–2898\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- J\. Li, R\. R\. Selvaraju, A\. Gotmare, S\. R\. Joty, C\. Xiong, and S\. C\. Hoi \(2021\)Align before fuse: vision and language representation learning with momentum distillation\.InNeurIPS,pp\. 9694–9705\.Cited by:[§3](https://arxiv.org/html/2606.03066#S3.p2.1)\.
- Y\. Li, X\. Yang, P\. Sun, H\. Qi, and S\. Lyu \(2020a\)Celeb\-df: a large\-scale challenging dataset for deepfake forensics\.InCVPR,pp\. 3207–3216\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- Y\. Li, X\. Yang, P\. Sun, H\. Qi, and S\. Lyu \(2020b\)Celeb\-df: A large\-scale challenging dataset for deepfake forensics\.InCVPR,pp\. 3204–3213\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- Z\. Li, R\. Tucker, N\. Snavely, and A\. Holynski \(2024\)Generative image dynamics\.InCVPR,pp\. 24142–24153\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- J\. Lian, L\. Liu, Y\. Wang, Y\. Wu, L\. Wu, L\. Zhu, and Z\. Zheng \(2026\)Generating attribution reports for manipulated facial images: a dataset and baseline\.InACL,Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.
- X\. Liu, P\. Li, H\. Huang, Z\. Li, X\. Cui, J\. Liang, L\. Qin, W\. Deng, and Z\. He \(2024\)FKA\-owl: advancing multimodal fake news detection through knowledge\-augmented lvlms\.InACM MM,pp\. 10154–10163\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1),[§2](https://arxiv.org/html/2606.03066#S2.p1.1),[§5](https://arxiv.org/html/2606.03066#S5.p3.2)\.
- X\. Liu, Z\. Li, P\. Li, H\. Huang, S\. Xia, X\. Cui, L\. Huang, W\. Deng, and Z\. He \(2025\)MMFakeBench: A mixed\-source multimodal misinformation detection benchmark for lvlms\.InICLR,Cited by:[§2](https://arxiv.org/html/2606.03066#S2.p1.1),[§5](https://arxiv.org/html/2606.03066#S5.p2.1)\.
- Z\. Lu, D\. Huang, L\. Bai, J\. Qu, C\. Wu, X\. Liu, and W\. Ouyang \(2023\)Seeing is not always believing: benchmarking human and model perception of ai\-generated images\.InNeurIPS,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- G\. Luo, T\. Darrell, and A\. Rohrbach \(2021\)NewsCLIPpings: automatic generation of out\-of\-context multimodal media\.InEMNLP,pp\. 6801–6817\.Cited by:[§5](https://arxiv.org/html/2606.03066#S5.p2.1)\.
- A\. Madaan, S\. Zhou, U\. Alon, Y\. Yang, and G\. Neubig \(2022\)Language models of code are few\-shot commonsense learners\.InEMNLP,pp\. 1384–1403\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.
- O\. Patashnik, Z\. Wu, E\. Shechtman, D\. Cohen\-Or, and D\. Lischinski \(2021\)StyleCLIP: text\-driven manipulation of stylegan imagery\.InICCV,pp\. 2085–2094\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.
- P\. Qi, Z\. Yan, W\. Hsu, and M\. Lee \(2024\)Sniffer: multimodal large language model for explainable out\-of\-context misinformation detection\.InCVPR,pp\. 13052–13062\.Cited by:[§2](https://arxiv.org/html/2606.03066#S2.p1.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever \(2021\)Learning transferable visual models from natural language supervision\.InICML,pp\. 8748–8763\.Cited by:[§3](https://arxiv.org/html/2606.03066#S3.p2.1)\.
- R\. Shao, T\. Wu, and Z\. Liu \(2022\)Detecting and recovering sequential deepfake manipulation\.InECCV,pp\. 712–728\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- R\. Shao, T\. Wu, and Z\. Liu \(2023\)Detecting and grounding multi\-modal media manipulation\.InCVPR,pp\. 6904–6913\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1),[§2](https://arxiv.org/html/2606.03066#S2.p1.1),[§5](https://arxiv.org/html/2606.03066#S5.p2.1),[§5](https://arxiv.org/html/2606.03066#S5.p3.2)\.
- R\. Shao, T\. Wu, J\. Wu, L\. Nie, and Z\. Liu \(2024\)Detecting and grounding multi\-modal media manipulation and beyond\.IEEE TPAMI46\(8\),pp\. 5556–5574\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1),[§5](https://arxiv.org/html/2606.03066#S5.p3.2)\.
- J\. Shen, Y\. Wang, N\. Pu, L\. Cheng, and Z\. Zhong \(2025\)Beyond artificial misalignment: detecting and grounding semantic\-coordinated multimodal manipulations\.InACM MM,Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1),[§2](https://arxiv.org/html/2606.03066#S2.p1.1),[§4\.1](https://arxiv.org/html/2606.03066#S4.SS1.p2.1),[§5](https://arxiv.org/html/2606.03066#S5.p3.2)\.
- G\. Team \(2025a\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.CoRR\.Cited by:[§4\.1](https://arxiv.org/html/2606.03066#S4.SS1.p4.1)\.
- G\. Team \(2025b\)Gemma 3 technical report\.CoRR\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p4.1),[§5](https://arxiv.org/html/2606.03066#S5.p1.1),[§5](https://arxiv.org/html/2606.03066#S5.p3.2)\.
- G\. Team \(2024a\)GPT\-4o system card\.CoRR\.Cited by:[§4\.1](https://arxiv.org/html/2606.03066#S4.SS1.p4.1)\.
- L\. Team \(2024b\)The llama 3 herd of models\.CoRR\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p4.1),[§5](https://arxiv.org/html/2606.03066#S5.p3.2)\.
- L\. Van der Maaten and G\. Hinton \(2008\)Visualizing data using t\-sne\.Journal of Machine Learning Research9\(11\),pp\. 2579–2605\.Cited by:[§3](https://arxiv.org/html/2606.03066#S3.p3.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, L\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InNeurIPS,pp\. 5998–6008\.Cited by:[§4\.2](https://arxiv.org/html/2606.03066#S4.SS2.p2.13)\.
- T\. Wang, Y\. Zhang, Y\. Fan, J\. Wang, and Q\. Chen \(2022a\)High\-fidelity gan inversion for image attribute editing\.InCVPR,pp\. 11379–11388\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.
- Z\. Wang, M\. Li, R\. Xu, L\. Zhou, J\. Lei, X\. Lin, S\. Wang, Z\. Yang, C\. Zhu, D\. Hoiem, S\. Chang, M\. Bansal, and H\. Ji \(2022b\)Language models with image descriptors are strong few\-shot video\-language learners\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.
- C\. Xie, B\. Wang, F\. Kong, J\. Li, D\. Liang, G\. Zhang, D\. Leng, and Y\. Yin \(2025\)FG\-CLIP: fine\-grained visual and textual alignment\.InICML,Cited by:[§4\.2](https://arxiv.org/html/2606.03066#S4.SS2.p2.13)\.
- F\. Yu, J\. Gu, Z\. Li, J\. Hu, X\. Kong, X\. Wang, J\. He, Y\. Qiao, and C\. Dong \(2024\)Scaling up to excellence: practicing model scaling for photo\-realistic image restoration in the wild\.InCVPR,pp\. 25669–25680\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p1.1)\.
- X\. Zhai, B\. Mustafa, A\. Kolesnikov, and L\. Beyer \(2023\)Sigmoid loss for language image pre\-training\.InICCV,pp\. 11941–11952\.Cited by:[§4\.2](https://arxiv.org/html/2606.03066#S4.SS2.p3.2)\.
- Y\. Zhang, Y\. Wang, K\. Han, Y\. Wu, L\. Wu, L\. Zhu, and Z\. Zheng \(2026\)Cultivating forensic reasoning for generalizable multimodal manipulation detection\.InACL,Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.
- Y\. Zhang, Y\. Wang, Y\. Wu, L\. Wu, and L\. Zhu \(2025a\)The coherence trap: when mllm\-crafted narratives exploit manipulated visual contexts\.CoRR\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1),[§2](https://arxiv.org/html/2606.03066#S2.p1.1),[§5](https://arxiv.org/html/2606.03066#S5.p2.1),[§5](https://arxiv.org/html/2606.03066#S5.p3.2)\.
- Z\. Zhang, Y\. Wang, L\. Cheng, Z\. Zhong, D\. Guo, and M\. Wang \(2025b\)ASAP: advancing semantic alignment promotes multi\-modal manipulation detecting and grounding\.InCVPR,pp\. 4005–4014\.Cited by:[§2](https://arxiv.org/html/2606.03066#S2.p1.1)\.
- Z\. Zhang, Y\. Wang, L\. Cheng, Z\. Zhong, D\. Guo, and M\. Wang \(2025c\)ASAP: advancing semantic alignment promotes multi\-modal manipulation detecting and grounding\.InCVPR,pp\. 4005–4014\.Cited by:[§1](https://arxiv.org/html/2606.03066#S1.p2.1)\.

## Appendix A Implementation Details

During MBPT, we optimize the Modality Connector and the LLM with a uniform learning rate of $1\times 10^{-4}$ for 1 epoch, while the Vision Encoder is frozen. In CPT, we continue to optimize the Modality Connector and the LLM for 3 epochs, while the Vision Encoder remains frozen. In RA, we optimize only the LLM for 3 to 8 epochs, depending on the scale of the training data. All experiments are conducted on devices equipped with 4 NVIDIA H200 GPUs.
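The three-stage schedule above can be summarized as a small configuration table. This is a sketch under stated assumptions: the class, field, and module names are hypothetical, learning rates are left as `None` where the text gives no value, and "trainable" here may in practice mean LoRA adapters attached to those modules rather than full weights.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class StageConfig:
    trainable: Tuple[str, ...]      # modules updated in this stage
    frozen: Tuple[str, ...]         # modules not optimized in this stage
    epochs: Tuple[int, int]         # (min, max) number of epochs
    learning_rate: Optional[float]  # None where the text states no value

STAGES = {
    "MBPT": StageConfig(("modality_connector", "llm"), ("vision_encoder",), (1, 1), 1e-4),
    "CPT": StageConfig(("modality_connector", "llm"), ("vision_encoder",), (3, 3), None),
    "RA": StageConfig(("llm",), ("vision_encoder", "modality_connector"), (3, 8), None),
}

def is_trainable(stage: str, module: str) -> bool:
    """Return True if `module` is optimized during `stage`."""
    return module in STAGES[stage].trainable
```

For example, `is_trainable("RA", "modality_connector")` is false, reflecting that only the LLM is optimized in the RA stage.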

## Appendix B Benchmark Brief Introduction

To ensure a rigorous evaluation of generalization, we select multiple representative benchmarks that cover a wide spectrum of manipulation types\. The defining features of these datasets are outlined below:

NewsCLIPpings focuses on the out-of-context (OOC) threat scenario: it provides a dataset where both the image and text are individually unmanipulated, but are automatically mismatched to create semantic or entity inconsistencies.

DGM4 employs a random manipulation pipeline: it randomly manipulates a part of the news image or caption (e.g., randomly replacing some words in the caption with other words).

MMFakeBench introduces a comprehensive benchmark for mixed-source MMD: it includes 3 critical sources (textual veracity distortion, visual veracity distortion, and cross-modal consistency distortion) along with 12 sub-categories of misinformation forgery types.

SAMM pioneers the detection of semantically coordinated manipulations: it first applies SOTA image manipulations, and then generates contextually plausible, semantically consistent textual narratives designed to reinforce the visual deception.

MDSM utilizes an adversarial pipeline that leverages MLLMs to simulate high-risk disinformation: it first alters images using SOTA editing techniques, and then pairs them with MLLM-generated deceptive texts that maintain semantic consistency with the visual manipulations.

## Appendix C Concepts in Section 3

![Refer to caption](https://arxiv.org/html/2606.03066v1/x8.png) Figure 6: Examples of UK Prime Ministers.

![Refer to caption](https://arxiv.org/html/2606.03066v1/x9.png) Figure 7: Examples of US President.

![Refer to caption](https://arxiv.org/html/2606.03066v1/x10.png) Figure 8: Examples of Football Award.

Figures [6](https://arxiv.org/html/2606.03066#S5.F6)-[8](https://arxiv.org/html/2606.03066#A3.F8) illustrate some of the textual concepts and images used in our experiments in Section 3.

## Appendix D Examples of CAC

![Refer to caption](https://arxiv.org/html/2606.03066v1/x11.png) Figure 9: Examples of CAC.

## Appendix E Examples of FineHARD

![Refer to caption](https://arxiv.org/html/2606.03066v1/x12.png) Figure 10: Examples of FineHARD.

## Appendix F Feature Extraction Details in Section 3

For visual features, the image is passed through the vision encoder $\mathcal{E}_{V}$ and the modality connector $\mathcal{P}$ to obtain a sequence of visual feature embeddings. For textual features, the text concept is fed into the LLM, and we extract the corresponding sequence of hidden states from its final layer. Subsequently, we apply an averaging operation to both the visual and textual feature sequences, transforming each into a tensor of shape $[1, \text{hidden size}]$.
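The averaging step above amounts to mean pooling over the sequence dimension. The sketch below uses plain Python lists and toy values in place of real encoder outputs; `mean_pool` is a hypothetical helper, and the encoder and LLM forward passes are omitted.

```python
def mean_pool(sequence):
    """Average a [seq_len, hidden_size] list of vectors into a [1, hidden_size] list."""
    seq_len = len(sequence)
    hidden_size = len(sequence[0])
    return [[sum(tok[j] for tok in sequence) / seq_len for j in range(hidden_size)]]

# Toy stand-ins for the visual embedding sequence and the final-layer hidden states.
visual_tokens = [[1.0, 2.0], [3.0, 4.0]]
textual_states = [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0]]

visual_vec = mean_pool(visual_tokens)   # [[2.0, 3.0]]
text_vec = mean_pool(textual_states)    # [[2.0, 2.0]]
```

Both outputs have the same shape regardless of sequence length, which is what allows visual and textual concepts to be compared in a shared embedding plot.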

## Appendix G Construction of World Knowledge Evaluation Benchmark

To assess whether models possess the necessary background knowledge for fake news detection, we constructed the World Knowledge Evaluation Benchmark\. This benchmark comprises 200 multiple\-choice questions meticulously designed to cover a diverse range of domains frequently targeted by misinformation, including current events \(e\.g\., political leaders, cultural awards\), social movements, geography, history, and science\.

A key aspect of our construction process was the creation of plausible, semantically related distractors. Unlike standard QA datasets where incorrect answers are often random, our distractors share the same semantic category as the ground truth. For example, distractors for the question "Which film won the Academy Award for Best Picture in 2024?" include other highly nominated films from the same ceremony (e.g., "Barbie," "Poor Things"). Similarly, distractors for the 2024 German Chancellor include the former chancellor and other contemporary European leaders. This design ensures the benchmark evaluates precise knowledge rather than coarse categorical association. Figure [11](https://arxiv.org/html/2606.03066#A7.F11) illustrates several examples from this benchmark.

![Refer to caption](https://arxiv.org/html/2606.03066v1/x13.png) Figure 11: Examples of the World Knowledge Evaluation Benchmark, highlighting the use of semantically plausible distractors.

To address the need for transparency regarding data provenance and curation, we provide the specific construction details below, focusing on data sources, collection pipelines, and domain-balancing criteria.

### G.1 Data Sources

To ensure factual accuracy and relevance, we curated data from two primary streams:

- Authoritative Knowledge Bases: Static facts (e.g., geography, history, scientific definitions) were sourced from high-reliability encyclopedic sources, primarily Wikipedia and Britannica, ensuring that the ground truth is indisputable.
- Verified News Outlets: Dynamic facts involving current events (e.g., 2024 political leaders, recent cultural awards) were cross-referenced against major reputable news agencies (e.g., BBC, Reuters, AP).

### G.2 Collection and Generation Pipeline

The question collection process followed a "Human-Guided, AI-Assisted" paradigm involving three stages:

1. Entity and Topic Extraction: We first identified high-frequency entities and topics appearing in the news, such as political figures (e.g., US Presidents, European leaders), celebrities, and major global organizations (e.g., WHO, UN). This ensures the benchmark evaluates knowledge directly relevant to the downstream detection task.
2. Distractor-Aware Question Generation: As noted in the introduction, we enforced the creation of semantically plausible distractors to test specific factual knowledge rather than simple elimination capabilities.
3. Adversarial Filtering: We employed GPT-4o to review the questions. Any question that could be answered solely through linguistic bias or simple elimination without specific knowledge was discarded or rewritten.

### G.3 Selection and Domain-Balancing Criteria

To prevent domain bias, we established a strict taxonomy covering five distinct categories. The 200 questions were balanced to ensure broad coverage of the "world knowledge" typically exploited in multimodal misinformation. The domain distribution and selection criteria are defined in Table [6](https://arxiv.org/html/2606.03066#A7.T6).

Table 6: Domain Distribution and Selection Criteria for the World Knowledge Benchmark.

#### Final Human Verification

All questions underwent a final manual verification round by the authors to confirm that: \(1\) the ground truth is unambiguous, and \(2\) the distractors are factually incorrect but contextually relevant\.

## Appendix H Prompt Templates and Validation Protocols for CAC Construction

To ensure reproducibility and transparency regarding the construction of the Conflict Attribution Corpus (CAC), we provide the exact prompts used at each stage of the pipeline, along with the specific validation criteria employed by the MLLM expert pool and human annotators. As detailed in Section 4.1, our pipeline utilizes an expert pool $\mathcal{M}=\{\text{GPT-4o}, \text{Gemini-2.5-Pro}, \text{Qwen3-VL-Plus}\}$.

### H.1 Background Knowledge Collection Prompts

As described in the implementation details, this stage leverages an MLLM to bridge the gap between the raw news sample and external knowledge\. The process follows a specific pipeline: analyzing the image and caption to extract key semantic information \(e\.g\., time, event, celebrities\), combining this into search queries, conducting separate web and image searches via the Google Search API, and finally validating the relevance of retrieved materials\.

Table 7: Prompts for Background Knowledge Collection (Google Search API Stage).

### H.2 Conflict Rationale Generation Prompts

The core logic of CORE relies on identifying why a sample is fake. We task a randomly selected MLLM from $\mathcal{M}$ to generate this rationale, which is then cross-validated by the other experts in the pool.

Table 8: Prompts for Conflict Rationale Generation and Validation.

### H.3 Conflict Structuring Prompts

This stage distills the natural language rationale into the structured format $<C_{1}, C_{2}, S_{1}, S_{2}>$ required for the CPT stage.
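To make the structured format concrete, the sketch below models one record with two conflict factors and their two sources. The class name, field names, and the `validate` check are hypothetical; "World Knowledge" is a source label mentioned in the human verification rubric, while "Image" and "Text" are assumed labels. The "red tie" vs. "blue tie" pair is borrowed from the rubric's granularity example.

```python
from dataclasses import dataclass

# "World Knowledge" appears in the validation rubric; "Image" and "Text" are assumed labels.
ALLOWED_SOURCES = {"Image", "Text", "World Knowledge"}

@dataclass(frozen=True)
class ConflictRecord:
    c1: str  # first conflict factor
    c2: str  # second conflict factor
    s1: str  # source of the first factor
    s2: str  # source of the second factor

    def validate(self) -> bool:
        """Basic sanity check: non-empty, distinct factors and known sources."""
        factors_ok = bool(self.c1.strip()) and bool(self.c2.strip()) and self.c1 != self.c2
        sources_ok = self.s1 in ALLOWED_SOURCES and self.s2 in ALLOWED_SOURCES
        return factors_ok and sources_ok

record = ConflictRecord(c1="red tie", c2="blue tie", s1="Image", s2="Text")
```

A check like `validate` only catches malformed records; whether a genuine contradiction exists between the two factors still requires the expert-pool and human review described in this appendix.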

Table 9: Prompts for Conflict Structuring and Validation.

### H.4 Final Human Verification Protocol

While the automated pipeline ensures scalability, we introduced a rigorous human\-in\-the\-loop verification step to ensure the quality of the CAC dataset\. We employed 5 distinct annotators to validate the dataset\.

Sampling Strategy:We randomly select 1k samples, ensuring balanced coverage across the various conflict source distributions\.

Validation Rubric: Annotators were instructed to reject or correct samples based on the following criteria:

1. Conflict Existence: Does a logical contradiction actually exist between $C_{1}$ and $C_{2}$?
2. Source Accuracy: Are $S_{1}$ and $S_{2}$ correctly attributed? \(e\.g\., if $S_{1}$ is labeled 'World Knowledge', does it rely on external facts rather than visual cues?\)
3. Granularity Check: Are the conflict factors fine\-grained concepts \(e\.g\., "red tie" vs "blue tie"\) rather than abstract descriptions \(e\.g\., "fake image" vs "real text"\)?

Outcome: Ultimately, 993 of the 1,000 samples passed human verification \(a pass rate of 99\.3%\), indicating high reliability in the automated generation process\.

## Appendix I Evaluation on Time\-Sensitive Events

To address the concern about the model's applicability to time\-sensitive events that fall outside its pre\-trained knowledge scope, we conducted an additional evaluation focusing on emerging misinformation\.

Dataset Construction\. We collected a distinct dataset consisting of 100 high\-risk, time\-sensitive fake news samples from social media platforms\. To rigorously test the "out\-of\-scope" condition, the majority of these events occurred in 2025, ensuring they post\-date the training cut\-off of the foundation models and our training corpus\. These samples simulate real\-world "zero\-day" misinformation scenarios where specific world knowledge is absent from the model's parametric memory\. Figure [12](https://arxiv.org/html/2606.03066#A9.F12) visualizes three examples from this collected dataset\.

Baselines\. We benchmarked $\text{CORE}_{\text{Qwen}}$ against three representative state\-of\-the\-art methods: HAMMER, AMD, and FKA\-Owl\.

Results and Analysis\. The quantitative results are presented in Table [10](https://arxiv.org/html/2606.03066#A9.T10)\. The baseline methods, which rely on specific patterns or outdated knowledge bases, reach only 44% to 52% accuracy, whereas CORE achieves an accuracy of 74%\.

This superior performance indicates that while $\text{CORE}_{\text{Qwen}}$ may lack specific knowledge of the exact 2025 event \(e\.g\., the specific outcome of a new election\), its Conflict\-Oriented Reasoning paradigm allows it to identify falsified news by detecting: \(1\) Intrinsic Logic Violations: contradictions within the text or between visual elements that violate general physical or logical rules \(which remain constant regardless of the year\)\. \(2\) Cross\-Modal Inconsistencies: discrepancies between the provided image and the textual claim that do not require specific entity knowledge to detect \(e\.g\., emotional mismatch, scene inconsistency\)\.

Table 10: Performance comparison on Time\-Sensitive Fake News\. This dataset comprises events falling outside the pre\-trained knowledge scope of the models\.

![Refer to caption](https://arxiv.org/html/2606.03066v1/x14.png)

Figure 12: Examples of time\-sensitive fake news samples \(2025\) used in our evaluation\. Despite lacking specific pre\-trained knowledge of these recent events, CORE successfully identifies the misinformation by detecting intrinsic logical conflicts and cross\-modal inconsistencies, whereas baseline models often fail\.

## Appendix J Cross\-Dataset Zero\-Shot Generalization

We conduct a rigorous Cross\-Dataset Generalization experiment\. This setting evaluates the model's performance in a true open\-world scenario, where the test data comes from a completely different source distribution than the training data and may involve unseen manipulation pipelines\.

Experimental Setup\. We use four benchmarks: NewsCLIPpings, DGM4, MMFakeBench, and MDSM\. To evaluate performance on a target dataset \(e\.g\., NewsCLIPpings\), we trained the models on a subset drawn from the remaining three datasets \(e\.g\., DGM4, MMFakeBench, and MDSM\)\. To simulate a low\-resource adaptation scenario, we randomly sampled a total of only 3,000 samples from the union of the three training datasets\. The models were then evaluated directly on the full test set of the held\-out target dataset in a zero\-shot manner\.
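The leave-one-dataset-out protocol above can be sketched as below. This is an illustrative reading of the setup, assuming uniform sampling without replacement from the pooled training data; the function name, the `(dataset, index)` sample representation, and the fixed seed are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

BENCHMARKS = ("NewsCLIPpings", "DGM4", "MMFakeBench", "MDSM")


def leave_one_out_splits(train_pools: Dict[str, Sequence[int]],
                         budget: int = 3000,
                         seed: int = 0) -> Dict[str, List[Tuple[str, int]]]:
    """For each held-out target, sample `budget` items from the other three pools."""
    splits = {}
    for target in BENCHMARKS:
        # Union of the three remaining datasets, tagged with their origin.
        union = [(name, idx)
                 for name in BENCHMARKS if name != target
                 for idx in train_pools[name]]
        rng = random.Random(seed)
        # Uniform sampling without replacement over the pooled samples.
        splits[target] = rng.sample(union, min(budget, len(union)))
    return splits
```

Each entry of the result is the training subset for one run; the model trained on it would then be evaluated on the full test set of the corresponding held-out benchmark.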

Results and Analysis\. The quantitative results are reported in Table [11](https://arxiv.org/html/2606.03066#A10.T11)\. As observed, existing methods struggle significantly in this cross\-dataset setting\. For instance, AMD, which relies heavily on specific manipulation traces and prior information, drops to near\-random performance \(40\.2% to 46\.1%\) when the testing distribution shifts\. Similarly, FKA\-Owl and HAMMER exhibit limited robustness, with accuracies hovering between 43% and 54%\. This suggests that these baselines tend to overfit to the specific data distributions or manipulation artifacts present in their training domains, failing to generalize to the semantic or physical inconsistencies inherent in unseen datasets\.

In contrast, $\text{CORE}_{\text{Qwen}}$ demonstrates superior generalization capabilities, consistently outperforming all baselines across all four held\-out datasets\. Specifically, $\text{CORE}_{\text{Qwen}}$ achieves an accuracy of 60\.3% on NewsCLIPpings and 63\.4% on MDSM, surpassing the best\-performing baseline by margins of 10\.0% and 16\.9%, respectively\. On average, our method improves over the second\-best approach by approximately 11\.4%\. This significant improvement validates that the CORE framework, by focusing on the fundamental "conflict" logic rather than dataset\-specific artifacts, successfully equips MLLMs with a more abstract and robust reasoning capability suitable for open\-world multimodal misinformation detection\.

Table 11: Cross\-Dataset Zero\-Shot Performance \(ACC %\)\. Models are trained on a mixed subset \(3k samples\) of three datasets and evaluated zero\-shot on the held\-out fourth dataset\.

## Appendix K Prompting vs\. Training

A natural question arises regarding the contribution of the proposed training framework: Can MLLMs achieve similar conflict detection capabilities by simply using a similar prompting strategy without the specialized training?

To answer this, we conducted an experiment where we directly prompted the MLLMs\. Specifically, we instructed the models to first generate the conflict factors \($C_{1}$ and $C_{2}$\) based on the input image and text, and subsequently use these factors to deduce whether the news is real or fake\. We applied this prompting strategy to the backbone model Qwen2\.5\-VL\-3B, as well as to two significantly larger and more powerful state\-of\-the\-art MLLMs: Llama\-3\.2\-Vision\-90B and Seed\-1\.6\.
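The two-step prompting baseline can be pictured roughly as follows. The prompt wording, the `ask` callable, and the verdict parsing are all hypothetical stand-ins, since the exact instructions are not reproduced here; the image input is also omitted for brevity, whereas a real MLLM call would pass it alongside the text.

```python
from typing import Callable


def conflict_then_verdict(ask: Callable[[str], str], caption: str) -> dict:
    """Step 1 elicits the conflict factors; step 2 asks for a real/fake verdict."""
    # Placeholder prompts, not the instructions used in the paper.
    factors = ask(
        "Step 1. List the two conflicting factors C1 and C2 between the "
        f"image and this caption, if any. Caption: {caption}"
    )
    reply = ask(
        "Step 2. Using these conflict factors, answer 'real' or 'fake'. "
        f"Factors: {factors}"
    )
    # Naive parsing: any mention of 'fake' in the reply counts as a fake verdict.
    verdict = "fake" if "fake" in reply.lower() else "real"
    return {"factors": factors, "verdict": verdict}
```

The second prompt conditions on the factors produced by the first, which mirrors the generate-then-deduce order described above.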

We compared these "Prompt\-only" baselines against our $\text{CORE}_{\text{Qwen}}$ fine\-tuned on 100 samples \(Rapid Adaptation\)\. The results are reported in Table [12](https://arxiv.org/html/2606.03066#A11.T12)\.

Table 12: Performance comparison between direct prompting and CORE training \(100 samples\)\. Prompt\-only denotes using the model directly with conflict\-oriented instructions without MBPT/CPT training\. $\text{CORE}_{\text{Qwen}}$ trains on 100 samples\.

Analysis\. As shown in Table [12](https://arxiv.org/html/2606.03066#A11.T12), directly prompting the models yields suboptimal results compared to the CORE framework, even when using significantly larger models\. The prompt\-only Qwen2\.5\-VL\-3B achieves only 45\.3% on DGM4 and 46\.2% on MDSM\. In contrast, by training on merely 100 samples, $\text{CORE}_{\text{Qwen}}$ boosts performance to 59\.7% \(\+14\.4%\) and 69\.0% \(\+22\.8%\), respectively\. This indicates that without the explicit MBPT and CPT, the model struggles to accurately ground visual concepts and identify subtle contradictions, often hallucinating conflicts or failing to align the visual and textual modalities effectively\.

In conclusion, simply instructing an MLLM to "find conflicts" is insufficient\. The CORE framework is essential to endow the model with the actual capability to perceive and reason about these conflicts, achieving superior generalization with minimal data\.

## Appendix L Data Leakage and Overlap Analysis

To ensure that the reported performance gains reflect genuine generalization rather than dataset bias or memorization, we explicitly clarify the relationship between our training sources \(FineHARD and SAMM\) and the evaluation benchmarks \(NewsCLIPpings, DGM4, MMFakeBench, and MDSM\)\. We conduct both a source\-based qualitative analysis and a rigorous CLIP\-based empirical verification to rule out data leakage\.

### L\.1 Source and Distribution Analysis

1\. Pre\-training Data \(FineHARD vs\. Benchmarks\): The FineHARD dataset, used for our pre\-training, is constructed based on LAION\-2B\. This dataset predominantly consists of general\-domain natural images and web\-crawled captions\. In contrast, the evaluation benchmarks \(e\.g\., NewsCLIPpings and DGM4\) are derived primarily from the VisualNews dataset, which focuses strictly on news events and journalistic imagery\. There is a fundamental domain gap between the general "in\-the\-wild" distribution of LAION\-2B and the specific "news\-caption" distribution of the benchmarks, minimizing the likelihood of direct overlap\.

2\. CAC \(SAMM vs\. Benchmarks\): The SAMM dataset, used for the Conflict Attribution Corpus \(CAC\), employs a distinct generation pipeline for creating manipulations\. The benchmarks \(e\.g\., DGM4 and NewsCLIPpings\) rely on manipulation techniques that differ significantly from the patterns in SAMM \(see Appendix [B](https://arxiv.org/html/2606.03066#A2)\)\. Consequently, there is no overlap in the manipulation logic or the specific samples used\.

### L\.2 Empirical Verification via CLIP Similarity

To further quantitatively verify the absence of overlap, we conducted a comprehensive similarity search across the datasets\.

Methodology\. We employed a pre\-trained CLIP model to extract features from:

- The specific subsets of the FineHARD \(used in MBPT\) and SAMM \(CAC\) datasets that are actually utilized for training\.
- The full test sets of all four benchmarks: NewsCLIPpings, DGM4, MMFakeBench, and MDSM\.

For every pair of samples $(S_{\text{train}},S_{\text{test}})$, where $S_{\text{train}}$ is from the training sources and $S_{\text{test}}$ is from the benchmarks, we calculated a composite similarity score $\text{Score}_{\text{sim}}$\. This score is defined as the sum of four cross\-modal and uni\-modal cosine similarities:

$$\text{Score}_{\text{sim}}=\text{Sim}(I_{\text{train}},I_{\text{test}})+\text{Sim}(T_{\text{train}},T_{\text{test}})+\text{Sim}(I_{\text{train}},T_{\text{test}})+\text{Sim}(T_{\text{train}},I_{\text{test}})\tag{7}$$

where $I$ and $T$ represent the image and text embeddings, respectively\.
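As a concrete reading of Eq. (7), the sketch below computes the composite score for one train/test pair from four embedding vectors. It assumes image and text embeddings live in the same space (as with CLIP) and are given as plain lists of floats; the function names are illustrative.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def composite_score(i_train: Sequence[float], t_train: Sequence[float],
                    i_test: Sequence[float], t_test: Sequence[float]) -> float:
    """Sum of the four pairwise similarities in Eq. (7)."""
    return (cosine(i_train, i_test)    # image vs image
            + cosine(t_train, t_test)  # text vs text
            + cosine(i_train, t_test)  # train image vs test text
            + cosine(t_train, i_test)) # train text vs test image
```

Because each cosine term lies in [-1, 1], the composite score lies in [-4, 4], and an exact duplicate pair would score high on the two uni-modal terms.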

Results\. We identified and retrieved the top\-200 pairs with the highest $\text{Score}_{\text{sim}}$ from the millions of potential combinations\. A manual inspection of these top\-200 pairs was conducted\. The inspection revealed that even among the pairs with the highest similarity scores, there were no identical images or captions, and no content duplication was observed\. The matches were primarily based on broad semantic similarities \(e\.g\., two different images containing a "dog" or a "politician"\) rather than data leakage\.
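Retrieving the highest-scoring pairs without sorting every combination can be done with a bounded heap, as in the sketch below. The function name and the generic `score` callable are assumptions; how the search was actually implemented is not stated in the text.

```python
import heapq
from itertools import product
from typing import Callable, List, Sequence, Tuple


def top_k_pairs(train: Sequence, test: Sequence,
                score: Callable[[object, object], float],
                k: int = 200) -> List[Tuple[float, int, int]]:
    """Return the k best (score, train_index, test_index) triples, best first."""
    scored = (
        (score(a, b), i, j)
        for (i, a), (j, b) in product(enumerate(train), enumerate(test))
    )
    # nlargest keeps only k candidates in memory while streaming all pairs.
    return heapq.nlargest(k, scored)
```

For millions of pairs, a batched matrix product over normalized embeddings would likely be faster; the heap version is shown only because it makes the top-k selection explicit.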

Conclusion\. Both the source provenance analysis and the empirical feature matching indicate that there is no overlap between our training data \(FineHARD, SAMM\) and the evaluation benchmarks\. We therefore attribute the performance improvements reported in this paper to the model's robust reasoning capabilities rather than memorization of testing samples\.
