ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection
Summary
ThinkDeception proposes a novel framework that leverages multimodal large language models and a progressive reinforcement learning strategy with chain-of-thought reasoning for interpretable deception detection, achieving new state-of-the-art results on standard benchmarks.
# ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

Source: [https://arxiv.org/html/2606.18988](https://arxiv.org/html/2606.18988)

Shan Liang (Shan.Liang@xjtlu.edu.cn), Xi’an Jiaotong-Liverpool University, Suzhou, China; Yiqun Yue, Xi’an Jiaotong-Liverpool University, Suzhou, China; Zhuhuayang Zhang, Xi’an Jiaotong-Liverpool University, Suzhou, China; and Tianqi Gao (Tianqi.Gao25@student.xjtlu.edu.cn), Xi’an Jiaotong-Liverpool University, Suzhou, China (2026)

###### Abstract

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end-to-end black-box paradigms. These methods suffer from a severe lack of interpretability: they fail to provide transparent reasoning trajectories and struggle to explicitly capture the subtle, cross-modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step-by-step multimodal Chain-of-Thought (CoT) dataset, we develop a foundational model, ThinkDeception-Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) equipped with a progressive training strategy.
Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded “easy-to-hard” cognitive transition. By coupling this dynamic curriculum scheduler with a multi-dimensional, process-aware reward mechanism and a reflective learning paradigm, we significantly elevate the model’s overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new state-of-the-art (SOTA), significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work drives the field of deception detection toward interpretable, multimodal cognitive reasoning.

Keywords: Deception Detection, Multimodal Learning, Chain-of-Thought, Reinforcement Learning

CCS Concepts: Computing methodologies → Scene understanding

## 1. Introduction

Although multimodal deception detection methods integrating visual and acoustic information have achieved significant progress, they remain constrained by two major bottlenecks (Guo et al., [2024](https://arxiv.org/html/2606.18988#bib.bib26)) (Zhu et al., [2025](https://arxiv.org/html/2606.18988#bib.bib29)). First, due to the high costs associated with data collection and annotation, existing datasets are limited in scale, rendering models prone to local overfitting and hindering their ability to extract universally applicable deceptive cues.
Second, the diversity of data sources (e.g., courtrooms and laboratory settings) introduces significant domain discrepancies (Cai et al., [2025](https://arxiv.org/html/2606.18988#bib.bib8)); coupled with the inherent heterogeneity across modalities, this severely impedes the learning of multimodal representations and the cross-domain generalization capabilities of the models.

With the rapid advancement of multimodal large language models (MLLMs) in areas such as video understanding (Li et al., [2024a](https://arxiv.org/html/2606.18988#bib.bib38)) (Liu et al., [2024](https://arxiv.org/html/2606.18988#bib.bib39)) (Chen et al., [2024](https://arxiv.org/html/2606.18988#bib.bib40)) (Li et al., [2024b](https://arxiv.org/html/2606.18988#bib.bib41)) (Rong et al., [2025](https://arxiv.org/html/2606.18988#bib.bib42)) (Wang et al., [2024](https://arxiv.org/html/2606.18988#bib.bib43)) (Ye et al., [2025](https://arxiv.org/html/2606.18988#bib.bib44)), we ask: can we fully leverage the reasoning potential of MLLMs to propose a method that, akin to human cognition, analyses the deceptive cues step by step and ultimately arrives at a judgement? We therefore present the first exploration of an interpretable deception detection method empowered by Reinforcement Learning (RL). During our investigation, we identified and tackled the following core bottlenecks:

(1) Lack of Fine-Grained Reasoning Datasets: Current datasets (Cai et al., [2025](https://arxiv.org/html/2606.18988#bib.bib8)) (Guo et al., [2023](https://arxiv.org/html/2606.18988#bib.bib9)) (Pérez-Rosas et al., [2015](https://arxiv.org/html/2606.18988#bib.bib6)) (Soldner et al., [2019](https://arxiv.org/html/2606.18988#bib.bib7)) are largely limited to coarse-grained veracity labels or shallow feature shifts, severely lacking the fine-grained visual and acoustic descriptive annotations necessary to supervise the reasoning process.
Most critically, the field still lacks high-quality datasets that can directly instruct RL models to engage in Chain-of-Thought (CoT) reasoning.

(2) Inadequate Logical Reasoning Capabilities: Current MLLMs lack a systematic reasoning paradigm for deception detection (Yang et al., [2024](https://arxiv.org/html/2606.18988#bib.bib48)) (Fang et al., [2026](https://arxiv.org/html/2606.18988#bib.bib49)). They struggle to align critical multimodal cues, including visual micro-expressions and Action Unit (AU) intensities, acoustic pitch and prosody fluctuations, and textual emotional shifts.

(3) Transfer Limitations of Traditional RL (Liu et al., [2025](https://arxiv.org/html/2606.18988#bib.bib46)) (Zhou et al., [2025](https://arxiv.org/html/2606.18988#bib.bib47)): While RL excels in visual understanding, its direct application to deception detection is still very limited. Relying strictly on final classification accuracy for outcome supervision creates a sparse reward environment. This frequently induces factual hallucinations in audio-visual cue extraction, severely compromising the interpretability and reliability of the reasoning chain.

(4) Significant Heterogeneity and Domain Shifts: Existing datasets are highly heterogeneous and feature spontaneous, highly subtle deceptive behaviors. Applying RL directly to data with such severe domain shifts frequently traps models in local optima or precipitates training collapse.

Figure 1. (A) Conventional deep learning directly yields predictions without interpretable processes. (B) Conversely, our method performs explicit step-by-step reasoning to deliver both transparent analytical trajectories and the final result.

To address the aforementioned challenges, we propose ThinkDeception, a novel multimodal deception detection reasoning framework enhanced by Reinforcement Learning (RL).
First, to bridge the critical gap in reasoning data, we construct Deception-10K, a high-quality multimodal Chain-of-Thought (CoT) dataset derived from open-source data. We design a specialized annotation pipeline to extract textual, visual, and acoustic cues, supplemented by fine-grained timestamp annotations to achieve precise audio-visual alignment. Second, as directly applying RL strategies to large models often suffers from convergence difficulties and fails to capture core cues, we adopt a “teach-then-align” paradigm. Using Supervised Fine-Tuning (SFT), we train ThinkDeception-Base to initially acquire the capabilities of step-by-step reasoning and cross-modal inconsistency verification. This stage ensures that the model’s reasoning process closely aligns with human psychological cognitive models. Finally, in the RL phase, we introduce a Curriculum Learning (Narvekar et al., [2020](https://arxiv.org/html/2606.18988#bib.bib45)) (Zhao et al., [2026](https://arxiv.org/html/2606.18988#bib.bib50)) mechanism. By categorizing the data into four difficulty levels (truthful, low-level, mid-level, and high-level deception), we guide model learning through a progressive, easy-to-hard approach. During this process, alongside foundational rule-based and format-based rewards, we pioneer a fine-grained, step-wise evaluation metric following the cognitive sequence of “Observation-Listening-Reasoning-Answering”. By providing comprehensive reward supervision across the multimodal reasoning chain, we aim to improve both reasoning quality and prediction accuracy.

## 2. Related Work

### 2.1. Deception Detection

Deception detection has recently achieved remarkable progress in single-modal representation and multimodal fusion.
Specifically, capturing fine-grained acoustic features (Wang et al., [2026b](https://arxiv.org/html/2606.18988#bib.bib1)), analyzing visual emotion and eye movement (Yang et al., [2020](https://arxiv.org/html/2606.18988#bib.bib2)) (Foucher et al., [2025](https://arxiv.org/html/2606.18988#bib.bib3)), and enhancing cross-domain generalization through feature alignment and unified mapping (Xiang et al., [2025](https://arxiv.org/html/2606.18988#bib.bib5)) have significantly advanced the field (Hu et al., [2024](https://arxiv.org/html/2606.18988#bib.bib53)). Nevertheless, current deep learning methods are hindered by their inherent black-box nature (Xiang et al., [2025](https://arxiv.org/html/2606.18988#bib.bib5)) (Lin et al., [2025](https://arxiv.org/html/2606.18988#bib.bib33)) (Li et al., [2025](https://arxiv.org/html/2606.18988#bib.bib34)), producing predictions that lack traceable and verifiable logical reasoning grounds. Therefore, overcoming this interpretability bottleneck to shift the paradigm from opaque classification toward full-pipeline transparent reasoning is crucial for ensuring both the reliability and accuracy of deception detection.

### 2.2. GRPO and Multimodal Reasoning

Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2606.18988#bib.bib19)) significantly reduces training overhead by estimating advantages via intra-group relative scores. DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2606.18988#bib.bib20)) demonstrated that sparse rule-based rewards can elicit emergent Chain-of-Thought (CoT) reasoning. However, relying solely on outcome rewards makes standard GRPO prone to “reward hacking” in multimodal tasks, causing the model to generate superficially fluent reasoning disconnected from perceptual evidence, thereby undermining CoT credibility.
To address this, Vision-R1 (Huang et al., [2026](https://arxiv.org/html/2606.18988#bib.bib21)) and Video-R1 (Feng et al., [2025](https://arxiv.org/html/2606.18988#bib.bib22)) extended GRPO to visual domains, demonstrating the generalization benefits of RL post-training. Subsequently, works like GRPO-CARE (Chen et al., [2025](https://arxiv.org/html/2606.18988#bib.bib23)), Fact-R1 (Zhang et al., [2025](https://arxiv.org/html/2606.18988#bib.bib24)), and EmotionThinker (Wang et al., [2026a](https://arxiv.org/html/2606.18988#bib.bib25)) introduced process-aware rewards and progressive constraints, partially mitigating advantage signal collapse and providing a paradigm for multimodal reasoning. While emerging works have successfully adapted RL-driven CoT to specific fields like autonomous driving (ThinkDrive (Zhao et al., [2026](https://arxiv.org/html/2606.18988#bib.bib50))) and emotion recognition (EmotionThinker (Wang et al., [2026a](https://arxiv.org/html/2606.18988#bib.bib25)), EMO-R3 (Yang et al., [2024](https://arxiv.org/html/2606.18988#bib.bib48))), they operate under the fundamental assumption of cooperative consistency across multimodal features. Deception detection, however, is an inherently adversarial cognitive process driven by deliberate behavioral camouflage. Therefore, the development of GRPO optimization mechanisms for deception detection is critical for advancing deceptive reasoning in multimodal large language models (MLLMs).

## 3. Method

Figure 2. The overall pipeline of the proposed ThinkDeception framework.
It comprises five main components: (a) Dataset Processing Pipeline; (b) Supervised Fine-Tuning (SFT) Phase; (c) Reinforcement Learning Phase; (d) Progressive Training Strategy, which stratifies the dataset into four distinct difficulty levels to facilitate an easy-to-hard progressive curriculum; and (e) Reward and Reflection Mechanism.

The proposed ThinkDeception framework unfolds in three key stages, as depicted in Figure [2](https://arxiv.org/html/2606.18988#S3.F2). Initially, we formulate the Deception-10K dataset, which provides step-by-step reasoning trajectories grounded in fine-grained visual, acoustic, and temporal annotations. Subsequently, leveraging Qwen2.5-Omni-7B (Xu et al., [2025a](https://arxiv.org/html/2606.18988#bib.bib18)) as the foundational architecture, we apply Supervised Fine-Tuning (SFT) for a cold start, producing ThinkDeception-Base equipped with initial deductive reasoning skills. Ultimately, we deploy a novel scheduled training strategy alongside a reflective Reinforcement Learning (RL) approach to refine the model’s reasoning process.

### 3.1. Deception-10K

By combining open-source benchmarks including MDPE (Cai et al., [2025](https://arxiv.org/html/2606.18988#bib.bib8)), DOLOS (Guo et al., [2023](https://arxiv.org/html/2606.18988#bib.bib9)), RLTD (Pérez-Rosas et al., [2015](https://arxiv.org/html/2606.18988#bib.bib6)), and Box of Lies (Soldner et al., [2019](https://arxiv.org/html/2606.18988#bib.bib7)), we construct the first fine-grained audio-visual Chain-of-Thought (CoT) dataset. This dataset comprises 10,000 video-reasoning pairs, totaling approximately 50 hours, with each sample featuring step-by-step reasoning trajectories and precise timestamp alignment annotations.
For visual feature extraction, we employ OpenFace3.0 (Hu et al., [2025](https://arxiv.org/html/2606.18988#bib.bib10)), pre-trained on the Affect+ dataset (Fard et al., [2026](https://arxiv.org/html/2606.18988#bib.bib32)), to systematically extract facial Action Unit (AU) intensities and eight fundamental emotion categories. Instead of using discrete emotion labels, we use the emotion probability distribution to represent subtle and dynamic emotional states. The core motivation behind this design is that both emotional expression and the leakage of deceptive cues are highly coherent temporal processes. Forcing the use of discrete labels inherently causes the model to overlook discriminative, latent emotional fluctuations. In contrast, continuous probability distributions preserve these easily neglected yet crucial authentic emotional shifts.

In terms of acoustic features, we use standard speech processing tools to disentangle and extract pitch, speech rate, and prosody directly from the raw audio signals. Ultimately, these fine-grained multimodal cues serve as conditional prompts for the Qwen3-Omni-30B (Xu et al., [2025b](https://arxiv.org/html/2606.18988#bib.bib17)) model, driving it to generate high-quality, step-by-step reasoning processes. The generated reasoning chains are strictly standardized into the `<Think></Think>`, `<Step></Step>`, and `<Answer></Answer>` structural formats. To mitigate potential inherent biases and factual hallucinations from the language models, all generated reasoning trajectories were rigorously reviewed and scored by professional psychologists. Comprehensive pipeline details and dataset exemplars are provided in the Appendix.
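
To make the tagged layout concrete, the following is a minimal sketch of how a reasoning chain in the `<Think>`, `<Step>`, and `<Answer>` format could be parsed and checked. The tag names come from the text; the nesting of `<Step>` blocks inside `<Think>`, the function name `parse_reasoning_chain`, and the sample string are assumptions made for illustration only, not the authors' pipeline.

```python
import re

# Tag names follow the text; nesting <Step> inside <Think> is an assumption.
THINK_RE = re.compile(r"<Think>(.*?)</Think>", re.DOTALL)
STEP_RE = re.compile(r"<Step>(.*?)</Step>", re.DOTALL)
ANSWER_RE = re.compile(r"<Answer>(.*?)</Answer>", re.DOTALL)


def parse_reasoning_chain(text):
    """Return (steps, answer) if the text follows the tagged layout, else None.

    Hypothetical helper: the real dataset schema may differ.
    """
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    if think is None or answer is None:
        return None
    # Collect every <Step> block found inside the <Think> block.
    steps = [s.strip() for s in STEP_RE.findall(think.group(1))]
    if not steps:
        return None
    return steps, answer.group(1).strip()


sample = (
    "<Think><Step>Transcript summary</Step>"
    "<Step>Visual cues</Step></Think>"
    "<Answer>Deceptive</Answer>"
)
parsed = parse_reasoning_chain(sample)
```

A check of this kind only validates structure; whether the content of each step is factually grounded is a separate question.
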
### 3.2. Progressive Training Strategy

While Supervised Fine-Tuning (SFT) successfully aligns the output format, the model’s inherent reasoning capabilities remain suboptimal, necessitating Reinforcement Learning (RL) (Shao et al., [2024](https://arxiv.org/html/2606.18988#bib.bib19)) for further policy optimization. However, standard group-relative RL algorithms struggle during early training. When confronted with complex, spontaneous deception, the model’s initial reasoning deficits lead to incorrect trajectories, injecting severe gradient noise and risking catastrophic policy collapse. To overcome this bottleneck, we abandon traditional random sampling and introduce a progressive RL framework. By decoupling multimodal sample difficulty, we guide policy iteration through a psychologically grounded, “easy-to-hard” cognitive progression.

#### 3.2.1. Multimodal Difficulty Assessment and Curriculum Sampling

As illustrated in Figure [2](https://arxiv.org/html/2606.18988#S3.F2)(d), to facilitate a smooth cognitive transition during the Reinforcement Learning (RL) (Shao et al., [2024](https://arxiv.org/html/2606.18988#bib.bib19)) phase, we design a difficulty assessment mechanism predicated on the salience of multimodal cues. Specifically, we categorize the samples into four progressive difficulty levels based on the deceptive features exhibited by the speakers: Truthful, Low-level deception, Mid-level deception, and High-level deception. Let $y \in \{0, 1\}$ denote the ground-truth veracity label ($0$ for truthful, $1$ for deceptive). We define three boolean indicator variables, $I_v, I_a, I_c \in \{0, 1\}$, representing the presence of explicit deceptive cues in the visual modality, explicit deceptive cues in the acoustic modality, and significant cross-modal semantic-audio-visual conflicts, respectively.
Consequently, for any given sample $x_i$, its difficulty level $d_i$ is formulated as follows:

$$
d_i = \begin{cases}
0, & \text{if } y = 0 \\
1, & \text{if } y = 1 \text{ and } (I_v = 1 \land I_a = 1) \\
2, & \text{if } y = 1 \text{ and } (I_v \oplus I_a = 1) \\
3, & \text{if } y = 1 \text{ and } (I_v = 0 \land I_a = 0 \land I_c = 1)
\end{cases}
\tag{1}
$$

where $\oplus$ denotes the logical XOR operator. This formally models progressive deception concealment: low-level deceivers show explicit visual-acoustic flaws ($d_i = 1$), mid-level deceivers reveal unimodal inconsistencies ($d_i = 2$), and high-level deceivers fully camouflage audiovisual cues, requiring cross-modal inconsistency verification ($I_c = 1$) to detect latent conflicts ($d_i = 3$).

To prevent optimization instability caused by abrupt shifts in task complexity, we propose a Gaussian-weighted curriculum learning strategy. This smoothly prioritizes easier samples early on and dynamically transitions to harder ones.
For $K = 4$ difficulty levels, the unnormalized sampling weight $S_{Gaussian}(t, k)$ for difficulty $k \in \{0, 1, 2, 3\}$ at step $t$ is defined as:

$$
S_{Gaussian}(t, k) = \exp\left(-\frac{(x_t - \mu_k)^2}{2\sigma^2}\right)
\tag{2}
$$

The dynamic control variable $x_t$, which determines the temporal evolution of the sampling peak center, is calculated as:

$$
x_t = \left(\frac{t}{T}\right)^{\beta} (K - 1)
\tag{3}
$$

where $T$ represents the total number of RL training steps; $\mu_k = k$ denotes the fixed mean corresponding to difficulty level $k$; $\sigma$ is the standard-deviation parameter controlling the concentration of the sampling distribution; and $\beta$ is a non-linear modulation coefficient governing the drift rate of the sampling center $x_t$. To obtain the actual batch sampling probability, we normalize the Gaussian weights across all difficulty levels. The final probability $P(k \mid t)$ of sampling a sample of difficulty $k$ at step $t$ is given by:

$$
P(k \mid t) = \frac{S_{Gaussian}(t, k)}{\sum_{j=0}^{K-1} S_{Gaussian}(t, j)}
\tag{4}
$$

By employing this progressive, difficulty-aware scheduling strategy, we mitigate the issues of sparse rewards and training collapse caused by overly large exploration spaces in the initial RL stages. Concurrently, it compels the model to concentrate on resolving subtle cross-modal conflicts during the mid-to-late training phases, thereby activating the multimodal large language model’s underlying reasoning potential.
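
The difficulty assignment and the Gaussian-weighted scheduler of Equations (1) to (4) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the default values `sigma=1.0` and `beta=1.0` are placeholders (the text does not report them), and returning `None` for a deceptive sample with no cue of any kind is a choice made here because Equation (1) does not cover that case.

```python
import math
import random


def difficulty_level(y, i_v, i_a, i_c):
    """Eq. (1): map the label and cue indicators to a tier in {0, 1, 2, 3}."""
    if y == 0:
        return 0                      # truthful
    if i_v == 1 and i_a == 1:
        return 1                      # cues in both modalities
    if i_v != i_a:
        return 2                      # XOR: cue in exactly one modality
    if i_c == 1:
        return 3                      # only a cross-modal conflict
    return None                       # not covered by Eq. (1) (assumption)


def sampling_probs(t, total_steps, num_levels=4, sigma=1.0, beta=1.0):
    """Eqs. (2)-(4): normalized sampling probabilities at step t."""
    # Eq. (3): the peak center drifts from 0 to K - 1 over training.
    x_t = (t / total_steps) ** beta * (num_levels - 1)
    # Eq. (2) with mu_k = k.
    weights = [
        math.exp(-((x_t - k) ** 2) / (2 * sigma ** 2))
        for k in range(num_levels)
    ]
    total = sum(weights)
    # Eq. (4): normalize across difficulty levels.
    return [w / total for w in weights]


def sample_level(t, total_steps, rng=random, **kwargs):
    """Draw one difficulty tier according to P(k | t)."""
    probs = sampling_probs(t, total_steps, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


early = sampling_probs(0, 100)
late = sampling_probs(100, 100)
```

With these placeholder settings the distribution peaks at tier 0 at the start of training and at tier 3 at the end; a smaller `sigma` concentrates sampling more tightly around the current center, and `beta` above or below 1 slows or accelerates the early drift.
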
#### 3.2.2. Structured Analysis

Traditional deep learning for deception detection (Xiang et al., [2025](https://arxiv.org/html/2606.18988#bib.bib5)) (Lin et al., [2025](https://arxiv.org/html/2606.18988#bib.bib33)) (Li et al., [2025](https://arxiv.org/html/2606.18988#bib.bib34)) relies on implicit feature clustering, which suffers from overfitting and frequently misclassifies genuine stress as deceit. To overcome this, we propose a Structured Analysis and Reflection Mechanism that shifts to explicit step-wise reasoning. Because sophisticated deceivers meticulously camouflage their cues, deception rarely manifests in a single modality; rather, it is embedded in latent cross-modal inconsistencies. By forcing the model to scrutinize divergences across semantic, visual, and acoustic behaviors, we deconstruct the detection task into four reflection steps that mirror criminal psychology:

Structured Analysis and Reflection Mechanism:

- Textual Semantic Anchoring: Extract and understand the textual content as the factual baseline and cognitive context of the entire video.
- Visual Cue Decoding: Focus on fine-grained visual changes within specific time segments, identifying potential deceptive cues such as masking smiles through precise description of micro-expressions and Action Units (AUs).
- Acoustic Feature Mapping: Analyze acoustic fluctuations such as pitch, speech rate, and prosody within corresponding time segments to quantitatively assess whether the speaker is in a state of abnormal tension or anxiety.
- Cross-modal Conflict Reflection: Act as the core reasoning hub to cross-reference and globally evaluate the aforementioned textual, visual, and acoustic outputs, aiming to unearth deep-seated cross-modal inconsistencies.

#### 3.2.3. Format Reward and Accuracy Reward

Specifically, following the rule-based reward paradigm of GRPO (Shao et al., [2024](https://arxiv.org/html/2606.18988#bib.bib19)), we define two reward terms to govern the structure and correctness of the model’s output.
The format reward, denoted as $\mathcal{R}_f$, measures whether the model adheres to the structured output format. It verifies the presence of each intermediate reasoning step $s_i$ and ensures that the final answer is properly enclosed within the `<Answer></Answer>` tags:

$$
\mathcal{R}_f = \begin{cases}
1, & \text{if the step and Answer format are correct;} \\
0, & \text{otherwise.}
\end{cases}
\tag{5}
$$

Meanwhile, the accuracy reward, denoted as $\mathcal{R}_{acc}$, evaluates whether the predicted deception label $\hat{\mathcal{E}}$ aligns with the ground-truth deception label $\mathcal{E}^{*}$:

$$
\mathcal{R}_{acc} = \begin{cases}
1, & \text{if } \hat{\mathcal{E}} = \mathcal{E}^{*}; \\
0, & \text{otherwise.}
\end{cases}
\tag{6}
$$

The two reward terms are jointly used to ensure that the generated output meets the structural requirements and arrives at the correct label.

#### 3.2.4. Visual-Audio Consistency Reasoning Reward and Reflection Mechanism

The previously discussed reward mechanisms rely exclusively on outcome supervision, rendering the model highly susceptible to “shortcut learning”. In such scenarios, the model may hallucinate flawed or irrational reasoning trajectories merely to reach the correct final answer. To overcome this limitation, we propose the Visual-Audio Consistency reward of VAC-GRPO, designed to impose fine-grained constraints on the factual accuracy and feature completeness of the audio-visual reasoning process. Specifically, we leverage a knowledge distillation strategy to pre-train a lightweight judge model based on the Qwen2.5-Omni-3B (Xu et al., [2025a](https://arxiv.org/html/2606.18988#bib.bib18)) architecture.
We define a structured factual ground-truth set $J = \{F_v, F_a\}$ derived from raw videos, where $F_v$ contains continuous emotion probability distributions and AU intensities, and $F_a$ includes disentangled pitch, speech rate, and prosody. To build the training corpus for this judge model, we prompt GPT-4o (Hurst et al., [2024](https://arxiv.org/html/2606.18988#bib.bib37)) with the factual baseline $J$ to generate training data scored along two distinct dimensions: Factual Accuracy and Feature Completeness. To mitigate potential inherent biases and factual hallucinations from the language models, all generated reasoning trajectories were rigorously reviewed and scored by professional psychologists. During the RL phase, the generated visual ($s_2$) and acoustic ($s_3$) reasoning steps are fed into the frozen judge model along with the factual baseline $J$ and specifically tailored evaluation prompts $P_v$ and $P_a$:

Prompt $P_v$ (Visual Evaluation): Evaluate whether the following textual description accurately and comprehensively captures the visual content of the video in terms of factual alignment and feature completeness.

Prompt $P_a$ (Acoustic Evaluation): Evaluate whether the following textual description accurately and comprehensively captures the acoustic content of the video in terms of factual alignment and feature completeness.

Conditioned on the factual ground-truth set $J$, the frozen judge model $\mathcal{M}_{judge}$ generates discrete reflective outputs for the visual and acoustic reasoning steps, respectively:

$$
\hat{y}_v = \mathcal{M}_{judge}(F_v, s_2, P_v), \quad \hat{y}_a = \mathcal{M}_{judge}(F_a, s_3, P_a)
\tag{7}
$$

The reflective outputs $\hat{y}_v$ and $\hat{y}_a$ are discrete binary indicators (“Yes” or “No”).
Consequently, the modality-specific consistency rewards (denoted as $\mathcal{R}_v$ and $\mathcal{R}_a$) are formally defined as follows:

$$
\mathcal{R}_m = \begin{cases}
1, & \text{if } \hat{y}_m = \text{Yes;} \\
0, & \text{if } \hat{y}_m = \text{No,}
\end{cases}
\quad \text{for } m \in \{v, a\}
\tag{8}
$$

By incorporating this dual-modality reward mechanism, we ensure that the generated reasoning process achieves superior quality across two critical dimensions:

- Factuality: It rigorously verifies whether the textual reasoning strictly adheres to the objective physical features delineated in $J$, such as subtle micro-expression variations and dynamic pitch shifts.
- Completeness: It assesses whether the model comprehensively captures and articulates the salient physical traits present in $J$.

Ultimately, this constraint guarantees a high degree of multimodal consistency between the step-wise textual descriptions and the actual audio-visual content.

However, for certain high-level deceivers, easily perceptible deceptive cues may not manifest within the audio-visual modalities. To address this, we propose a reflection reward mechanism based on cross-modal inconsistency verification. Let $y \in \{0, 1\}$ denote the ground-truth label of the sample (where $1$ indicates the presence of deceptive behavior). Let $E \in \{0, 1\}$ be a boolean indicator variable representing whether the model extracts salient unimodal abnormal features during the generation of steps $s_1$ to $s_3$ ($E = 1$ indicates the presence of explicit features). Furthermore, we define an indicator function $\Phi_{conflict}(s_4) \in \{0, 1\}$ to determine whether the model explicitly conducts logical reasoning and transition analysis regarding “cross-modal conflicts, contradictions, or camouflage” in the output $s_4$ of the fourth stage.
Based on this, the conditional logic alignment reward $\mathcal{R}_{logic}$ is mathematically defined as follows:

$$
\mathcal{R}_{logic} = \begin{cases}
+1.0, & \text{if } (y = 1 \land E = 0) \text{ and } \Phi_{conflict}(s_4) = 1 \\
-1.0, & \text{if } (y = 1 \land E = 0) \text{ and } \Phi_{conflict}(s_4) = 0 \\
+1.0, & \text{if } (y = 1 \land E = 1) \text{ and } \Phi_{conflict}(s_4) = 1 \\
-1.0, & \text{if } y = 0 \text{ and } \Phi_{conflict}(s_4) = 1 \\
0.0, & \text{otherwise}
\end{cases}
\tag{9}
$$

This reward function ensures that if no deceptive cues are captured during the initial three reasoning steps, the model is compelled to engage in deep reflective reasoning; conversely, if explicit deceptive cues are detected, it directly proceeds to output the conclusion. More detailed explanations are provided in the Appendix. Ultimately, the joint reasoning reward is obtained by averaging the two audio-visual consistency rewards and the logic reflection reward:

$$
\mathcal{R}_{reasoning} = \frac{\mathcal{R}_a + \mathcal{R}_v + \mathcal{R}_{logic}}{3}
\tag{10}
$$

Figure 3. Qualitative comparison between ThinkDeception and baseline models.

Table 1. Performance comparison of ThinkDeception with state-of-the-art baselines on four deception detection datasets.
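
As a reading aid for the consistency, logic, and joint reasoning rewards defined in Equations (8) to (10), the sketch below encodes them as plain Python functions. It is a hedged illustration rather than the authors' code: the function names are hypothetical, the judge output is assumed to arrive as the literal string "Yes" or "No", and the indicators $E$ and $\Phi_{conflict}(s_4)$ are passed in as precomputed 0/1 values because the text does not specify how they are extracted.

```python
def consistency_reward(judge_output):
    """Eq. (8): 1 if the frozen judge answers "Yes", otherwise 0.

    Assumes the judge output is a plain "Yes"/"No" string.
    """
    return 1.0 if judge_output.strip().lower() == "yes" else 0.0


def logic_reward(y, e, phi_conflict):
    """Eq. (9): conditional logic alignment reward.

    y: ground-truth label, e: explicit unimodal cue found in steps 1-3,
    phi_conflict: step 4 explicitly analyzes a cross-modal conflict.
    """
    if y == 1 and phi_conflict == 1:
        return 1.0    # rows 1 and 3: conflict analysis on a deceptive sample
    if y == 1 and e == 0 and phi_conflict == 0:
        return -1.0   # row 2: no cue found and no reflection attempted
    if y == 0 and phi_conflict == 1:
        return -1.0   # row 4: conflict claimed on a truthful sample
    return 0.0        # all remaining cases


def reasoning_reward(r_a, r_v, r_logic):
    """Eq. (10): average of the two consistency rewards and the logic reward."""
    return (r_a + r_v + r_logic) / 3.0


example = reasoning_reward(
    consistency_reward("Yes"),   # acoustic step judged consistent
    consistency_reward("No"),    # visual step judged inconsistent
    logic_reward(1, 0, 0),       # deceptive sample, no cue, no reflection
)
```

Because the logic term ranges over $\{-1, 0, +1\}$ while the consistency terms are in $\{0, 1\}$, the joint reward in this sketch lies between $-1/3$ and $1$.
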
Table 1 reports classification accuracy (%) across both in-domain (DOLOS, MDPE) and cross-domain (RLTD, Box of Lies) settings, along with the Average Reasoning Quality Score (Avg RS, on a 1-5 scale).

Ultimately, the overall optimization objective for the proposed VAC-GRPO framework is formulated as a weighted sum of the aforementioned reward components:

$$
\mathcal{R}_{total} = \alpha_f \mathcal{R}_f + \alpha_a \mathcal{R}_{acc} + \alpha_r \mathcal{R}_{reasoning}
\tag{11}
$$

where the coefficients $\alpha_f$, $\alpha_a$, and $\alpha_r$ are hyperparameter weights that govern the relative contribution of each reward component during the policy update.

## 4. Experiments

### 4.1. Datasets and Evaluation Metrics

To comprehensively evaluate the model’s accuracy and reasoning capabilities in deception detection, our experiments are conducted on our newly constructed multimodal Chain-of-Thought dataset, Deception-10K. Specifically, we select four mainstream deception detection subsets encompassed by this dataset. Regarding the training protocol, we conduct independent model training on the DOLOS (Guo et al., [2023](https://arxiv.org/html/2606.18988#bib.bib9)) and MDPE (Cai et al., [2025](https://arxiv.org/html/2606.18988#bib.bib8)) datasets (in-domain training). Conversely, the remaining two datasets (RLTD (Pérez-Rosas et al., [2015](https://arxiv.org/html/2606.18988#bib.bib6)) and Box of Lies (Soldner et al., [2019](https://arxiv.org/html/2606.18988#bib.bib7))) are strictly reserved as unseen test beds, dedicated exclusively to assessing the model’s cross-domain generalization performance across diverse scenarios, racial demographics, and elicitation environments. In terms of evaluation metrics, we adopt standard classification Accuracy (ACC) as the core quantitative metric for the final deception recognition results.
Furthermore, as this paper pioneers the introduction of an explicit reasoning paradigm to this field, we incorporate reasoning quality metrics. By employing both an LLM-as-a-Judge and human expert blind reviews, we conduct a fine-grained quantitative assessment of the generated Chain-of-Thought (CoT) across critical dimensions, including factual consistency and logical coherence.

### 4.2. Baseline Methods

To thoroughly demonstrate the efficacy of the ThinkDeception framework, we compare it against 7 representative baseline methods, categorized into two groups:

- Traditional Multimodal Deep Learning Methods: We select LCUNet (Xiang et al., [2025](https://arxiv.org/html/2606.18988#bib.bib5)), MMPDA (Lin et al., [2025](https://arxiv.org/html/2606.18988#bib.bib33)), and CogGuided (Li et al., [2025](https://arxiv.org/html/2606.18988#bib.bib34)). To maintain experimental fairness, we strictly isolate the Chain-of-Thought (CoT) texts in Deception-10K during the training of these models. Consequently, they are trained solely on the raw audio-visual features coupled with binary veracity labels.
- Omni Large Language Models: Qwen2.5-Omni-7B (Xu et al., [2025a](https://arxiv.org/html/2606.18988#bib.bib18)), Qwen3-Omni-30B (Xu et al., [2025b](https://arxiv.org/html/2606.18988#bib.bib17)), Gemini 2.5 Pro (Comanici et al., [2025](https://arxiv.org/html/2606.18988#bib.bib35)), and GLM-4.6v (Hong et al., [2025](https://arxiv.org/html/2606.18988#bib.bib36)).

### 4.3. Implementation Details

All training procedures are conducted on 8× NVIDIA A100 (80GB) GPUs. In the supervised fine-tuning (SFT) cold-start phase, we adopt Qwen2.5-Omni-7B (Xu et al., [2025a](https://arxiv.org/html/2606.18988#bib.bib18)) as the foundation model and fine-tune it for one epoch on a subset of the Deception-10K dataset. This yields the baseline model, ThinkDeception-Base, which is equipped with preliminary step-by-step reasoning capabilities.
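
As an implementation-oriented illustration of the reward definitions in Eqs. (9)-(11), the following Python sketch shows one way the logic reward, the joint reasoning reward, and the weighted total could be computed for a single rollout. It is a minimal sketch, not the authors' code. The function names are hypothetical, and reading `y` as the ground-truth label (1 = deceptive) and `e` as the indicator that explicit cues were found in the first three reasoning steps is an assumption drawn from the explanation accompanying Eq. (9). The rewards `r_a`, `r_v`, `r_f`, and `r_acc` are taken as already-computed inputs, and the weight values in the usage lines are placeholders, except that $\alpha_{r}=0.5$ is the setting the ablation reports as best.

```python
def logic_reward(y: int, e: int, conflict: int) -> float:
    """Conditional logic alignment reward, transcribing the cases of Eq. (9).

    y        -- assumed: ground-truth label (1 = deceptive, 0 = truthful)
    e        -- assumed: 1 if explicit cues were found in steps 1-3, else 0
    conflict -- value of Phi_conflict(s_4), either 0 or 1
    """
    if y == 1 and e == 0 and conflict == 1:
        return 1.0
    if y == 1 and e == 0 and conflict == 0:
        return -1.0
    if y == 1 and e == 1 and conflict == 1:
        return 1.0
    if y == 0 and conflict == 1:
        return -1.0
    return 0.0  # every remaining combination


def reasoning_reward(r_a: float, r_v: float, r_logic: float) -> float:
    """Joint reasoning reward of Eq. (10): the mean of the three components."""
    return (r_a + r_v + r_logic) / 3.0


def total_reward(r_f: float, r_acc: float, r_reasoning: float,
                 alpha_f: float, alpha_a: float, alpha_r: float) -> float:
    """Weighted sum of Eq. (11)."""
    return alpha_f * r_f + alpha_a * r_acc + alpha_r * r_reasoning


# Usage with illustrative inputs (alpha_f and alpha_a are placeholder values).
r_logic = logic_reward(y=1, e=0, conflict=1)
r_reason = reasoning_reward(r_a=1.0, r_v=0.0, r_logic=r_logic)
r_total = total_reward(r_f=1.0, r_acc=1.0, r_reasoning=r_reason,
                       alpha_f=0.25, alpha_a=1.0, alpha_r=0.5)
print(r_logic, r_reason, r_total)
```

Keeping the three functions separate mirrors the structure of the equations, so each component can be logged per rollout and the weights can be varied without touching the reward logic.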
During the subsequent reinforcement learning (RL) phase, we employ the Group Relative Policy Optimization (GRPO) algorithm (Guo et al., [2025](https://arxiv.org/html/2606.18988#bib.bib20); Ramesh et al., [2024](https://arxiv.org/html/2606.18988#bib.bib51)) with a learning rate of $1\times 10^{-6}$. For each input video-text pair, the policy model generates $K=8$ candidate reasoning trajectories (rollouts), with the sampling process executed every 50 training steps.

### 4.4. Evaluation Metrics

To assess the generalization capability of our model, we conduct evaluations under both in-domain and cross-domain settings. Across all benchmark datasets, we employ classification accuracy as the primary objective metric. Furthermore, specifically for Multimodal Large Language Models (MLLMs), we introduce a Reasoning Quality Score to quantitatively measure the logical coherence and cross-modal factual consistency of the generated rationales.

### 4.5. Comparative Results

As illustrated in Table [2](https://arxiv.org/html/2606.18988#S4.T2), ThinkDeception achieves state-of-the-art (SOTA) performance in both overall detection accuracy and reasoning quality. Specifically, our model secures the highest accuracy across all evaluated datasets, reaching an average accuracy of 73.76% and outperforming the second-best baseline by an absolute margin of 8.52 percentage points. Notably, while the majority of existing baseline models excel in general visual understanding tasks, their performance in deception detection hovers around the random-guess baseline of 50%, despite being guided by identical prompts. This degradation underscores the need for domain-specific Multimodal Large Language Models (MLLMs) tailored for deception detection.
Furthermore, our model demonstrates strong robustness in cross-domain evaluations, particularly on the challenging, multi-speaker Box of Lies (BOL) (Soldner et al., [2019](https://arxiv.org/html/2606.18988#bib.bib7)) dataset. This indicates that, rather than merely overfitting to surface-level feature mappings, ThinkDeception has internalized a generalized and unified reasoning paradigm for deception recognition.

Figure 4. (a) Test accuracy comparison between VAC-GRPO and standard GRPO. (b) Dynamic sampling distribution across distinct difficulty levels over training steps. (c) Ablation study on key hyperparameters.

### 4.6. Ablation Study

Experimental results demonstrate that Supervised Fine-Tuning (SFT) yields a notable improvement in accuracy, substantiating the efficacy of our proposed modality inconsistency verification. The subsequent integration of VAC-GRPO reinforcement learning further elevates the model's performance. Furthermore, ablation studies on the core reward mechanisms reveal a phenomenon consistent with deceptive psychology: low-level visual-audio conflicts are more discriminative than pure textual logic. This empirical finding corroborates that while deceivers can often fabricate logically watertight lies, it is exceedingly difficult for them to simultaneously suppress the physiological tension manifested in their visual and acoustic cues. This aligns with earlier observations that over-reliance on textual priors in deception detection makes the model highly susceptible to overfitting.

Table 2. Ablation studies on the proposed ThinkDeception framework.

Finally, we conduct hyperparameter ablations. Results indicate that the model achieves optimal performance when the number of sampled trajectories is set to $K=8$. Additionally, sensitivity analysis on $\alpha_{a}$ and $\alpha_{r}$ shows that the peak performance occurs at $\alpha_{r}=0.5$.
An excessively high $\alpha_{r}$ leads to performance degradation, demonstrating that overemphasizing intermediate reasoning signals can interfere with the advantage estimation of the primary task, thereby introducing optimization instability. This finding underscores the importance of balancing the reward components during multimodal reinforcement learning.

### 4.7. Qualitative Analysis

As illustrated in Figure [4](https://arxiv.org/html/2606.18988#S4.F4), compared to state-of-the-art models such as Qwen2.5-Omni-7B (Xu et al., [2025a](https://arxiv.org/html/2606.18988#bib.bib18)), Gemini 2.5 Pro (Comanici et al., [2025](https://arxiv.org/html/2606.18988#bib.bib35)), and Qwen3-Omni-30B (Xu et al., [2025b](https://arxiv.org/html/2606.18988#bib.bib17)), our ThinkDeception framework achieves the highest results in both reasoning quality and factual consistency. The baseline models generally suffer from short-circuit reasoning, erroneous classifications, and severe hallucination issues where the reasoning trajectories are disconnected from factual evidence.

### 4.8. Reliability of the Foundational Models

Details and reliability evaluations for the adopted generative model (Qwen3-Omni-30B (Xu et al., [2025b](https://arxiv.org/html/2606.18988#bib.bib17))) and judge models (GPT-4o (Hurst et al., [2024](https://arxiv.org/html/2606.18988#bib.bib37)), Qwen2.5-Omni-3B (Xu et al., [2025a](https://arxiv.org/html/2606.18988#bib.bib18))), together with the procedure for training a lightweight judge model (Li et al., [2026](https://arxiv.org/html/2606.18988#bib.bib52)), are provided in the Appendix.

## 5. Conclusion

In this paper, we propose ThinkDeception, successfully introducing the reasoning capabilities of Large Language Models into the domain of deception detection for the first time.
This work drives a paradigm shift in deception recognition from a traditional binary classification task to an interpretable reasoning process. During the training phase, we design a progressive learning strategy to guide the model in internalizing deceptive features in an easy-to-hard manner. Concurrently, we introduce VAC-GRPO, enabling the model to conduct rigorous step-by-step reasoning. This architecture empowers the model to generate factually consistent reasoning steps and precisely capture latent deceptive cues. Comprehensive experiments across multiple benchmark datasets demonstrate that ThinkDeception establishes a new state-of-the-art in both detection accuracy and reasoning quality.

## References

- C. Cai, S. Liang, X. Liu, K. Zhu, Z. Wen, J. Tao, H. Xie, J. Cui, Y. Ma, Z. Cheng, H. Xu, R. Fu, B. Liu, and Y. Li (2025). MDPE: a multimodal deception dataset with personality and emotional characteristics. In Proceedings of the 33rd ACM International Conference on Multimedia, MM ’25, New York, NY, USA, pp. 12957–12964. ISBN 9798400720352. [Link](https://doi.org/10.1145/3746027.3758242).
- Y. Chen, Y. Ge, R. Wang, Y. Ge, J. Cheng, Y. Shan, and X. Liu (2025). GRPO-care: consistency-aware reinforcement learning for multimodal reasoning. arXiv:2506.16141. [Link](https://arxiv.org/abs/2506.16141).
- Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. (2024). Internvl: scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24185–24198.
- G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025). Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- Y. Fang, W. Huang, P. Fu, Y. Yang, K. Su, Z. Luo, J. Luan, and M. Ye (2026). EMO-r3: reflective reinforcement learning for emotional reasoning in multimodal large language models. arXiv preprint arXiv:2602.23802.
- A. P. Fard, M. M. Hosseini, T. D. Sweeny, and M. H. Mahoor (2026). AffectNet+: a database for enhancing facial expression recognition with soft-labels. IEEE Transactions on Affective Computing 17(1), pp. 784–800. [Document](https://dx.doi.org/10.1109/TAFFC.2025.3634523).
- K. Feng, K. Gong, B. Li, Z. Guo, Y. Wang, T. Peng, J. Wu, X. Zhang, B. Wang, and X. Yue (2025). Video-r1: reinforcing video reasoning in mllms. arXiv:2503.21776. [Link](https://arxiv.org/abs/2503.21776).
- V. Foucher, S. de Leon-Martinez, and R. Moro (2025). Eye movements as indicators of deception: a machine learning approach. In Proceedings of the 2025 Symposium on Eye Tracking Research and Applications, pp. 1–7.
- D. Guo, D. Yang, H. Zhang, et al. (2025). DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645(8081), pp. 633–638. ISSN 1476-4687. [Document](https://dx.doi.org/10.1038/s41586-025-09422-z).
- X. Guo, N. M. Selvaraj, Z. Yu, A. W. Kong, B. Shen, and A. Kot (2023). Audio-visual deception detection: dolos dataset and parameter-efficient crossmodal learning. arXiv:2303.12745. [Link](https://arxiv.org/abs/2303.12745).
- X. Guo, Z. Yu, N. M. Selvaraj, B. Shen, A. W. Kong, and A. C. Kot (2024). Benchmarking cross-domain audio-visual deception detection. arXiv preprint arXiv:2405.06995.
- W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, et al. (2025). Glm-4.5 v and glm-4.1 v-thinking: towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006.
- J. Hu, L. Mathur, P. P. Liang, and L. Morency (2025). OpenFace 3.0: a lightweight multitask system for comprehensive facial behavior analysis. arXiv preprint arXiv:2506.02891.
- X. Hu, Y. Tai, X. Zhao, C. Zhao, Z. Zhang, J. Li, B. Zhong, and J. Yang (2024). Exploiting multimodal spatial-temporal patterns for video object tracking. arXiv:2412.15691. [Link](https://arxiv.org/abs/2412.15691).
- W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Z. Xu, X. Tang, Y. Hu, and S. Lin (2026). Vision-r1: incentivizing reasoning capability in multimodal large language models. arXiv:2503.06749. [Link](https://arxiv.org/abs/2503.06749).
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024). Gpt-4o system card. arXiv preprint arXiv:2410.21276.
- B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, et al. (2024a). Llava-onevision: easy visual task transfer. arXiv preprint arXiv:2408.03326.
- H. Li, W. Tian, H. Xie, Z. Hu, Z. Yang, and Z. Wang (2025). Multimodal deception detection via cognitively guided inconsistency modeling. In Proceedings of the 1st International Workshop & Challenge on Subtle Visual Computing, SVC ’25, New York, NY, USA, pp. 40–45. ISBN 9798400718373. [Link](https://doi.org/10.1145/3728425.3759922).
- Z. Li, B. Yang, Q. Liu, Z. Ma, S. Zhang, J. Yang, Y. Sun, Y. Liu, and X. Bai (2024b). Monkey: image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26763–26773.
- Z. Li, Y. Zhang, M. Li, Y. Ji, Y. Zeng, N. Cheng, Y. Zhu, Y. Wang, S. Wang, J. Xiao, and D. He (2026). Rethinking llm-as-a-judge: representation-as-a-judge with small language models via semantic capacity asymmetry. arXiv:2601.22588. [Link](https://arxiv.org/abs/2601.22588).
- R. Lin, S. Mai, Y. Zeng, Q. He, A. Xiong, and H. Hu (2025). Multi-source multimodal progressive domain adaption for audio-visual deception detection. In Proceedings of the 1st International Workshop & Challenge on Subtle Visual Computing, SVC ’25, New York, NY, USA, pp. 52–58. ISBN 9798400718373. [Link](https://doi.org/10.1145/3728425.3759924).
- H. Liu, C. Li, Y. Li, and Y. J. Lee (2024). Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306.
- K. Liu, D. Yang, Z. Qian, W. Yin, Y. Wang, H. Li, J. Liu, P. Zhai, Y. Liu, and L. Zhang (2025). Reinforcement learning meets large language models: a survey of advancements and applications across the llm lifecycle. arXiv preprint arXiv:2509.16679.
- S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone (2020). Curriculum learning for reinforcement learning domains: a framework and survey. Journal of Machine Learning Research 21(181), pp. 1–50.
- V. Pérez-Rosas, M. Abouelenien, R. Mihalcea, and M. Burzo (2015). Deception detection using real-life trial data. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI ’15, New York, NY, USA, pp. 59–66. ISBN 9781450339124. [Link](https://doi.org/10.1145/2818346.2820758).
- S. S. Ramesh, Y. Hu, I. Chaimalas, V. Mehta, P. G. Sessa, H. Bou Ammar, and I. Bogunovic (2024). Group robust preference optimization in reward-free rlhf. Advances in Neural Information Processing Systems 37, pp. 37100–37137.
- X. Rong, W. Huang, J. Liang, J. Bi, X. Xiao, Y. Li, B. Du, and M. Ye (2025). Backdoor cleaning without external guidance in mllm fine-tuning. arXiv preprint arXiv:2505.16916.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300. [Link](https://arxiv.org/abs/2402.03300).
- F. Soldner, V. Pérez-Rosas, and R. Mihalcea (2019). Box of lies: multimodal deception detection in dialogues. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 1768–1777. [Link](https://aclanthology.org/N19-1175/), [Document](https://dx.doi.org/10.18653/v1/N19-1175).
- D. Wang, S. Liu, T. Zhang, Y. Chen, J. Li, and H. Meng (2026a). EmotionThinker: prosody-aware reinforcement learning for explainable speech emotion reasoning. arXiv:2601.15668. [Link](https://arxiv.org/abs/2601.15668).
- P. Wang, Z. Ma, X. Dai, Y. Liu, S. Feng, X. Yang, W. Hu, Z. Wang, M. Pan, L. Yuan, et al. (2026b). SAFE-qaq: end-to-end slow-thinking audio-text fraud detection via reinforcement learning. arXiv preprint arXiv:2601.01392.
- P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. (2024). Qwen2-vl: enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191.
- X. Xiang, S. Li, J. Huang, Q. Yan, Z. Zhu, H. Zhang, and J. Ma (2025). LCUNet: a lightweight concatenated unified mapping multi-modal deception detector. In Proceedings of the 1st International Workshop & Challenge on Subtle Visual Computing, SVC ’25, New York, NY, USA, pp. 46–51. ISBN 9798400718373. [Link](https://doi.org/10.1145/3728425.3759923).
- J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin (2025a). Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215.
- J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, Y. Lv, Y. Wang, D. Guo, H. Wang, L. Ma, P. Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P. Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin (2025b). Qwen3-omni technical report. arXiv preprint arXiv:2509.17765.
- J. Yang, G. Liu, and S. C. Huang (2020). Emotion transformation feature: novel feature for deception detection in videos. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 1726–1730. [Document](https://dx.doi.org/10.1109/ICIP40778.2020.9190846).
- Q. Yang, M. Ye, and B. Du (2024). Emollm: multimodal emotional understanding meets large language models. arXiv preprint arXiv:2406.16442.
- M. Ye, X. Rong, W. Huang, B. Du, N. Yu, and D. Tao (2025). A survey of safety on large vision-language models: attacks, defenses and evaluations. arXiv preprint arXiv:2502.14881.
- F. Zhang, D. Li, Q. Zhang, J. Chen, G. Liu, J. Lin, J. Yan, J. Liu, and Z. Zha (2025). Fact-r1: towards explainable video misinformation detection with deep reasoning. arXiv:2505.16836. [Link](https://arxiv.org/abs/2505.16836).
- C. Zhao, Z. Yang, Y. Hu, Q. Guo, Z. Wang, P. Li, and W. Ji (2026). ThinkDrive: chain-of-thought guided progressive reinforcement learning fine-tuning for autonomous driving. arXiv preprint arXiv:2601.04714.
- G. Zhou, P. Qiu, C. Chen, J. Wang, Z. Yang, J. Xu, and M. Qiu (2025). Reinforced mllm: a survey on rl-based reasoning in multimodal large language models. arXiv preprint arXiv:2504.21277.
- D. Zhu, C. Zhang, R. Hu, M. Wang, L. Liao, and M. Ye (2025). Detecting deceptive behavior via learning relation-aware visual representations. IEEE Transactions on Information Forensics and Security 20, pp. 7077–7090. [Document](https://dx.doi.org/10.1109/TIFS.2025.3586468).