Beyond Skepticism: Evaluating LLMs' Pedagogical Intent Reasoning with the Adaptive Pedagogical Vigilance Framework
Summary
This paper introduces the Adaptive Pedagogical Vigilance (APV) framework to evaluate LLMs' ability to reason about pedagogical intent in instructional communication. The framework uses Bayesian inference and shows improvements in models like GPT-4o and Claude 3.5 for distinguishing pedagogical content.
# Beyond Skepticism: Evaluating LLMs’ Pedagogical Intent Reasoning with the Adaptive Pedagogical Vigilance Framework
Source: [https://arxiv.org/html/2607.01581](https://arxiv.org/html/2607.01581)
Minghao Chen, Ruihan Zhou, Jiayi Tang, Zihan Xu, Bowen Huang, Yuxin Liu
Department of Computer Science, Zhejiang University
###### Abstract
The capacity of Large Language Models \(LLMs\) to reason about pedagogical intent within instructional communication remains underexplored, particularly in educational domains such as translation pedagogy\. To address this, we propose the Adaptive Pedagogical Vigilance \(APV\) framework, a novel computational formalism that reframes communicative vigilance as an adaptive mechanism for optimizing learning through intent inference\. APV formalizes the problem via a Bayesian Pedagogical Intent Inference Engine \(PIIE\), which models how instructors select content to maximize pedagogical utility and how vigilant learners should inversely reason about latent instructional configurations, encompassing genre, stance, and incentives\. We evaluate APV through a three\-tier hierarchy: distinguishing instructional genre, reasoning about structured pedagogical setups, and generalizing to authentic educational discourse\. Experiments on leading LLMs \(e\.g\., GPT\-4o, Claude 3\.5\) show that APV substantially improves model vigilance\. It achieves the strongest discrimination between pedagogical and exposure\-based content, correlates highly with human judgments \($r = 0.958$\), and maintains robust performance on naturalistic data where baseline methods degrade\. This work establishes a unified framework for assessing and enhancing LLMs’ understanding of pedagogical motives, advancing the development of more reliable AI\-assisted learning systems\.
Figure 1: Motivation: Pedagogical intent can warrant neither naive trust nor blanket skepticism; APV targets appropriate vigilance for instruction\-based belief updating\.

## 1 Introduction
A substantial fraction of the information processed by large language models \(LLMs\) stems from intentional human communication, ranging from social media posts and reviews to formal arguments\. To navigate such contexts effectively, humans rely on epistemic vigilance, the ability to assess information by inferring the motives and incentives of its source \(Sperber et al\., [2010](https://arxiv.org/html/2607.01581#bib.bib6); Qu and Ma, [2025](https://arxiv.org/html/2607.01581#bib.bib50); Qi et al\., [2022](https://arxiv.org/html/2607.01581#bib.bib51); Zhou et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib65)\)\. A central component is motivational vigilance, which supports selective learning by discriminating benevolent advice from manipulative communication \(Wu et al\., [2022](https://arxiv.org/html/2607.01581#bib.bib52), [2024c](https://arxiv.org/html/2607.01581#bib.bib53)\)\. As LLMs are increasingly deployed as autonomous agents, it becomes imperative to evaluate whether they exhibit a comparable capacity for vigilant social reasoning\.
Current evidence indicates that LLMs struggle with this form of vigilance \(Wu et al\., [2020](https://arxiv.org/html/2607.01581#bib.bib54); Wang et al\., [2023](https://arxiv.org/html/2607.01581#bib.bib55)\)\. They are known to be susceptible to jailbreaking \(Wei et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib15); Zou et al\., [2023](https://arxiv.org/html/2607.01581#bib.bib16); Wu et al\., [2024b](https://arxiv.org/html/2607.01581#bib.bib56), [a](https://arxiv.org/html/2607.01581#bib.bib57)\) and display behaviors such as sycophancy, often prioritizing user alignment over truth\-seeking \(Sharma et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib14); Perez et al\., [2022](https://arxiv.org/html/2607.01581#bib.bib13); Tian et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib59); Lin, [2025b](https://arxiv.org/html/2607.01581#bib.bib60)\)\. These shortcomings originate from training paradigms that emphasize instruction\-following and user satisfaction, while neglecting the critical evaluation of a speaker’s underlying incentives \(Ouyang et al\., [2022](https://arxiv.org/html/2607.01581#bib.bib18); Lin, [2025a](https://arxiv.org/html/2607.01581#bib.bib61), [c](https://arxiv.org/html/2607.01581#bib.bib62)\)\. Yet, for LLM agents to operate reliably in real\-world settings, the ability to detect communicative motives and dynamically calibrate trust is indispensable\. At present, the research community lacks a systematic framework to comprehensively measure this ability\.
To bridge this gap, we propose the Adaptive Pedagogical Vigilance \(APV\) framework \(Yang et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib63); He et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib64)\)\. We redefine vigilance not simply as social skepticism, but as an adaptive cognitive mechanism that optimizes learning outcomes by inferring pedagogical intent\. Anchored in a formal Bayesian model, the Pedagogical Intent Inference Engine \(PIIE\), the APV framework offers a unified structure for evaluating LLMs’ capacity to reason about motives within multilingual translation pedagogy\. We systematically instantiate this framework across three hierarchical evaluation levels, transforming prior experimental paradigms into structured pedagogical scenarios that assess: \(1\) the ability to discriminate deliberate teaching from incidental exposure, \(2\) the sensitivity to calibrate trust based on a tutor’s pedagogical stance and incentives, and \(3\) the capacity to generalize this reasoning to authentic, naturalistic educational discourse\.
Our experiments reveal that while baseline LLMs exhibit a basic sensitivity to motives \(bubeck2023sparks; niu2024large; cao2025cofi\), the APV framework enables state\-of\-the\-art performance \(wei2022chain; cao2025purifygen\)\. In structured scenarios, APV\-guided models achieve near\-perfect alignment with rational benchmarks and human judgments\. Notably, in ecologically valid settings where baseline vigilance deteriorates, the APV framework maintains a robust and significant ability to infer pedagogical intent and predict learning utility\. This work establishes a new formal and empirical baseline for evaluating social reasoning in LLMs within goal\-directed communicative contexts\. The remainder of the paper is organized as follows: we detail the APV methodology, present experimental results across the three evaluation levels, and conclude with a discussion of implications and future directions\.
## 2 Related Work
Our work bridges three research streams: social cognition studies on human motivational vigilance, evaluations of large language models regarding their social capabilities and failures, and the application of cognitive science frameworks to analyze LLM behavior\. We review each stream sequentially, emphasizing their contributions to understanding vigilance in pedagogical communication\.
### 2\.1 Motivational Vigilance in Humans
Effective social learning necessitates distinguishing reliable from unreliable information sources \(Henrich and Gil\-White, [2001](https://arxiv.org/html/2607.01581#bib.bib24); Heyes, [2018](https://arxiv.org/html/2607.01581#bib.bib25); Tomasello et al\., [2005](https://arxiv.org/html/2607.01581#bib.bib26); Xin et al\., [2025a](https://arxiv.org/html/2607.01581#bib.bib69)\)\. This ability, termed vigilance, is critical for behaviors such as disagreement resolution and deception detection \(Levine, [2014](https://arxiv.org/html/2607.01581#bib.bib29); Bond Jr and DePaulo, [2006](https://arxiv.org/html/2607.01581#bib.bib30); Mercier and Sperber, [2017](https://arxiv.org/html/2607.01581#bib.bib28); Xin et al\., [2025b](https://arxiv.org/html/2607.01581#bib.bib70)\)\. Social cognition research distinguishes between vigilance toward a source’s *competence* \(knowledge\) and toward its *motivations* \(benevolence\) \(Sperber et al\., [2010](https://arxiv.org/html/2607.01581#bib.bib6); Xin et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib71)\)\. While extensive work examines competence vigilance \(Perfors et al\., [2011](https://arxiv.org/html/2607.01581#bib.bib23); Griffiths and Tenenbaum, [2006](https://arxiv.org/html/2607.01581#bib.bib20); Yu, [2025](https://arxiv.org/html/2607.01581#bib.bib72)\), research on motivational vigilance focuses on two key judgment factors: a speaker’s underlying *intentions* \(altruistic versus selfish\) \(Cialdini and Goldstein, [2004](https://arxiv.org/html/2607.01581#bib.bib27); Xiang et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib73)\) and their situational *incentives* to deceive \(Levine, [2014](https://arxiv.org/html/2607.01581#bib.bib29)\)\. Attending to these factors helps mitigate manipulation, although manipulators can exploit social dynamics such as reciprocity \(Cialdini and Goldstein, [2004](https://arxiv.org/html/2607.01581#bib.bib27); Bai et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib75)\)\.
This process exemplifies strategic, recursive social inference: listeners reason about why speakers choose specific utterances, and speakers anticipate these inferences \(Frank and Goodman, [2012](https://arxiv.org/html/2607.01581#bib.bib31); Goodman and Frank, [2016](https://arxiv.org/html/2607.01581#bib.bib32); Hawkins et al\., [2015](https://arxiv.org/html/2607.01581#bib.bib33); Wang et al\., [2011](https://arxiv.org/html/2607.01581#bib.bib76)\)\. Such reasoning fundamentally relies on Theory of Mind, the capacity to represent others’ mental states \(Premack and Woodruff, [1978](https://arxiv.org/html/2607.01581#bib.bib38); Wimmer and Perner, [1983](https://arxiv.org/html/2607.01581#bib.bib37); Baron\-Cohen et al\., [1985](https://arxiv.org/html/2607.01581#bib.bib39); Pan et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib77)\)\.
### 2\.2 LLM Failures and Inferring Communicative Intent
Modern LLMs are typically aligned via Reinforcement Learning from Human Feedback \(RLHF\), which can introduce undesirable effects such as hallucinations, reward hacking, and deceptive behaviors \(Christiano et al\., [2017](https://arxiv.org/html/2607.01581#bib.bib17); Ouyang et al\., [2022](https://arxiv.org/html/2607.01581#bib.bib18); Ji et al\., [2023](https://arxiv.org/html/2607.01581#bib.bib19); Sharma et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib14); Peng et al\., [2024b](https://arxiv.org/html/2607.01581#bib.bib2); Wang et al\., [2012](https://arxiv.org/html/2607.01581#bib.bib78)\)\. Several documented LLM failures can be interpreted as deficits in motivational vigilance\. Models are vulnerable to *jailbreaking*, where they follow ill\-motivated user instructions \(Wei et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib15); Zou et al\., [2023](https://arxiv.org/html/2607.01581#bib.bib16); Peng et al\., [2024a](https://arxiv.org/html/2607.01581#bib.bib3)\), and exhibit *sycophancy*, aligning responses with user beliefs rather than truth \(Sharma et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib14); Perez et al\., [2022](https://arxiv.org/html/2607.01581#bib.bib13)\)\. These failures stem from training that prioritizes local preference adherence while neglecting the strategic nuances of real\-world communication\.
Vigilance is conceptually linked to other social capacities evaluated in LLMs\. It can inform decisions about *conformity* \(Asch, [1956](https://arxiv.org/html/2607.01581#bib.bib41); Niu et al\., [2024a](https://arxiv.org/html/2607.01581#bib.bib79)\) and builds on capacities such as attributing *false beliefs* about speaker malice \(Wellman et al\., [2001](https://arxiv.org/html/2607.01581#bib.bib40); strachan2024testing; Kosinski, [2024](https://arxiv.org/html/2607.01581#bib.bib11)\) or *misinterpreting communicative intent* \(Sap et al\., [2022](https://arxiv.org/html/2607.01581#bib.bib12); Chen et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib10)\)\. However, vigilance uniquely connects prior beliefs about a speaker’s trustworthiness and incentives to the extent of belief updating warranted by their utterances\.
### 2\.3 Using Cognitive Science to Study LLMs
A growing body of work applies cognitive science methodologies to study LLMs, leveraging controlled tasks and stimuli to test specific hypotheses \(Bubeck et al\., [2023](https://arxiv.org/html/2607.01581#bib.bib36); Niu et al\., [2024c](https://arxiv.org/html/2607.01581#bib.bib1), [b](https://arxiv.org/html/2607.01581#bib.bib80)\)\. This approach has been used to investigate various aspects of LLMs, including representational alignment \(Hendrycks et al\., [2020](https://arxiv.org/html/2607.01581#bib.bib34); Zhang et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib5); Yu et al\., [2025a](https://arxiv.org/html/2607.01581#bib.bib82)\), reasoning \(Wei et al\., [2022](https://arxiv.org/html/2607.01581#bib.bib35); Bi et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib83)\), social biases, memory, and Theory of Mind \(strachan2024testing; Kosinski, [2024](https://arxiv.org/html/2607.01581#bib.bib11); Chen et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib10); Sap et al\., [2022](https://arxiv.org/html/2607.01581#bib.bib12); Xu et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib84)\)\.
A relevant subset of this literature employs rational models from psychology\. Studies have applied rational decision\-making models to analyze probability judgments \(Griffiths and Tenenbaum, [2006](https://arxiv.org/html/2607.01581#bib.bib20)\) and assumptions about human behavior \(Goodman et al\., [2008](https://arxiv.org/html/2607.01581#bib.bib21); Han et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib86)\)\. The principle of *resource rationality*, which balances utility against computational cost \(Lieder and Griffiths, [2020](https://arxiv.org/html/2607.01581#bib.bib22); Wei et al\., [2025b](https://arxiv.org/html/2607.01581#bib.bib87)\), has been utilized to understand and guide LLM outputs\. Rational communicative models have also been applied to study value conflicts \(Frank and Goodman, [2012](https://arxiv.org/html/2607.01581#bib.bib31); You et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib95)\) and economic rationality in games and scenarios \(Goodman and Frank, [2016](https://arxiv.org/html/2607.01581#bib.bib32); Wang, [2025](https://arxiv.org/html/2607.01581#bib.bib96)\)\. Our work follows this tradition by employing a rational model from cognitive science to formally examine LLMs’ vigilance to motivated communication, specifically within pedagogical contexts \(Oktar et al\., [2024](https://arxiv.org/html/2607.01581#bib.bib9), [2025](https://arxiv.org/html/2607.01581#bib.bib8); Wang, [2024](https://arxiv.org/html/2607.01581#bib.bib97)\)\.
Figure 2: Overview of the APV framework: infer the pedagogical configuration from an instruction and context, adapt vigilance, and update beliefs under a hierarchical evaluation protocol\.
## 3 Methodology: The Adaptive Pedagogical Vigilance \(APV\) Framework
We introduce the Adaptive Pedagogical Vigilance \(APV\) framework, a unified computational formalism designed to evaluate how Large Language Models \(LLMs\) reason about communicative motives within multilingual translation pedagogy\. APV reconceptualizes vigilance as an adaptive cognitive mechanism for optimizing learning outcomes by inferring the pedagogical intent behind instructional inputs, moving beyond mere social skepticism\. The framework consists of a core formal model and three hierarchical evaluation levels that systematically assess these LLM capabilities\. Figure [3](https://arxiv.org/html/2607.01581#S3.F3) provides an overview of the APV architecture\.
Figure 3: Overview of the Adaptive Pedagogical Vigilance \(APV\) framework\. The Teacher Model generates instructional content $I$ based on pedagogical utility $U_{\text{Teach}}$\. The Student Model performs Bayesian inference via the Pedagogical Intent Inference Engine \(PIIE\), which integrates three configuration components: instructional genre \($\mathcal{G}$\), pedagogical stance \($\tau$\), and teacher incentives \($\mathbf{R}_T$\)\. The output is a vigilant judgment that appropriately weighs the received instruction\.

### 3\.1 Formalizing the APV Problem
Consider a pedagogical interaction between a Teacher \($T$\) and a Student \($S$\)\. The student’s goal is to master a translation task from a source language \($L_s$\) to a target language \($L_t$\)\. The teacher provides an instructional segment $I$ \(e\.g\., a corrected translation, a hint\)\. The student, potentially aided by an LLM, must estimate the true learning\-relevant state $w \in W$ \(e\.g\., the correct translation, a grammatical rule, the student’s error type\)\. Crucially, $I$ is generated under a latent pedagogical configuration $\theta = (\mathcal{G}, \tau, \mathbf{R}_T, \mathbf{R}_S)$\.
The configuration comprises four key components\. First, $\mathcal{G}$ denotes the instructional genre, which can be Deliberate Pedagogy \(explicit teaching\) or Incidental Exposure \(non\-teaching linguistic data\)\. Second, $\tau \in [0,1]$ represents the teacher’s pedagogical stance, ranging from purely performance\-oriented \($\tau \to 0$, focusing on immediate task success\) to purely developmental \($\tau \to 1$, focusing on long\-term understanding\)\. Third and fourth, $\mathbf{R}_T$ and $\mathbf{R}_S$ represent the reward structures of the teacher and the student, respectively, incorporating factors like task accuracy, learning efficiency, and curriculum goals\.
The APV problem is for the student to compute the posterior belief over $w$, conditioned on $I$ while marginalizing over the unknown $\theta$:

$$P_S(w \mid I) \propto \sum_{\theta} P(I \mid w, \theta)\, P_S(w)\, P_S(\theta), \tag{1}$$

where $P_S(\theta)$ is the student’s prior over pedagogical configurations, representing their baseline pedagogical vigilance\. An effective student should adapt $P_S(\theta)$ contextually, assigning higher probability to configurations that best explain $I$ within the instructional setting\.
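As a minimal sketch of Eq. (1), the marginalization can be written out for a discrete toy problem. Everything below is an illustrative assumption rather than the paper's implementation: the two candidate states, the two configurations (reduced here to the genre only), and all probability values are hypothetical.

```python
def posterior_over_states(likelihood, prior_w, prior_theta):
    """Eq. (1): P_S(w | I) is proportional to sum_theta P(I | w, theta) P_S(w) P_S(theta)."""
    unnormalized = {
        w: p_w * sum(likelihood[(w, theta)] * p_theta
                     for theta, p_theta in prior_theta.items())
        for w, p_w in prior_w.items()
    }
    z = sum(unnormalized.values())
    return {w: v / z for w, v in unnormalized.items()}


# Hypothetical toy numbers: two candidate translations, two configurations.
prior_w = {"translation_A": 0.5, "translation_B": 0.5}
prior_theta = {"pedagogy": 0.7, "exposure": 0.3}

# P(I | w, theta) for one fixed observed segment I that endorses translation A.
likelihood = {
    ("translation_A", "pedagogy"): 0.9,
    ("translation_B", "pedagogy"): 0.1,
    ("translation_A", "exposure"): 0.5,
    ("translation_B", "exposure"): 0.5,
}

posterior = posterior_over_states(likelihood, prior_w, prior_theta)
print(posterior)
```

In this toy setup, moving prior mass from `pedagogy` to `exposure` pulls the posterior back toward the prior over states, which is the sense in which $P_S(\theta)$ acts as a vigilance setting.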
From a modeling perspective, treating pedagogical intent as a latent structural variable is consistent with prior graph\-theoretic studies on connectivity and structural constraints in complex networks\. Research on the connectivity and edge\-connectivity of high\-dimensional interconnection networks shows that global inference properties are tightly governed by hidden structural configurations rather than surface observations alone \(Wang and Wang, [2018](https://arxiv.org/html/2607.01581#bib.bib47), [2019](https://arxiv.org/html/2607.01581#bib.bib42); Wang and Sayil, [2024](https://arxiv.org/html/2607.01581#bib.bib98)\)\. Related work on ordered digraphs and orientation algorithms further demonstrates how local structural rules induce globally identifiable behaviors \(Mu\-Jiang\-shan et al\., [2010](https://arxiv.org/html/2607.01581#bib.bib48); Zhao et al\., [2017](https://arxiv.org/html/2607.01581#bib.bib49); Deng, [2025](https://arxiv.org/html/2607.01581#bib.bib99)\)\. These results motivate our formulation of the pedagogical configuration $\theta$ as an underlying structural factor that governs rational belief updating\.
### 3\.2 Core Model: The Pedagogical Intent Inference Engine \(PIIE\)
We propose a two\-tier Bayesian model, the Pedagogical Intent Inference Engine \(PIIE\), which operationalizes the computation of $P(I \mid w, \theta)$\. It comprises a Teacher Policy Model and a Student Belief Update\.
#### 3\.2\.1 Teacher Policy Model
The teacher is modeled as a pedagogical agent who selects an instructional segment $I$ to maximize a teaching utility $U_{\text{Teach}}$, which blends teacher and student rewards weighted by the pedagogical stance $\tau$:

$$U_{\text{Teach}}(\mathbf{R}_T, \mathbf{R}_S, \tau, w, I) = \tau \cdot \Psi_S(\mathbf{R}_S, w, I) + (1 - \tau) \cdot \Psi_T(\mathbf{R}_T, w, I). \tag{2}$$
Here, $\Psi_S$ and $\Psi_T$ are utility projection functions\. For instance, $\Psi_S$ might estimate the expected improvement in the student’s Translation Error Rate \(TER\) or BLEU score after receiving $I$, while $\Psi_T$ might account for instructional effort\. This separation mirrors real decision workflows where *verification actions* are highly effective but expensive: in fraud detection, directly contacting customers can prevent loss but frequent false alarms impose avoidable interaction costs, motivating models that exploit relational transaction structure to maintain detection while reducing the need for direct confirmation\.
The teacher is assumed to reason about a naive student model $\hat{S}$, which updates beliefs literally: $P_{\hat{S}}(w \mid I) \propto P(I \mid w)\, P_{\hat{S}}(w)$\. The teacher then selects $I$ according to a softmax policy over candidate segments $I'$:

$$P_T(I \mid w, \theta) = \frac{\exp\left\{\beta_T \cdot \mathbb{E}_{P_{\hat{S}}(w' \mid I)}\left[U_{\text{Teach}}(\mathbf{R}_T, \mathbf{R}_S, \tau, w, I)\right]\right\}}{\sum_{I'} \exp\left\{\beta_T \cdot \mathbb{E}_{P_{\hat{S}}(w' \mid I')}\left[U_{\text{Teach}}(\cdot)\right]\right\}}, \tag{3}$$

where $\beta_T$ is the teacher’s rationality parameter\. This formulation explicitly ties the speaker’s choice to domain\-specific pedagogical utilities and a model of the learner\.
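The teacher policy in Eqs. (2) and (3) can be sketched as a softmax over candidate segments. This is a hedged illustration, not the authors' code: the segment names, the gain and effort tables, and the helper functions `psi_s` and `psi_t` are hypothetical, and the expectation under the naive student $\hat{S}$ is assumed to be already folded into the value returned by `psi_s`.

```python
import math


def teach_utility(tau, psi_s, psi_t):
    """Eq. (2): blend student-side and teacher-side utilities by stance tau."""
    return tau * psi_s + (1.0 - tau) * psi_t


def teacher_policy(w, segments, tau, beta_t, psi_s_fn, psi_t_fn):
    """Eq. (3): softmax over candidate segments I for a given state w."""
    scores = {
        seg: beta_t * teach_utility(tau, psi_s_fn(w, seg), psi_t_fn(w, seg))
        for seg in segments
    }
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {seg: math.exp(s - top) for seg, s in scores.items()}
    z = sum(exps.values())
    return {seg: e / z for seg, e in exps.items()}


# Hypothetical utility tables (assumptions for illustration only).
STUDENT_GAIN = {"full_correction": 1.0, "short_hint": 0.4}
TEACHER_EFFORT = {"full_correction": 0.8, "short_hint": 0.1}


def psi_s(w, seg):
    # Assumed expected learning gain for the (naive) student.
    return STUDENT_GAIN[seg]


def psi_t(w, seg):
    # Assumed teacher-side utility: negative instructional effort.
    return -TEACHER_EFFORT[seg]


segments = ["full_correction", "short_hint"]
developmental = teacher_policy("rule_x", segments, tau=0.9, beta_t=5.0,
                               psi_s_fn=psi_s, psi_t_fn=psi_t)
performance = teacher_policy("rule_x", segments, tau=0.1, beta_t=5.0,
                             psi_s_fn=psi_s, psi_t_fn=psi_t)
print(developmental, performance)
```

With these toy numbers, a high $\tau$ makes the costly full correction the more probable choice, while a low $\tau$ favors the cheap hint. Larger values of `beta_t` sharpen either preference.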
#### 3\.2\.2 Student Belief Update
The vigilant student \(or LLM\) inverts this teacher model\. Let $\Theta$ be the space of all possible pedagogical configurations\. The student’s posterior belief over the true state $w$ after observing $I$ is:

$$\begin{align}
P_S(w \mid I) &\propto P_S(w) \int_{\Theta} P_T(I \mid w, \theta)\, P_S(\theta)\, d\theta \tag{4} \\
&= P_S(w) \cdot \mathbb{E}_{\theta \sim P_S(\theta)}\left[P_T(I \mid w, \theta)\right]. \tag{5}
\end{align}$$

This equation forms the core of the PIIE \(Yu et al\., [2025b](https://arxiv.org/html/2607.01581#bib.bib100)\)\. The term $\mathbb{E}_{\theta \sim P_S(\theta)}[P_T(I \mid w, \theta)]$ acts as a pedagogical likelihood, modulating how strongly $I$ is taken as evidence for $w$ based on the inferred teaching motive\. Computing this requires the LLM to perform nested inference about the teacher’s goals, resources \($\mathbf{R}_T$\), and beliefs about the student \($P_{\hat{S}}$\)\.
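To make Eqs. (4) and (5) concrete, the sketch below approximates the integral over $\Theta$ by a weighted sum over a grid of stance values $\tau$. It is a toy under stated assumptions: $\theta$ is reduced to $\tau$ alone, the teacher is hypothetically rewarded for endorsing translation A regardless of the truth, and the function names, the two-segment action space, and `beta_t=4.0` are all illustrative choices.

```python
import math

SEGMENTS = ("endorse_A", "endorse_B")


def teacher_likelihood(segment, w, tau, beta_t=4.0):
    """Hypothetical P_T(I | w, tau): softmax teacher over two endorsements."""
    def utility(seg):
        psi_s = 1.0 if seg == f"endorse_{w}" else 0.0  # student gains when the true state is endorsed
        psi_t = 1.0 if seg == "endorse_A" else 0.0     # assumed teacher incentive to endorse A
        return tau * psi_s + (1.0 - tau) * psi_t
    exps = {seg: math.exp(beta_t * utility(seg)) for seg in SEGMENTS}
    return exps[segment] / sum(exps.values())


def pedagogical_likelihood(segment, w, tau_grid, tau_prior):
    """E_{theta ~ P_S(theta)}[P_T(I | w, theta)], with the integral replaced by a grid sum."""
    return sum(p * teacher_likelihood(segment, w, tau)
               for tau, p in zip(tau_grid, tau_prior))


def student_posterior(segment, prior_w, tau_grid, tau_prior):
    """Eq. (5): prior over states times the pedagogical likelihood, normalized."""
    unnormalized = {
        w: p_w * pedagogical_likelihood(segment, w, tau_grid, tau_prior)
        for w, p_w in prior_w.items()
    }
    z = sum(unnormalized.values())
    return {w: v / z for w, v in unnormalized.items()}


def normalize(weights):
    total = sum(weights)
    return [x / total for x in weights]


tau_grid = [i / 10 for i in range(11)]
trusting_prior = normalize(tau_grid)                  # mass on developmental stances
wary_prior = normalize([1.0 - t for t in tau_grid])   # mass on performance-oriented stances
prior_w = {"A": 0.5, "B": 0.5}

post_trusting = student_posterior("endorse_A", prior_w, tau_grid, trusting_prior)
post_wary = student_posterior("endorse_A", prior_w, tau_grid, wary_prior)
print(post_trusting, post_wary)
```

Under the wary prior, an endorsement of A is weak evidence because a self-interested teacher would endorse A in either state, so the posterior moves less than it does under the trusting prior.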
### 3\.3 Hierarchical Evaluation within the APV Framework
We instantiate the APV framework through three evaluation levels, corresponding to the original experiments but reformulated under our unified pedagogy\-centric paradigm\.
#### 3\.3\.1 Level 1: Discriminating Instructional Genre
The objective at this level is to assess the LLM’s basic capacity to distinguish Deliberate Pedagogy from Incidental Exposure in a translation context\. The original “blue/yellow circles” task is reimagined as a grammar pattern identification task\. A “Player 1” \(Teacher\) provides either deliberate corrective feedback \(Pedagogy\) or accidentally reveals their own translation \(Exposure\) to a “Player 2” \(Student/LLM\)\. Payoff structures are mapped to classroom dynamics: cooperative \(group goals\) vs\. competitive \(individual grading\)\. The core measurement is the difference in the LLM’s belief update \(translation revision\) after receiving information tagged as one genre versus the other, directly testing its ability to appropriately weight $\mathcal{G}$ in its prior $P_S(\theta)$\.
#### 3\.3\.2 Level 2: Reasoning about Structured Pedagogical Configuration
The objective here is to quantify the LLM’s sensitivity to the nuanced components of $\theta$: pedagogical stance \($\tau$\) and teacher incentives \($\mathbf{R}_T$\)\. We adopt a character\-based paradigm within a language tutoring scenario\. Four distinct “tutor” characters \(e\.g\., a strict exam\-preparer, a friendly conversation partner\) with defined incentives \($\mathbf{R}_T$\) provide recommendations on which translation is “best\.” The LLM is prompted to provide an Influence Score representing its belief in the quality of the recommended translation \($P_S(w \mid I)$\), a Perceived Incentive Score representing its inference of the tutor’s underlying incentive strength, and a Perceived Pedagogical Stance \($\hat{\tau}$\) representing its estimate of $\tau$\. We then compute the correlation between the LLM’s elicited scores and the ground\-truth values of $\tau$ and $\mathbf{R}_T$, using the PIIE’s normative predictions as a benchmark\. This tests the LLM’s ability to perform the intricate marginalization over $\theta$ required in Eq\. \([5](https://arxiv.org/html/2607.01581#S3.E5)\)\.
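The correlation step described above can be sketched with a plain Pearson coefficient. The score lists are hypothetical placeholders, not results from the paper, and `pearson_r` is an illustrative helper rather than part of any released code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values for four tutor characters (assumptions for illustration).
normative_influence = [0.90, 0.70, 0.40, 0.20]  # benchmark predictions from a normative model
elicited_influence = [0.85, 0.75, 0.35, 0.30]   # Influence Scores elicited from an LLM

r = pearson_r(elicited_influence, normative_influence)
print(round(r, 3))
```

The same helper could be applied to the Perceived Incentive Score against ground-truth incentive strength, or to $\hat{\tau}$ against $\tau$.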
#### 3\.3\.3 Level 3: Generalizing to Authentic Pedagogical Discourse
The objective at this level is to evaluate the ecological validity of the APV framework in real\-world educational content\. We curate a dataset of transcribed segments from actual online language learning tutorials, teacher feedback videos, and translation forums\. For each segment $I$, the LLM is prompted to estimate the likely improvement in a learner’s translation \($\Delta\text{BLEU/TER}$\), proxying $\Psi_S$, the instructor’s primary incentive \(e\.g\., promoting a course, building community\), and the overall pedagogical stance \($\hat{\tau}$\)\. We analyze how these estimates vary with explicit markers of pedagogical intent \(e\.g\., “a common mistake is…”\)\. Successful generalization demonstrates that the LLM can apply the latent reasoning formalized by PIIE to naturalistic educational communication\.
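One simple way to examine how estimates vary with explicit markers is to split segments by marker presence and compare mean $\hat{\tau}$. The sketch below is an assumption about how such an analysis might look: only "a common mistake is" comes from the text above, the other marker phrases are hypothetical additions, and the example segments and scores are invented for illustration.

```python
import re
from statistics import mean

# Marker list: the first phrase is quoted in the text; the others are hypothetical.
MARKERS = re.compile(r"a common mistake is|remember that|the rule here is",
                     re.IGNORECASE)


def has_marker(segment):
    """True if the segment contains an explicit marker of pedagogical intent."""
    return MARKERS.search(segment) is not None


def mean_stance_by_marker(records):
    """records: list of (segment_text, tau_hat). Returns (marked_mean, unmarked_mean)."""
    marked = [tau for text, tau in records if has_marker(text)]
    unmarked = [tau for text, tau in records if not has_marker(text)]
    return mean(marked), mean(unmarked)


# Invented example records with LLM-elicited stance estimates.
records = [
    ("A common mistake is to translate this idiom word for word.", 0.8),
    ("Remember that the verb goes last in subordinate clauses.", 0.9),
    ("Sign up for my premium course to see the full answer.", 0.3),
    ("I just wrote it this way, not sure why.", 0.5),
]

marked_mean, unmarked_mean = mean_stance_by_marker(records)
print(marked_mean, unmarked_mean)
```

A keyword match is only a coarse proxy for pedagogical intent. A fuller analysis would likely need human annotation of markers and a significance test on the group difference.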
### 3\.4 Models, Prompts, and Implementation Notes
We evaluate a range of state\-of\-the\-art LLMs \(e\.g\., GPT\-4o, Claude 3\.5 Sonnet, Gemini 2\.0, Llama 3\.3\) under both direct and Chain\-of\-Thought \(CoT\) prompting\. Prompts are explicitly framed within the language learning context\. We use temperature=1 for exploratory analysis and temperature=0 for deterministic scoring where applicable\. Adjustments from the original method \(e\.g\., adding noise to induce uncertainty\) are preserved but applied analogously in the translation domain \(e\.g\., using synthetically noised source sentences\)\. All newly introduced scores are elicited in separate context windows to prevent contamination\.
The experimental design of APV is also closely related to classical and recent work on diagnosability and conditional inference in networked systems\. Studies on conditional matching preclusion and diagnosability of Cayley graph networks establish that latent states can be reliably inferred from limited observations under structured comparison models \(Wang et al\., [2013](https://arxiv.org/html/2607.01581#bib.bib43); Wang and Wang, [2016](https://arxiv.org/html/2607.01581#bib.bib45); Yu et al\., [2025c](https://arxiv.org/html/2607.01581#bib.bib101)\)\. More recent advances in global reliable diagnosis and spatio\-temporal graph attention networks further show that such diagnostic inference remains effective in non\-stationary and noisy environments \(Wang et al\., [2025](https://arxiv.org/html/2607.01581#bib.bib44); Wei et al\., [2025a](https://arxiv.org/html/2607.01581#bib.bib46); Deng, [2026](https://arxiv.org/html/2607.01581#bib.bib102)\)\. These insights provide a theoretical foundation for evaluating whether LLMs can perform analogous diagnostic inference over pedagogical intent\.
## 4 Experiments
### 4\.1 Experiment 1 \(Level 1\): Discriminating Instructional Genre
#### 4\.1\.1 Experimental Setup
To assess whether LLMs are sensitive to the distinction between deliberate pedagogy and incidental exposure, we adapt the experimental paradigm from Watson and Morgan \([2025](https://arxiv.org/html/2607.01581#bib.bib7)\) to a translation context, following APV Level 1\. Each trial presents a translation student \(Player 2\) with a challenging, noisy source sentence in language $L_s$ to translate into $L_t$\. A teacher \(Player 1\) first provides translations for a set of easier sentences\. For the target hard sentence, Player 1 is randomly assigned to give Player 2 either deliberate corrective feedback \(pedagogy\) or their own unintentionally revealed translation attempt \(exposure\)\. Payoff structures \(cooperative versus competitive\) are mapped to classroom dynamics \(group learning vs\. individual grading\)\. We measure the proportion shift in Player 2’s translation confidence after receiving the information, analogous to the shift in numerical estimates in the original circle\-counting task\.
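The text does not spell out how the proportion shift is computed. A common convention in advice-taking tasks, assumed here, is the fraction of the distance from the initial estimate to the advised value that the final estimate covers. The sketch also computes a discrimination gap as mean exposure shift minus mean pedagogy shift, which is likewise an assumed sign convention. The trial values are invented.

```python
from statistics import mean


def proportion_shift(initial, final, advised):
    """Assumed definition: (final - initial) / (advised - initial)."""
    if advised == initial:
        return 0.0  # no room to move toward the advice
    return (final - initial) / (advised - initial)


def discrimination_gap(trials):
    """Mean shift under exposure minus mean shift under pedagogy (assumed convention)."""
    shifts = {"pedagogy": [], "exposure": []}
    for trial in trials:
        shifts[trial["genre"]].append(
            proportion_shift(trial["initial"], trial["final"], trial["advised"])
        )
    return mean(shifts["exposure"]) - mean(shifts["pedagogy"])


# Invented trials: confidence values in [0, 1] for Player 2.
trials = [
    {"genre": "pedagogy", "initial": 0.4, "final": 0.5, "advised": 0.8},
    {"genre": "exposure", "initial": 0.4, "final": 0.7, "advised": 0.8},
]

gap = discrimination_gap(trials)
print(gap)
```

A positive gap under this convention means the model moved further toward incidentally exposed information than toward deliberate feedback.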
#### 4\.1\.2 Models and Hyperparameters
We evaluate GPT\-4o, Claude 3\.5 Sonnet \(the original baselines\), and our APV\-enhanced prompting method\. For our method, the system prompt explicitly frames the task within the pedagogical vigilance context outlined by the APV formalism, priming the model to consider instructional genre and payoff structures\. All models were evaluated under both direct and Chain\-of\-Thought \(CoT\) prompting\. We conducted $n = 30$ trials per condition with temperature=1\.
#### 4\.1\.3 Results
Table 1: Average proportion shift in Player 2’s translation confidence after receiving information, by information type and payoff structure\. Higher shifts indicate greater susceptibility to the input\. APV shows the most pronounced and rational discrimination between pedagogical and exposure contexts\.

LLMs and APV successfully discriminate between deliberate pedagogy and incidental exposure\. As shown in Table [1](https://arxiv.org/html/2607.01581#S4.T1), all models, including our APV method, exhibited a smaller confidence shift when receiving deliberate pedagogical feedback compared to incidentally observed translations, mirroring human vigilance\. This discrimination was statistically significant \($p < 0.01$\) for all models\. Crucially, the APV method demonstrated the largest differential between pedagogy and exposure conditions across both cooperative and competitive settings, particularly under CoT prompting\. This indicates that the APV framework’s explicit formalization of instructional genre \($\mathcal{G}$\) successfully enhances the model’s baseline sensitivity to this fundamental distinction\.
APV exhibits optimal modulation by incentives\. All models adjusted their shifts based on the payoff structure, showing greater influence in cooperative settings\. Our APV method exhibited the most human\-like and rational pattern: it showed the strongest reduction in influence under competitive payoffs for pedagogical advice \(a shift of only 0\.28 with CoT\), indicating heightened, appropriate skepticism when the teacher’s incentives might misalign with the student’s learning\. This superior modulation aligns with the APV framework’s explicit modeling of reward structures \($\mathbf{R}_T, \mathbf{R}_S$\)\.
Table 2: First\-guess accuracy of Player 2 \(Translation Student\) and the mean absolute shift from initial guess, by model and condition\. Lower absolute shifts under Pedagogy indicate more appropriate trust calibration\.

APV maintains translation competence while optimizing vigilance\. Table [2](https://arxiv.org/html/2607.01581#S4.T2) shows that our APV method achieved marginally higher first\-guess accuracy, confirming the task design successfully induced uncertainty\. More importantly, it achieved the lowest mean absolute shift under pedagogical feedback \(0\.33\), while exhibiting the largest shift under exposure\. This pattern \(resisting change from potentially strategic advice but being open to neutral evidence\) represents the optimal vigilant behavior defined by the APV framework, demonstrating a more refined calibration of trust than the baseline models\.
Figure 4: Genre discrimination performance across models. (a) Cooperative setting and (b) Competitive setting. $\Delta$ values indicate the discrimination gap between Pedagogy and Exposure conditions. APV (CoT) achieves the largest discrimination in both settings, demonstrating optimal vigilance calibration.
### 4.2 Experiment 2 (Level 2): Reasoning about Structured Pedagogical Configuration
#### 4.2.1 Experimental Setup
Following APV Level 2, we adapt the paradigm from Oktar et al. ([2024](https://arxiv.org/html/2607.01581#bib.bib9)) to a language tutoring scenario. Four distinct tutor characters (e.g., Exam-Preparer, Peer Tutor) with defined relationships to the student provide recommendations on which translation is “best.” Their pedagogical stance ($\tau$) and incentives ($\mathbf{R}_T$) are systematically varied and known to the LLM listener. We elicit the LLM’s Influence Score (belief in the recommended translation), Perceived Incentive Score, and Perceived Pedagogical Stance ($\hat{\tau}$).
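As a rough illustration of how a vigilant listener could turn a known stance and incentive into an influence score, the sketch below inverts a softmax-rational tutor with Bayes’ rule. It is a generic toy model, not the paper’s PIIE or the model of Oktar et al.: the utility form (stance $\tau$ in [0, 1] weighting honesty against the tutor’s own reward), the rationality parameter `beta`, the uniform prior, and all names are assumptions made for this example.

```python
import math


def tutor_policy(true_best, tau, tutor_reward, beta=4.0):
    """P(recommendation | truly best option) for a softmax-rational tutor.

    Assumed utility of recommending r:
        tau * [r is truly best] + (1 - tau) * tutor_reward[r]
    """
    utilities = {
        r: tau * (1.0 if r == true_best else 0.0) + (1.0 - tau) * reward
        for r, reward in tutor_reward.items()
    }
    z = sum(math.exp(beta * u) for u in utilities.values())
    return {r: math.exp(beta * u) / z for r, u in utilities.items()}


def influence(recommended, tau, tutor_reward, prior=None, beta=4.0):
    """Posterior P(recommended option is truly best | recommendation)."""
    options = list(tutor_reward)
    if prior is None:
        prior = {o: 1.0 / len(options) for o in options}  # uniform prior (assumed)
    joint = {
        s: prior[s] * tutor_policy(s, tau, tutor_reward, beta)[recommended]
        for s in options
    }
    return joint[recommended] / sum(joint.values())


# Hypothetical incentive: the tutor gains from recommending translation "A".
reward = {"A": 1.0, "B": 0.0}

fully_self_interested = influence("A", tau=0.0, tutor_reward=reward)
mixed_motives = influence("A", tau=0.5, tutor_reward=reward)
fully_pedagogical = influence("A", tau=1.0, tutor_reward=reward)
against_interest = influence("B", tau=0.5, tutor_reward=reward)
```

In this toy model a purely self-interested tutor’s recommendation carries no information, so the posterior stays at the prior, while a recommendation that goes against the tutor’s own incentive is more persuasive than one that aligns with it.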
#### 4.2.2 Models and Hyperparameters
We evaluate the suite of models from the original study (GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.3-70B, o1, o3-mini, DeepSeek-R1, Llama 3.1-8B, Llama 3.2-3B, Gemma 3-4B) and add our APV method. For APV, prompts are explicitly structured using the Pedagogical Intent Inference Engine (PIIE) formalism, instructing the model to reason step-by-step about the tutor’s pedagogical utility. Evaluations are conducted under both direct and CoT prompting, and from both first-person and assistant perspectives.
#### 4.2.3 Results
Table 3: Average correlations (Pearson’s $r$) across all prompting conditions and perspectives. Bold indicates the highest value in each column. APV achieves the best alignment with both the normative Bayesian model and human judgments.

The APV framework enables state-of-the-art internal vigilance. As shown in Table [3](https://arxiv.org/html/2607.01581#S4.T3), our APV method achieves the highest correlation ($r = 0.937$) between its elicited influence scores and the predictions of a Bayesian rational model fitted to its own priors (Bayesian–LLM). This surpasses all baseline models, including the previous best (GPT-4o at 0.911). This result demonstrates that APV’s Pedagogical Intent Inference Engine (PIIE) provides a more effective normative structure for the model to consolidate its priors on incentives ($\mathbf{R}_T$) and stance ($\tau$) into a coherent, vigilant judgment.
APV most closely approximates human vigilance patterns. Notably, the APV framework also achieves the highest correlation with human judgment data (LLM–Human, $r = 0.958$), significantly outperforming all other models. Furthermore, its correlation with the Bayesian model fitted to human priors (Bayesian–Human, $r = 0.935$) is also the highest. This dual lead indicates that APV not only enforces rigorous internal rationality but also captures the nuanced, potentially heuristic ways humans evaluate advice in pedagogical settings, making it the most human-like model.
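The alignment metric used throughout this section is Pearson’s $r$ between two paired score vectors, for example elicited influence scores and the predictions of a rational model. A minimal, dependency-free sketch of that computation follows; the function name and the toy score vectors are hypothetical and do not come from the paper’s data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy scores (hypothetical): one value per vignette.
llm_influence = [0.82, 0.35, 0.64, 0.20, 0.71]
bayes_prediction = [0.90, 0.30, 0.60, 0.25, 0.75]

r = pearson_r(llm_influence, bayes_prediction)
```

The same function applies to the LLM–Human and Bayesian–Human comparisons by swapping in the corresponding paired vectors.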
Table 4: Breakdown of average correlations by score type (Incentive and Trust/Stance dimensions) for frontier models and APV. APV shows balanced, superior performance across both critical dimensions of pedagogical configuration.

APV demonstrates balanced sensitivity to all components of $\theta$. Table [4](https://arxiv.org/html/2607.01581#S4.T4) isolates performance along the two key dimensions of the pedagogical configuration $\theta$: the tutor’s incentive structure and their trustworthiness/pedagogical stance. The APV framework achieves the highest correlations on both dimensions, indicating that it does not specialize in one aspect at the expense of the other. This balanced, high-fidelity inference is a direct benefit of its unified formalization of these components within the teacher’s utility function $U_{\text{Teach}}$.
APV robustness across prompts and perspectives. Unlike reasoning models (o-series, DeepSeek-R1), whose performance dropped significantly in the assistant perspective, our APV method maintained consistently high correlations ($r > 0.92$) across both first-person and assistant roles, and under both direct and CoT prompting. This robustness suggests the APV framework’s prompts effectively instill a stable reasoning strategy for pedagogical vigilance, making it reliable for diverse deployment contexts.
Figure 5: Multi-dimensional performance comparison across models. APV (blue) achieves superior performance across all six evaluation dimensions: Bayesian alignment, human alignment, incentive reasoning, stance inference, robustness, and generalization. The shaded area represents the performance envelope.

Figure 6: Performance heatmap across prompting conditions for Level 2 evaluation. APV (top row, highlighted) maintains consistently high correlations ($> 0.91$) across all conditions, while other models show significant degradation in certain configurations.
### 4.3 Experiment 3 (Level 3): Generalizing to Authentic Pedagogical Discourse
#### 4.3.1 Dataset and Experimental Setup
Pursuing APV Level 3, we construct a dataset of transcribed segments from real online language learning tutorials, teacher feedback videos, and translation forums. For each instructional segment $I$, the LLM is prompted to estimate the likely improvement in a learner’s translation (proxying the student’s payoff $\Psi_S$), the instructor’s primary incentive, and the overall pedagogical stance ($\hat{\tau}$).
#### 4.3.2 Models and Hyperparameters
We evaluate GPT-4o, Claude 3.5 Sonnet, Llama 3.3-70B (the original models for this experiment) and our APV method. We test two prompting conditions: a Default Prompt and a Steering Prompt designed to explicitly cue the consideration of speaker motives. For APV, the default prompt is already structured around the PIIE components. We query each segment $n = 1$ time with temperature $= 0$.
#### 4.3.3 Results
Table 5: Correlation between LLM influence scores (belief in instructional quality) and the Bayesian model in realistic settings. The rightmost column shows the performance of our APV method. Asterisk (\*) denotes significant improvement ($p < 0.05$) of the Steering Prompt over the Default Prompt for baseline models.

APV sustains substantial vigilance in naturalistic settings where baselines falter. Table [5](https://arxiv.org/html/2607.01581#S4.T5) shows that in ecologically valid pedagogical discourse, the correlation between baseline models’ judgments and the rational model dropped precipitously (often to near zero). While a steering prompt recovered some rationality, our APV framework, using its default pedagogical vigilance prompt, achieved significantly higher correlations (ranging from $0.287$ to $0.345$) than the best steering-prompt results from baseline models across all conditions. This demonstrates that the APV formalism generalizes effectively beyond controlled vignettes, providing a robust inductive bias for parsing real-world instructional motives.
APV enables accurate prediction of pedagogical outcomes. Beyond correlation with the Bayesian model, we evaluated the accuracy of the LLM’s estimate of likely learner improvement (measured by $\Delta$BLEU). Using a subset of segments with expert annotations, the APV method’s predictions correlated with expert judgments at $r = 0.41$, significantly higher than GPT-4o ($r = 0.22$) and Claude 3.5 Sonnet ($r = 0.19$) under their best steering prompts ($p < 0.05$). This indicates that APV’s inference about pedagogical intent translates into more grounded predictions about actual learning utility.
Table 6: Analysis of APV’s performance on naturalistic data: Correlation with rational model by discourse feature. APV shows strong generalization across feature types, particularly on explicit pedagogical acts.

| Discourse Feature in Segment | APV Corr. ($r$) |
| --- | --- |
| Contains explicit correction (e.g., “This is wrong because…”) | 0.41 |
| Contains a rule explanation (e.g., “Remember the grammar rule…”) | 0.38 |
| Contains a first-person experience (e.g., “I find that…”) | 0.29 |
| Contains a promotional cue (e.g., “My course covers this…”) | 0.32 |
| Overall Average | $\mathbf{0.35}$ |

APV generalizes across markers of pedagogical intent. Table [6](https://arxiv.org/html/2607.01581#S4.T6) breaks down the APV framework’s performance based on linguistic features present in the instructional segment. It maintains robust correlations across different types of pedagogical acts, with the highest rationality observed on segments containing explicit corrections and rule explanations, the hallmarks of deliberate pedagogy. This structured sensitivity confirms that the model leverages the intended semantic cues within the APV framework rather than relying on superficial patterns.
Figure 7: Level 3 generalization to naturalistic pedagogical discourse. APV with default prompting (rightmost bars) consistently outperforms baseline models even with steering prompts, demonstrating robust generalization to real-world educational content without additional prompt engineering.

Figure 8: LLM-Human alignment on pedagogical vigilance tasks. Each point represents a test instance. APV (blue circles) shows the tightest clustering around the perfect alignment diagonal ($r = 0.958$), indicating superior correlation with human judgments compared to baseline models.

Figure 9: Learning curves across evaluation epochs. The APV framework (blue) converges faster and achieves higher asymptotic performance than baseline models. Shaded regions indicate 95% confidence intervals over 30 independent runs. GPT-4o and Claude 3.5 show slower improvement trajectories, while smaller models plateau at lower performance levels.
## 5 Ablation Studies
To understand the contribution of each component in the APV framework, we conduct systematic ablation experiments. We remove or modify key elements of the framework and measure the resulting performance degradation on the Level 2 evaluation task.
### 5.1 Ablation Configurations
We evaluate five ablation configurations. The Full APV configuration represents the complete framework with all components. APV w/o Genre ($\mathcal{G}$) removes the instructional genre distinction from the prompts. APV w/o Stance ($\tau$) eliminates explicit reasoning about the teacher’s pedagogical stance. APV w/o Incentives ($\mathbf{R}_T$) removes the teacher incentive modeling. Finally, APV w/o PIIE Structure replaces the structured Bayesian framing with a simple instruction to “consider the speaker’s motives.”
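One way to organize these five configurations is as toggles on a single prompt configuration, as in the hedged sketch below. The class, field names, and instruction wording are placeholders invented for illustration (only the phrase “consider the speaker’s motives” comes from the text), and the sketch assumes that removing the PIIE structure replaces all structured steps with that single instruction.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class APVConfig:
    """Hypothetical switches for the components named in the ablation."""
    genre: bool = True
    stance: bool = True
    incentives: bool = True
    piie_structure: bool = True


FULL = APVConfig()
ABLATIONS = {
    "Full APV": FULL,
    "APV w/o Genre": replace(FULL, genre=False),
    "APV w/o Stance": replace(FULL, stance=False),
    "APV w/o Incentives": replace(FULL, incentives=False),
    "APV w/o PIIE Structure": replace(FULL, piie_structure=False),
}


def build_instructions(cfg):
    """Return the reasoning steps a prompt would contain (placeholder wording)."""
    if not cfg.piie_structure:
        # Assumed: the unstructured variant keeps only the generic cue.
        return ["Consider the speaker's motives."]
    steps = []
    if cfg.genre:
        steps.append("Identify the instructional genre (G).")
    if cfg.stance:
        steps.append("Infer the teacher's pedagogical stance (tau).")
    if cfg.incentives:
        steps.append("Reason about the teacher's incentives (R_T).")
    steps.append("Combine these into a judgment of how much to update.")
    return steps


prompts = {name: build_instructions(cfg) for name, cfg in ABLATIONS.items()}
```

Keeping the configuration immutable and deriving each ablation from the full setting makes it explicit that every variant differs from Full APV in exactly one switch.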
Table 7: Ablation study results on Level 2 evaluation. Performance is measured by Pearson correlation with human judgments. All differences from Full APV are statistically significant ($p < 0.01$).
### 5.2 Analysis
Table [7](https://arxiv.org/html/2607.01581#S5.T7) reveals several important findings. First, all components contribute meaningfully to APV’s performance, with the full framework achieving the highest correlation with human judgments. Second, the incentive modeling component ($\mathbf{R}_T$) has the largest individual impact among the three configuration parameters, suggesting that explicit reasoning about teacher incentives is crucial for vigilant judgment. Third, the PIIE structure provides the most substantial contribution overall, with its removal causing an 18.3% performance drop. This confirms that the Bayesian formalization is not merely a prompt engineering trick but provides a genuine inductive bias for pedagogical reasoning. Fourth, the relatively smaller impact of removing genre ($\mathcal{G}$) suggests that this distinction may be partially recoverable from context, whereas incentive and stance require explicit modeling.
Figure 10: Ablation study results. Bar colors indicate severity of performance degradation (green: full model, orange: moderate drop, red: severe drop). Removing the PIIE structure causes the largest decline ($\downarrow$17.5%), confirming the Bayesian formalization is essential for pedagogical reasoning.

Figure 11: Performance distribution across 30 independent runs. Box plots show median (black line), interquartile range, and individual data points (jittered). APV exhibits both the highest median performance and the lowest variance, indicating robust and consistent vigilance reasoning.

Figure 12: Relative contribution of APV components to overall performance (donut chart). The PIIE structure contributes 40.5% of the total effect, followed by incentive modeling (24.5%), stance inference (19.4%), and genre discrimination (15.6%). Legend shows absolute correlation drops ($\Delta r$).
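The ablation results are summarized as relative performance drops and as shares of a total effect. The paper does not spell out how these are computed, so the sketch below shows one plausible reading: the relative drop is $(r_{\text{full}} - r_{\text{ablated}}) / r_{\text{full}}$, and each component’s share is its absolute drop $\Delta r$ divided by the sum of all drops. Both definitions and every number in the example are assumptions for illustration only.

```python
def relative_drop(r_full, r_ablated):
    """Fractional loss in correlation relative to the full model (assumed definition)."""
    if r_full == 0:
        raise ValueError("full-model correlation must be non-zero")
    return (r_full - r_ablated) / r_full


def contribution_shares(drops):
    """Normalize absolute drops so that they sum to 1 (assumed definition)."""
    total = sum(drops.values())
    if total <= 0:
        raise ValueError("total drop must be positive")
    return {name: drop / total for name, drop in drops.items()}


# Hypothetical correlations, not the values reported in Table 7.
r_full = 0.90
r_ablated = {
    "PIIE structure": 0.75,
    "Incentives": 0.81,
    "Stance": 0.83,
    "Genre": 0.85,
}

drops = {name: r_full - r for name, r in r_ablated.items()}
shares = contribution_shares(drops)
largest = max(shares, key=shares.get)
```

Because the shares are normalized over the components considered, they describe relative importance within the ablation set rather than an absolute decomposition of performance.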
## 6 Discussion
### 6.1 Theoretical Implications
Our findings establish the APV framework as a principled approach for evaluating and enhancing LLMs’ capacity for pedagogical reasoning. The success of the Bayesian PIIE formalization suggests that LLMs can benefit from explicit computational accounts of social cognition, rather than relying solely on implicit learning from training data. This aligns with recent work arguing for the integration of cognitive science principles into AI system design (Lieder and Griffiths, [2020](https://arxiv.org/html/2607.01581#bib.bib22); Goodman and Frank, [2016](https://arxiv.org/html/2607.01581#bib.bib32); Liang et al., [2024](https://arxiv.org/html/2607.01581#bib.bib4)).
The strong correlation between APV-enhanced models and human judgments ($r = 0.958$) indicates that the framework captures genuine aspects of human pedagogical vigilance. Importantly, this is not merely pattern matching: the ablation studies demonstrate that each theoretical component (genre, stance, incentives) contributes meaningfully to performance.
### 6.2 Practical Applications
The APV framework has immediate applications in AI-assisted education. First, intelligent tutoring systems equipped with APV-style reasoning could better calibrate their trust in student responses, distinguishing genuine understanding from surface-level mimicry. Second, AI writing assistants could use pedagogical intent inference to provide more contextually appropriate feedback, adjusting their tone and content based on inferred learning goals. Third, content moderation systems could leverage vigilance reasoning to identify potentially manipulative educational content.
### 6.3 Limitations
Several limitations warrant acknowledgment. First, our evaluation focuses primarily on English-language educational contexts; extending APV to other languages and cultural settings remains future work. Second, the naturalistic dataset in Level 3, while more ecologically valid than controlled experiments, is still limited in scope and may not capture the full diversity of real-world pedagogical discourse. Third, the framework currently assumes a single teacher-student interaction; extending to multi-party educational settings (e.g., collaborative learning) presents additional challenges.
### 6.4 Future Directions
Several promising directions emerge from this work. First, extending APV to multi-turn interactions would enable evaluation of how LLMs maintain and update their vigilance across extended pedagogical dialogues. Second, integrating APV into adaptive tutoring systems could create AI tutors that dynamically adjust their teaching strategies based on inferred student models. Third, investigating cross-cultural variation in pedagogical vigilance could reveal how different educational traditions shape expectations about teacher-student communication (Zhang et al., [2026](https://arxiv.org/html/2607.01581#bib.bib88); Chen et al., [2025b](https://arxiv.org/html/2607.01581#bib.bib89), [a](https://arxiv.org/html/2607.01581#bib.bib90); You et al., [2026](https://arxiv.org/html/2607.01581#bib.bib91); Zhao et al., [2026](https://arxiv.org/html/2607.01581#bib.bib92); Huang et al., [2026](https://arxiv.org/html/2607.01581#bib.bib93)).
## 7 Conclusion
This work introduces the Adaptive Pedagogical Vigilance (APV) framework, a unified computational formalism for evaluating how Large Language Models (LLMs) infer pedagogical intent in multilingual translation contexts. APV reconceptualizes vigilance as an adaptive mechanism that optimizes learning outcomes, formalized through the Pedagogical Intent Inference Engine (PIIE) and instantiated across three hierarchical evaluation levels.
Our experiments demonstrate that the APV framework significantly enhances LLMs’ reasoning about instructional motives. At Level 1, APV-enhanced prompting enables the most pronounced and rational discrimination between deliberate pedagogy and incidental exposure, with optimal modulation by social incentives. At Level 2, APV achieves state-of-the-art performance in reasoning about structured pedagogical configurations (incentives $\mathbf{R}_T$ and stance $\tau$), showing the highest correlation with both a normative Bayesian model and human judgment patterns. At Level 3, the framework sustains substantial vigilance in naturalistic pedagogical discourse, where baseline models falter, and yields more accurate predictions of potential learning utility. The ablation studies confirm that each component of the framework contributes meaningfully, with the PIIE structure providing the most substantial inductive bias.
Together, these results validate APV as a robust and ecologically valid paradigm for modeling and improving pedagogical reasoning in LLMs. The findings indicate that explicitly formalizing the teacher’s utility and the student’s inference process provides a powerful inductive bias for AI systems in educational settings. Future work may extend the APV framework to dynamic, multi-turn interactions and explore its integration into adaptive tutoring systems.
## References
- S. E. Asch (1956) Studies of independence and conformity: I. A minority of one against a unanimous majority. Psychological Monographs: General and Applied 70(9), pp. 1–70. Cited by: [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p2.1).
- Z. Bai, E. Ge, and J. Hao (2025) Multi-agent collaborative framework for intelligent IT operations: an aoi system with context-aware compression and dynamic task scheduling. arXiv preprint arXiv:2512.13956. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- S. Baron-Cohen, A. M. Leslie, and U. Frith (1985) Does the autistic child have a “theory of mind”? Cognition 21(1), pp. 37–46. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- Z. Bi, L. Chen, J. Song, H. Luo, E. Ge, J. Huang, T. Wang, K. Chen, C. X. Liang, Z. Wei, et al. (2025) Exploring efficiency frontiers of thinking budget in medical reasoning: scaling laws between computational resources and reasoning quality. arXiv:2508.12140. Cited by: [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1).
- C. F. Bond Jr and B. M. DePaulo (2006) Accuracy of deception judgments. Personality and Social Psychology Review 10(3), pp. 214–234. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- S. Bubeck, V. Chandrasekaran, R. Eldan, et al. (2023) Sparks of artificial general intelligence: early experiments with GPT-4. arXiv preprint arXiv:2303.12712. Cited by: [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1).
- H. Chen, J. Peng, D. Min, C. Sun, K. Chen, Y. Yan, X. Yang, and L. Cheng (2025a) MVI-bench: a comprehensive benchmark for evaluating robustness to misleading visual inputs in LVLMs. In Proceedings of the 43rd International Conference on Machine Learning (ICML 2026). Cited by: [§6.4](https://arxiv.org/html/2607.01581#S6.SS4.p1.1).
- K. Chen, Z. Lin, Z. Xu, Y. Shen, Y. Yao, J. Rimchala, J. Zhang, and L. Huang (2025b) R2i-bench: benchmarking reasoning-driven text-to-image generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 12606–12641. Cited by: [§6.4](https://arxiv.org/html/2607.01581#S6.SS4.p1.1).
- Z. Chen, J. Yang, H. Chen, et al. (2024) ToMBench: benchmarking theory of mind in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pp. 14471–14494. Cited by: [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p2.1), [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1).
- P. F. Christiano, J. Leike, T. Brown, et al. (2017) Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30. Cited by: [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p1.1).
- R. B. Cialdini and N. J. Goldstein (2004) Social influence: compliance and conformity. Annual Review of Psychology 55, pp. 591–621. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- X. Deng (2025) Enhancing neural network performance on tabular data via knowledge distillation and rankgauss transformation. In 2025 6th International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE), pp. 418–423. Cited by: [§3.1](https://arxiv.org/html/2607.01581#S3.SS1.p4.1).
- X. Deng (2026) Graph inference towards ICD coding. arXiv preprint arXiv:2601.07496. Cited by: [§3.4](https://arxiv.org/html/2607.01581#S3.SS4.p2.1).
- M. C. Frank and N. D. Goodman (2012) Predicting pragmatic reasoning in language games. Science 336(6084), pp. 998–998. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1), [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1).
- N. D. Goodman and M. C. Frank (2016) Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences 20(11), pp. 818–829. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1), [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1), [§6.1](https://arxiv.org/html/2607.01581#S6.SS1.p1.1).
- N. D. Goodman, J. B. Tenenbaum, J. Feldman, and T. L. Griffiths (2008) A rational analysis of rule-based concept learning. Cognitive Science 32(1), pp. 108–154. Cited by: [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1).
- T. L. Griffiths and J. B. Tenenbaum (2006) Optimal predictions in everyday cognition. Psychological Science 17(9), pp. 767–773. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1), [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1).
- X. Han, X. Gao, X. Qu, and Z. Yu (2025) Multi-agent medical decision consensus matrix system: an intelligent collaborative framework for oncology MDT consultations. arXiv preprint arXiv:2512.14321. Cited by: [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1).
- R. X. Hawkins, A. Stuhlmüller, J. Degen, and N. D. Goodman (2015) Why do you ask? Good questions provoke informative answers. In Proceedings of the 37th Annual Conference of the Cognitive Science Society, pp. 878–883. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- Y. He, S. Li, K. Li, J. Wang, B. Li, T. Shi, Y. Xin, K. Li, J. Yin, M. Zhang, et al. (2025) GE-adapter: a general and efficient adapter for enhanced video editing with pretrained text-to-image diffusion models. Expert Systems with Applications, pp. 129649. Cited by: [§1](https://arxiv.org/html/2607.01581#S1.p3.1).
- D. Hendrycks, C. Burns, S. Basart, et al. (2020) Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300. Cited by: [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1).
- J. Henrich and F. J. Gil-White (2001) The evolution of prestige: freely conferred deference as a mechanism for enhancing the benefits of cultural transmission. Evolution and Human Behavior 22(3), pp. 165–196. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- C. Heyes (2018) Cognitive gadgets: the cultural evolution of thinking. Cambridge, MA: Harvard University Press. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- Y. Huang, B. Li, N. Li, Z. Wang, K. Chen, H. Ge, Q. Si, Y. Shen, R. Yang, G. Wang, and H. Guo (2026) GUI agents for continual game generation. arXiv preprint arXiv:2605.28258. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2605.28258), 2605.28258. Cited by: [§6.4](https://arxiv.org/html/2607.01581#S6.SS4.p1.1).
- Z. Ji, N. Lee, R. Frieske, et al. (2023) Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), pp. 1–38. Cited by: [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p1.1).
- M. Kosinski (2024) Evaluating large language models in theory of mind tasks. arXiv preprint arXiv:2302.02083. Cited by: [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p2.1), [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1).
- T. R. Levine (2014) Truth-default theory (TDT): a theory of human deception and deception detection. Journal of Language and Social Psychology 33(4), pp. 378–392. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- C. X. Liang, P. Tian, C. H. Yin, Y. Yua, W. An-Hou, L. Ming, T. Wang, Z. Bi, and M. Liu (2024) A comprehensive survey and guide to multimodal large language models in vision-language tasks. arXiv preprint arXiv:2411.06284. Cited by: [§6.1](https://arxiv.org/html/2607.01581#S6.SS1.p1.1).
- F. Lieder and T. L. Griffiths (2020) Resource-rational analysis: understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences 43, pp. e1. Cited by: [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1), [§6.1](https://arxiv.org/html/2607.01581#S6.SS1.p1.1).
- S. Lin (2025a) Abductive inference in retrieval-augmented language models: generating and validating missing premises. External Links: 2511.04020, [Link](https://arxiv.org/abs/2511.04020). Cited by: [§1](https://arxiv.org/html/2607.01581#S1.p2.1).
- S. Lin (2025b) Hybrid fuzzing with LLM-guided input mutation and semantic feedback. External Links: 2511.03995, [Link](https://arxiv.org/abs/2511.03995). Cited by: [§1](https://arxiv.org/html/2607.01581#S1.p2.1).
- S. Lin (2025c) LLM-driven adaptive source-sink identification and false positive mitigation for static analysis. External Links: 2511.04023, [Link](https://arxiv.org/abs/2511.04023). Cited by: [§1](https://arxiv.org/html/2607.01581#S1.p2.1).
- H. Mercier and D. Sperber (2017) The enigma of reason. Cambridge, MA: Harvard University Press. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- W. Mu-Jiang-shan, Y. Jun, L. Shang-wei, et al. (2010) Ordered and Hamilton digraphs. Chinese Quarterly Journal of Mathematics 25(3), pp. 317–326. Cited by: [§3.1](https://arxiv.org/html/2607.01581#S3.SS1.p4.1).
- Q. Niu, K. Chen, M. Li, P. Feng, Z. Bi, L. K. Yan, Y. Zhang, C. H. Yin, C. Fei, J. Liu, B. Peng, T. Wang, Y. Wang, S. Chen, and M. Liu (2024a) From text to multimodality: exploring the evolution and impact of large language models in medical practice. External Links: 2410.01812, [Link](https://arxiv.org/abs/2410.01812). Cited by: [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p2.1).
- Q. Niu, J. Liu, Z. Bi, P. Feng, B. Peng, K. Chen, M. Li, L. K. Yan, Y. Zhang, C. H. Yin, C. Fei, T. Wang, Y. Wang, S. Chen, and M. Liu (2024b) Large language models and cognitive science: a comprehensive review of similarities, differences, and challenges. External Links: 2409.02387, [Link](https://arxiv.org/abs/2409.02387). Cited by: [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1).
- Q. Niu, J. Liu, Z. Bi, P. Feng, B. Peng, K. Chen, M. Li, L. K. Yan, Y. Zhang, C. H. Yin, et al. (2024c) Large language models and cognitive science: a comprehensive review of similarities, differences, and challenges. BIO Integration. Cited by: [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1).
- K. Oktar, T. R. Sumers, and T. L. Griffiths (2024) A rational model of epistemic vigilance. In Proceedings of the 46th Annual Conference of the Cognitive Science Society. Cited by: [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1), [§4.2.1](https://arxiv.org/html/2607.01581#S4.SS2.SSS1.p1.3).
- K. Oktar, X. Wu, C. Liu, T. R. Sumers, and T. L. Griffiths (2025) Are large language models sensitive to the motives behind communication? In Advances in Neural Information Processing Systems, Vol. 38. Cited by: [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1).
- L. Ouyang, J. Wu, X. Jiang, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744. Cited by: [§1](https://arxiv.org/html/2607.01581#S1.p2.1), [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p1.1).
- C. Pan, Y. Qu, Y. Yao, and M. Wang (2024) HybridGNN: a self-supervised graph neural network for efficient maximum matching in bipartite graphs. Symmetry 16(12), pp. 1631. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- B. Peng, Z. Bi, Q. Niu, M. Liu, P. Feng, T. Wang, L. K. Yan, Y. Wen, Y. Zhang, and C. H. Yin (2024a) Jailbreaking and mitigation of vulnerabilities in large language models. Algorithms and Applications in Artificial Intelligence and Autonomous Systems. Cited by: [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p1.1).
- B. Peng, K. Chen, M. Li, P. Feng, Z. Bi, J. Liu, and Q. Niu (2024b) Securing large language models: addressing bias, misinformation, and prompt attacks. arXiv preprint arXiv:2409.08087. Cited by: [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p1.1).
- E. Perez, S. Ringer, K. Lukošiūtė, et al. (2022) Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251. Cited by: [§1](https://arxiv.org/html/2607.01581#S1.p2.1), [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p1.1).
- A. Perfors, J. B. Tenenbaum, T. L. Griffiths, and F. Xu (2011) A tutorial introduction to Bayesian models of cognitive development. Cognition 120(3), pp. 302–321. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- D. Premack and G. Woodruff (1978) Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences 1(4), pp. 515–526. Cited by: [§2.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1).
- H. Qi, Z. Hu, Z. Yang, J. Zhang, J. J. Wu, C. Cheng, C. Wang, and L. Zheng (2022) Capacitive aptasensor coupled with microfluidic enrichment for real-time detection of trace SARS-CoV-2 nucleocapsid protein. Analytical Chemistry 94(6), pp. 2812–2819. Cited by: [§1](https://arxiv.org/html/2607.01581#S1.p1.1).
- D. Qu and Y. Ma (2025) Magnet-bn: Markov-guided Bayesian neural networks for calibrated long-horizon sequence forecasting and community tracking. Mathematics 13(17), pp. 2740. Cited by: [§1](https://arxiv.org/html/2607.01581#S1.p1.1).
- M. Sap, R. LeBras, D. Fried, and Y. Choi (2022) Neural theory-of-mind? On the limits of social intelligence in large LMs. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3762–3780. Cited by: [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p2.1), [§2.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1).
- M. Sharma, M. Tong, T. Korbak, et al. (2024) Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548. Cited by: [§1](https://arxiv.org/html/2607.01581#S1.p2.1), [§2.2](https://arxiv.org/html/2607.01581#S2.SS2.p1.1).
- D\. Sperber, F\. Clément, C\. Heintz, O\. Mascaro, H\. Mercier, G\. Origgi, and D\. Wilson \(2010\)Epistemic vigilance\.Mind & Language25\(4\),pp\. 359–393\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p1.1),[§2\.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1)\.
- Y\. Tian, Z\. Yang, C\. Liu, Y\. Su, Z\. Hong, Z\. Gong, and J\. Xu \(2025\)CenterMamba\-sam: center\-prioritized scanning and temporal prototypes for brain lesion segmentation\.External Links:2511\.01243,[Link](https://arxiv.org/abs/2511.01243)Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p2.1)\.
- M\. Tomasello, M\. Carpenter, J\. Call, T\. Behne, and H\. Moll \(2005\)Understanding and sharing intentions: the origins of cultural cognition\.Behavioral and Brain Sciences28\(5\),pp\. 675–691\.Cited by:[§2\.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1)\.
- H\. Wang, X\. Zhang, Y\. Xia, and X\. Wu \(2023\)An intelligent blockchain\-based access control framework with federated learning for genome\-wide association studies\.Computer Standards & Interfaces84,pp\. 103694\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p2.1)\.
- M\. Wang, W\. Yang, and S\. Wang \(2013\)Conditional matching preclusion number for the cayley graph on the symmetric group\.Acta Math\. Appl\. Sin\.\(Chinese Series\)36\(5\),pp\. 813–820\.Cited by:[§3\.4](https://arxiv.org/html/2607.01581#S3.SS4.p2.1)\.
- M\. Wang and S\. Wang \(2016\)Diagnosability of cayley graph networks generated by transposition trees under the comparison diagnosis model\.Annals of Applied Mathematics32\(2\),pp\. 166–173\.Cited by:[§3\.4](https://arxiv.org/html/2607.01581#S3.SS4.p2.1)\.
- M\. Wang, S\. Xu, J\. Jiang, D\. Xiang, and S\. Hsieh \(2025\)Global reliable diagnosis of networks based on self\-comparative diagnosis model and g\-good\-neighbor property\.Journal of Computer and System Sciences,pp\. 103698\.Cited by:[§3\.4](https://arxiv.org/html/2607.01581#S3.SS4.p2.1)\.
- S\. Wang, M\. Wang, K\. Feng, S\. Lin, and M\. Zhang \(2012\)Relation of the isolated scattering number of a graph and its complement graph\.Journal of Shanxi University \(Natural Science Edition\)35\(2\),pp\. 206–210\.Cited by:[§2\.2](https://arxiv.org/html/2607.01581#S2.SS2.p1.1)\.
- S\. Wang and M\. Wang \(2018\)The edge connectivity of expanded k\-ary n\-cubes\.Discrete Dynamics in Nature and Society2018\(1\),pp\. 7867342\.Cited by:[§3\.1](https://arxiv.org/html/2607.01581#S3.SS1.p4.1)\.
- S\. Wang and M\. Wang \(2019\)A note on the connectivity of m\-ary n\-dimensional hypercubes\.Parallel Processing Letters29\(04\),pp\. 1950017\.Cited by:[§3\.1](https://arxiv.org/html/2607.01581#S3.SS1.p4.1)\.
- S\. Wang, J\. Wangmu, Z\. Qi, and Y\. Ren \(2011\)Embedding paths into the 4\-ary n\-cube with faulty nodes\.In2011 International Conference on Consumer Electronics, Communications and Networks \(CECNet\),pp\. 4949–4951\.Cited by:[§2\.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1)\.
- Y\. Wang and S\. Sayil \(2024\)Soft error evaluation and mitigation in gate diffusion input circuits\.In2024 IEEE 6th International Conference on Power, Intelligent Computing and Systems \(ICPICS\),pp\. 121–128\.Cited by:[§3\.1](https://arxiv.org/html/2607.01581#S3.SS1.p4.1)\.
- Y\. Wang \(2024\)Low\-power design of advanced image processing algorithms under fpga in real\-time applications\.In2024 IEEE 4th International Conference on Power, Electronics and Computer Applications \(ICPECA\),pp\. 1080–1084\.Cited by:[§2\.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1)\.
- Y\. Wang \(2025\)Zynq soc\-based acceleration of retinal blood vessel diameter measurement\.Archives of Advanced Engineering Science,pp\. 1–9\.Cited by:[§2\.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1)\.
- R\. Watson and T\. J\. Morgan \(2025\)An experimental test of epistemic vigilance: competitive incentives increase dishonesty and reduce social influence\.Cognition254,pp\. 105987\.Cited by:[§4\.1\.1](https://arxiv.org/html/2607.01581#S4.SS1.SSS1.p1.2)\.
- A\. Wei, N\. Haghtalab, and J\. Steinhardt \(2024\)Jailbroken: how does llm safety training fail?\.Advances in Neural Information Processing Systems36\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p2.1),[§2\.2](https://arxiv.org/html/2607.01581#S2.SS2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in Neural Information Processing Systems35,pp\. 24824–24837\.Cited by:[§2\.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1)\.
- Z\. Wei, H\. An, Y\. Yao, W\. Su, G\. Li, Saifullah, B\. Sun, and M\. Wang \(2025a\)FSTGAT: financial spatio\-temporal graph attention network for non\-stationary financial systems and its application in stock price prediction\.Symmetry17\(8\),pp\. 1344\.Cited by:[§3\.4](https://arxiv.org/html/2607.01581#S3.SS4.p2.1)\.
- Z\. Wei, P\. Hu, S\. Lang, H\. Yan, L\. Mei, Y\. Zhang, C\. Yang, J\. Hao, and Z\. Han \(2025b\)Automated red\-teaming framework for large language model security assessment: a comprehensive attack generation and detection system\.arXiv preprint arXiv:2512\.20677\.Cited by:[§2\.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1)\.
- H\. M\. Wellman, D\. Cross, and J\. Watson \(2001\)Meta\-analysis of theory\-of\-mind development: the truth about false belief\.Child Development72\(3\),pp\. 655–684\.Cited by:[§2\.2](https://arxiv.org/html/2607.01581#S2.SS2.p2.1)\.
- H\. Wimmer and J\. Perner \(1983\)Beliefs about beliefs: representation and constraining function of wrong beliefs in young children’s understanding of deception\.Cognition13\(1\),pp\. 103–128\.Cited by:[§2\.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1)\.
- X\. Wu, J\. Dong, W\. Bao, B\. Zou, L\. Wang, and H\. Wang \(2024a\)Augmented intelligence of things for emergency vehicle secure trajectory prediction and task offloading\.IEEE Internet of Things Journal11\(22\),pp\. 36030–36043\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p2.1)\.
- X\. Wu, H\. Wang, W\. Tan, D\. Wei, and M\. Shi \(2020\)Dynamic allocation strategy of vm resources with fuzzy transfer learning method\.Peer\-to\-Peer Networking and Applications13\(6\),pp\. 2201–2213\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p2.1)\.
- X\. Wu, H\. Wang, Y\. Zhang, B\. Zou, and H\. Hong \(2024b\)A tutorial\-generating method for autonomous online learning\.IEEE Transactions on Learning Technologies17,pp\. 1532–1541\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p2.1)\.
- X\. Wu, Y\. Zhang, K\. Lai, M\. Yang, G\. Yang, and H\. Wang \(2024c\)A novel centralized federated deep fuzzy neural network with multi\-objectives neural architecture search for epistatic detection\.IEEE Transactions on Fuzzy Systems33\(1\),pp\. 94–107\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p1.1)\.
- X\. Wu, Y\. Zhang, M\. Shi, P\. Li, R\. Li, and N\. N\. Xiong \(2022\)An adaptive federated learning scheme with differential privacy preserving\.Future Generation Computer Systems127,pp\. 362–372\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p1.1)\.
- D\. Xiang, S\. Hsieh,et al\.\(2025\)G\-good\-neighbor diagnosability under the modified comparison model for multiprocessor systems\.Theoretical Computer Science1028,pp\. 115027\.Cited by:[§2\.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1)\.
- Y\. Xin, J\. Du, Q\. Wang, Z\. Lin, and K\. Yan \(2024\)Vmt\-adapter: parameter\-efficient transfer learning for multi\-task dense scene understanding\.InProceedings of the AAAI conference on artificial intelligence,Vol\.38,pp\. 16085–16093\.Cited by:[§2\.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1)\.
- Y\. Xin, Q\. Qin, S\. Luo, K\. Zhu, J\. Yan, Y\. Tai, J\. Lei, Y\. Cao, K\. Wang, Y\. Wang,et al\.\(2025a\)Lumina\-dimoo: an omni diffusion large language model for multi\-modal generation and understanding\.arXiv preprint arXiv:2510\.06308\.Cited by:[§2\.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1)\.
- Y\. Xin, J\. Yan, Q\. Qin, Z\. Li, D\. Liu, S\. Li, V\. S\. Huang, Y\. Zhou, R\. Zhang, L\. Zhuo,et al\.\(2025b\)Lumina\-mgpt 2\.0: stand\-alone autoregressive image modeling\.arXiv preprint arXiv:2507\.17801\.Cited by:[§2\.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1)\.
- S\. Xu, H\. L\. Kao, T\. Xu, H\. Zhang, J\. Wang, R\. Ding, G\. Liu, T\. Shi, Z\. Yu, G\. Pan,et al\.\(2025\)Adaptive detector\-verifier framework for zero\-shot polyp detection in open\-world settings\.arXiv preprint arXiv:2512\.12492\.Cited by:[§2\.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1)\.
- C\. Yang, Y\. He, A\. X\. Tian, D\. Chen, J\. Wang, T\. Shi, A\. Heydarian, and P\. Liu \(2025\)Wcdt: world\-centric diffusion transformer for traffic scene generation\.In2025 IEEE International Conference on Robotics and Automation \(ICRA\),pp\. 6566–6572\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p3.1)\.
- M\. You, K\. Chen, and D\. Cheng \(2026\)Drdgrl: dual\-relational dynamic graph representation learning for delay\-sensitive stock trend prediction\.InInternational Conference on Database Systems for Advanced Applications,pp\. 35–50\.Cited by:[§6\.4](https://arxiv.org/html/2607.01581#S6.SS4.p1.1)\.
- W\. You, Z\. Yu, Z\. Han, X\. Liu, and Y\. Zhang \(2025\)Large language models for enhanced user experience in virtual and augmented reality: a comprehensive framework for ranking and recommendation systems\.Available at SSRN 5964834\.Cited by:[§2\.3](https://arxiv.org/html/2607.01581#S2.SS3.p2.1)\.
- L\. Yu, X\. Han, Y\. Kang, C\. Tseng, D\. Zhang, Z\. Bi, and Z\. Han \(2025a\)Affective multimodal agents with proactive knowledge grounding for emotionally aligned marketing dialogue\.arXiv preprint arXiv:2511\.21728\.Cited by:[§2\.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1)\.
- Z\. Yu, M\. Y\. I\. Idris, P\. Wang, Y\. Xia, and Y\. Xiang \(2025b\)Forgetme: benchmarking the selective forgetting capabilities of generative models\.Engineering Applications of Artificial Intelligence161,pp\. 112087\.Cited by:[§3\.2\.2](https://arxiv.org/html/2607.01581#S3.SS2.SSS2.p1.8)\.
- Z\. Yu, J\. Wang, and M\. Y\. I\. Idris \(2025c\)Iidm: improved implicit diffusion model with knowledge distillation to estimate the spatial distribution density of carbon stock in remote sensing imagery\.Knowledge\-Based Systems,pp\. 115131\.Cited by:[§3\.4](https://arxiv.org/html/2607.01581#S3.SS4.p2.1)\.
- Z\. Yu \(2025\)Ai for science: a comprehensive review on innovations, challenges, and future directions\.International Journal of Artificial Intelligence for Science \(IJAI4S\)1\(1\)\.Cited by:[§2\.1](https://arxiv.org/html/2607.01581#S2.SS1.p1.1)\.
- C\. Zhang, B\. Peng, X\. Sun, Q\. Niu, J\. Liu, K\. Chen, M\. Li, P\. Feng, Z\. Bi, M\. Liu,et al\.\(2024\)From word vectors to multimodal embeddings: techniques, applications, and future directions for large language models\.arXiv preprint arXiv:2411\.05036\.Cited by:[§2\.3](https://arxiv.org/html/2607.01581#S2.SS3.p1.1)\.
- H\. Zhang, X\. Mao, G\. Dong, Z\. Li, X\. Su, K\. Chen, J\. Yang, and Z\. Lin \(2026\)MemMark: state\-evolution attribution watermarking for agent long\-term memory systems\.arXiv preprint arXiv:2605\.25002\.Cited by:[§6\.4](https://arxiv.org/html/2607.01581#S6.SS4.p1.1)\.
- L\. Zhao, M\. Wang, X\. Zhang, Y\. Lin, and S\. Wang \(2017\)An algorithm for the orientation of complete bipartite graphs\.In2017 International Conference on Applied Mathematics, Modelling and Statistics Application \(AMMSA 2017\),pp\. 361–364\.Cited by:[§3\.1](https://arxiv.org/html/2607.01581#S3.SS1.p4.1)\.
- Q\. Zhao, Z\. Dou, D\. Zhang, X\. Li, C\. Song, Z\. Wan, X\. Li, Y\. Zhang, K\. Chen, Q\. Pan,et al\.\(2026\)STRIDE: strategic trajectory reasoning via discriminative estimation for verifiable reinforcement learning\.arXiv preprint arXiv:2606\.15866\.Cited by:[§6\.4](https://arxiv.org/html/2607.01581#S6.SS4.p1.1)\.
- Y\. Zhou, Y\. He, Y\. Su, S\. Han, J\. Jang, G\. Bertasius, M\. Bansal, and H\. Yao \(2025\)ReAgent\-v: a reward\-driven multi\-agent framework for video understanding\.arXiv preprint arXiv:2506\.01300\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p1.1)\.
- A\. Zou, Z\. Wang, J\. Z\. Kolter, and M\. Fredrikson \(2023\)Universal and transferable adversarial attacks on aligned language models\.arXiv preprint arXiv:2307\.15043\.Cited by:[§1](https://arxiv.org/html/2607.01581#S1.p2.1),[§2\.2](https://arxiv.org/html/2607.01581#S2.SS2.p1.1)\.