SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

arXiv cs.CL Papers

Summary

Introduces SICI, a seven-dimensional diagnostic measure to assess semantic-pragmatic complexity in stance detection for LLMs, revealing regime shifts in error patterns across models and prompting strategies.

arXiv:2606.13189v1 Announce Type: new Abstract: Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target--text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution--abstention axis rather than removing the high-complexity bottleneck.

Cached at: 06/12/26, 08:51 AM

# SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection
Source: [https://arxiv.org/html/2606.13189](https://arxiv.org/html/2606.13189)
Fuqiang Niu¹, Bowen Zhang². ¹School of Cyber Science and Technology, University of Science and Technology of China, Hefei, China. ²School of Artificial Intelligence, Shenzhen Technology University, Shenzhen, China.

###### Abstract

Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target–text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution–abstention axis rather than removing the high-complexity bottleneck.


## 1 Introduction

Stance detection asks whether a text expresses Favor, Against, or None toward a given target (Mohammad et al., [2016](https://arxiv.org/html/2606.13189#bib.bib1); Aldayel and Magdy, [2021](https://arxiv.org/html/2606.13189#bib.bib2)). With the rise of large language models (LLMs), recent work increasingly treats stance detection as prompt-based inference: a model receives the text, target, label definitions, and sometimes demonstrations or reasoning instructions, then outputs one of the stance labels. This paradigm has naturally led to richer elicitation strategies, including zero-shot and few-shot prompting, chain-of-thought reasoning, evidence-oriented prompting, retrieval augmentation, and multi-agent discussion or voting.

These strategies are motivated by a plausible assumption: if an LLM already possesses the relevant linguistic and world knowledge, then clearer instructions, more examples, explicit reasoning, or additional context should elicit better stance judgments. The empirical picture for hard stance examples is less settled. Prior work has explored chain-of-thought, counterfactual, and knowledge-enhanced prompting for stance detection (Weinzierl and Harabagiu, [2024](https://arxiv.org/html/2606.13189#bib.bib3); Taranukhin et al., [2024](https://arxiv.org/html/2606.13189#bib.bib4); Zhang et al., [2024](https://arxiv.org/html/2606.13189#bib.bib5)), while recent work also studies multi-path reasoning, implicit target augmentation, and stereotype-sensitive LLM evaluation (Zhang et al., [2025](https://arxiv.org/html/2606.13189#bib.bib6); Ji et al., [2025](https://arxiv.org/html/2606.13189#bib.bib7); Dubreuil et al., [2025](https://arxiv.org/html/2606.13189#bib.bib8)). These studies suggest that additional reasoning or external information does not automatically yield reliable gains when the stance is implicit or the target-related evidence is weak. This raises a basic question: what limits prompt-based LLM stance detection on hard examples?

We argue that this limitation cannot be understood solely as a problem of how the model is prompted. Stance examples differ substantially in the inferential burden they impose. Some texts explicitly mention the target and state a direct attitude. Others discuss an adjacent issue, rely on pragmatic implication or background knowledge, contain sentiment whose target is ambiguous, or simply lack enough evidence to support a stance judgment (zhang2022sentiment). Hard examples are therefore not merely harder versions of the same label-selection problem: they may require qualitatively different amounts of semantic-pragmatic inference before a stance label can be licensed.

To study this source of variation, we introduce SICI, the Stance Inference Complexity Index. SICI combines seven semantic-pragmatic dimensions: target visibility, scope alignment, pragmatic implicitness, knowledge requirement, context dependence, label ambiguity, and polarity–stance gap. It is not intended to replace standard metrics such as accuracy, macro-F1, or the SemEval $F_{\mathrm{avg}}$ score. Instead, it provides an axis for asking a different question: how do LLM errors change as stance inference becomes more complex?

Our main finding is that LLM errors do not degrade smoothly with SICI. Instead, they exhibit a phase-transition-like shift between error regimes. At low complexity, models tend to over-attribute stance, with a pronounced bias toward Against. At intermediate complexity, accuracy drops sharply, forming an unstable boundary region. At high complexity, predictions rapidly concentrate on None, producing an abstention-dominated regime. Piecewise regression significantly outperforms a linear model, indicating that this pattern is not well explained as a simple monotonic decline in difficulty.

This regime structure is qualitatively consistent across models and datasets. Stronger models improve aggregate performance and move transition boundaries, but they do not remove the shift itself: GPT-4o reduces low-complexity Against over-prediction while becoming more prone to None predictions in high-complexity regions. Cross-dataset analysis further shows that None is not semantically uniform. In SemEval, high-SICI examples often contain implicit stance, so None predictions frequently reflect false abstention. In VAST, many high-SICI examples are genuinely underspecified and labeled None, making abstention more often appropriate.

Finally, we return to the prompting bottleneck through a systematic intervention study. We evaluate a broad set of prompting-based interventions, including chain-of-thought, multi-step reasoning, SICI-aware prompting, retrieval augmentation, multi-agent debate, and voting. These interventions do not reliably break the high-complexity bottleneck. Instead, they often move the model along the attribution–abstention axis: prompts that encourage indirect inference can reduce some None predictions but increase false attribution, whereas prompts that emphasize caution or scope clarification can reduce over-attribution while pushing the model toward excessive None.

Our contributions are:

\(1\) We identify a prompting bottleneck in LLM stance detection: hard examples are not only under\-served by stronger elicitation, but organized by semantic\-pragmatic inference complexity\.

(2) We introduce SICI, a seven-dimensional diagnostic measure of stance inference complexity, and show that it explains model behavior beyond surface features such as length, lexical diversity, and negation density.

(3) We show that LLM stance errors exhibit a phase-transition-like regime shift: from low-complexity over-attribution, through an unstable boundary region, to high-complexity None abstention.

\(4\) We demonstrate that this structure is qualitatively robust across models and datasets, and that prompting, retrieval, debate, and voting interventions often trade false attribution against false abstention rather than resolving both\.

## 2 Related Work

### Stance detection\.

Stance detection has been formalized in shared tasks and benchmark datasets such as SemEval-2016 Task 6 (Mohammad et al., [2016](https://arxiv.org/html/2606.13189#bib.bib1)), VAST (Allaway and McKeown, [2020](https://arxiv.org/html/2606.13189#bib.bib9)), P-Stance (Li et al., [2021](https://arxiv.org/html/2606.13189#bib.bib10)), and multi-target or conversation-based variants (Wei et al., [2018](https://arxiv.org/html/2606.13189#bib.bib11); Li et al., [2023b](https://arxiv.org/html/2606.13189#bib.bib12), [c](https://arxiv.org/html/2606.13189#bib.bib13)). Earlier neural approaches use target-specific attention, memory networks, graph models, and transfer learning to model the relation between a text and a target (Du et al., [2017](https://arxiv.org/html/2606.13189#bib.bib14); Wei and Mao, [2019](https://arxiv.org/html/2606.13189#bib.bib15); Liang et al., [2021](https://arxiv.org/html/2606.13189#bib.bib16); Zhang et al., [2020](https://arxiv.org/html/2606.13189#bib.bib17)). Pretrained language models and tweet-specific encoders such as BERT and BERTweet further improve in-domain and cross-target performance (Devlin et al., [2019](https://arxiv.org/html/2606.13189#bib.bib18); Nguyen et al., [2020](https://arxiv.org/html/2606.13189#bib.bib19)). Recent work also incorporates background knowledge, commonsense, or prompt-based reasoning (Liu et al., [2021](https://arxiv.org/html/2606.13189#bib.bib20); Li et al., [2023a](https://arxiv.org/html/2606.13189#bib.bib21); Ding et al., [2024b](https://arxiv.org/html/2606.13189#bib.bib22)). Our work is complementary: rather than proposing another stance classifier, we ask which instance properties systematically determine when LLM classifiers fail.

### Zero\-shot and LLM\-based stance inference\.

Zero-shot stance detection is challenging because test targets may not appear during training (Allaway and McKeown, [2020](https://arxiv.org/html/2606.13189#bib.bib9); Liang et al., [2022](https://arxiv.org/html/2606.13189#bib.bib23); Allaway et al., [2021](https://arxiv.org/html/2606.13189#bib.bib24)). LLM prompting provides a natural zero-shot interface, and chain-of-thought or explanation-based variants have been explored for implicit stance and social media reasoning (Wei et al., [2022b](https://arxiv.org/html/2606.13189#bib.bib25); Gatto et al., [2023](https://arxiv.org/html/2606.13189#bib.bib26); Ding et al., [2024a](https://arxiv.org/html/2606.13189#bib.bib27)). Recent LLM-era stance work has explored reasoning over ideological or tree-structured perspectives, retrieval-augmented knowledge, and stance-specific prompting for zero-shot targets (zhang2024knowledge; Taranukhin et al., [2024](https://arxiv.org/html/2606.13189#bib.bib4); Zhang et al., [2024](https://arxiv.org/html/2606.13189#bib.bib5)). Other recent studies emphasize multi-path reasoning for interpretability (Zhang et al., [2025](https://arxiv.org/html/2606.13189#bib.bib6)), LLM-driven implicit target augmentation for target-sparse examples (Ji et al., [2025](https://arxiv.org/html/2606.13189#bib.bib7)), and stereotype-sensitive evaluation of zero-shot LLM stance detection (Dubreuil et al., [2025](https://arxiv.org/html/2606.13189#bib.bib8); zhang2025logic). We differ from these lines by treating difficulty itself as the object of measurement. Instead of adding reasoning paths, targets, or fairness probes, SICI asks which semantic-pragmatic properties predict when such systems fail.

### Difficulty, calibration, and regime shifts\.

Instance difficulty is often estimated from model uncertainty, agreement, or surface features. Such signals are useful but can become circular when the same model both defines and evaluates difficulty. We instead define SICI using semantic-pragmatic attributes of the target–text pair, then validate its agreement across scorers and its relationship to independent prediction behavior. Our notion of a regime shift is inspired by work on emergent abilities and phase-transition-like behavior in machine learning systems (Wei et al., [2022a](https://arxiv.org/html/2606.13189#bib.bib28)). Unlike scaling-law studies, which analyze model capability as a function of model or compute scale, we study behavioral transitions as a function of *instance complexity*.

## 3 The SICI Framework

### Seven Dimensions\.

SICI assigns each target–text pair seven 0–4 scores covering target visibility, scope alignment, pragmatic implicitness, knowledge need, context dependence, label ambiguity, and polarity–stance mismatch (Table [1](https://arxiv.org/html/2606.13189#S3.T1)). Details of the corresponding inference stages are given in Appendix [A](https://arxiv.org/html/2606.13189#A1).

| Dimension | Question measured | Low score | High-score example |
| --- | --- | --- | --- |
| $V$: Target visibility | Is the target explicitly mentioned? | target named directly | stance toward a politician inferred from a policy tweet |
| $S$: Scope alignment | Is the text mainly about the target? | text is on target | text discusses a related but shifted topic |
| $P$: Pragmatic implicitness | Is stance expressed indirectly? | direct support/opposition | sarcasm, metaphor, rhetorical implication |
| $K$: Knowledge requirement | Is background knowledge needed? | self-contained text | requires knowing a law, event, or group relation |
| $C$: Context dependency | Is external conversational context needed? | standalone text | reply or quote lacking prior context |
| $A$: Label ambiguity | Is the gold label semantically contestable? | clear label boundary | mixed or underspecified stance |
| $G$: Polarity–stance gap | Does sentiment polarity align with stance? | sentiment matches stance | positive affect used to oppose via irony |

Table 1: The seven semantic-pragmatic dimensions of SICI. Each dimension is scored from 0 (low inference burden) to 4 (high inference burden). Together they track a multi-stage stance inference chain from target identification to polarity–stance bridging.

### Index Definition.

Given seven scores $d_1,\ldots,d_7 \in \{0,1,2,3,4\}$, we define:

$$
\mathrm{SICI}(x,t) = 0.65 \cdot \frac{\mathrm{mean}(d_1,\ldots,d_7)}{4} + 0.35 \cdot \frac{\max(d_1,\ldots,d_7)}{4}. \tag{1}
$$

The index ranges from 0 to 1. The mean term captures cumulative load, while the max term captures bottlenecks where one severe ambiguity can dominate the instance. We use equal dimension weights because a pilot comparison against a manually weighted variant produced almost identical scores (Pearson $r=0.9996$).
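As a concrete reading of Eq. (1), the following minimal sketch computes the index from seven dimension scores. The function name `sici` and the example scores are illustrative assumptions, not taken from the authors' code.

```python
from statistics import mean


def sici(scores):
    """Combine seven 0-4 dimension scores into an index in [0, 1] (Eq. 1)."""
    if len(scores) != 7:
        raise ValueError("expected exactly seven dimension scores")
    if any(s not in (0, 1, 2, 3, 4) for s in scores):
        raise ValueError("each score must be an integer from 0 to 4")
    mean_term = mean(scores) / 4  # cumulative inference load
    max_term = max(scores) / 4    # single-dimension bottleneck
    return 0.65 * mean_term + 0.35 * max_term


# Hypothetical instance: the target is invisible (V=4), other dimensions are mild.
example = [4, 1, 1, 0, 0, 1, 0]
print(round(sici(example), 4))
```

In this hypothetical example the mean score is only 1, yet the single maximal dimension lifts the index to about 0.51, which illustrates how the max term lets one severe ambiguity dominate.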

### Scoring Protocol and Reliability\.

The main SICI scores are produced by GPT-4o-mini using dimension-specific rubrics and independent 0–4 judgments. To test whether SICI is a model idiosyncrasy, we additionally score a stratified sample of 200 instances with Claude Haiku 4.5 and DeepSeek-V3. Pairwise Spearman correlations are high: 0.829 (GPT vs. Claude), 0.853 (GPT vs. DeepSeek), and 0.884 (Claude vs. DeepSeek). The three-way ordinal Krippendorff’s $\alpha$ is 0.771, above the conventional threshold for substantial agreement. This does not remove the need for human validation, but it supports the claim that SICI captures stable properties of the target–text pair rather than a single model’s prediction preference.
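For readers who want to reproduce a pairwise agreement figure on their own scores, here is a minimal pure-Python sketch of Spearman's rank correlation with average ranks for ties. The helper names are illustrative, and this is not the authors' evaluation code.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(scores_a, scores_b)` on the SICI values that two scorers assign to the same instances gives one pairwise coefficient. Krippendorff's $\alpha$ needs a separate computation that is not sketched here.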

The scoring prompt is deliberately separated from stance prediction. The scorer is asked to judge the seven attributes of the instance, not to predict Favor, Against, or None. This separation matters because a difficulty measure based on the same prediction confidence that later enters the evaluation would risk circularity. In our analysis, SICI is computed before any intervention comparison and is held fixed across all downstream models and prompts. The same SICI value is therefore used to analyze GPT-3.5, GPT-4o-mini, DeepSeek-V3, GPT-4o, and all intervention variants.

### From Dimensions to Regimes\.

SICI supports two complementary uses. First, as a continuous score, it orders instances by expected inference complexity. Second, with empirically fitted boundaries, it partitions the data into regimes. We use two transition points, $b_1=0.45$ and $b_2=0.70$, obtained from segmented-regression analysis on SemEval. Low-complexity instances fall below $b_1$; intermediate instances occupy the boundary region where over-attribution and abstention compete; high-complexity instances exceed $b_2$ and are dominated by target invisibility, scope mismatch, and None-oriented abstention behavior. The thresholds are not intended as universal constants. Their role is diagnostic: they expose where a model changes its decision strategy as the target–text relation becomes less direct.
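The partition into regimes can be sketched as a simple threshold function. The function name `sici_phase` is hypothetical. Placing scores exactly at a boundary in the higher phase is an assumption that follows the phase definitions used in the paper's merged SemEval+VAST analysis (SICI < 0.45, 0.45 ≤ SICI < 0.70, SICI ≥ 0.70).

```python
def sici_phase(score, b1=0.45, b2=0.70):
    """Map a SICI score in [0, 1] to phase 1 (low), 2 (boundary), or 3 (high)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SICI scores lie in [0, 1]")
    if score < b1:
        return 1  # low complexity: over-attribution regime
    if score < b2:
        return 2  # intermediate: unstable boundary region
    return 3      # high complexity: abstention-dominated regime


print([sici_phase(s) for s in (0.20, 0.45, 0.69, 0.70, 0.90)])
```

Because the thresholds are described as diagnostic rather than universal, `b1` and `b2` are left as parameters so that model-specific boundaries can be substituted.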

## 4 Experimental Setup

### Datasets\.

We evaluate on two core benchmarks. SemEval-2016 Task 6 contains English tweets labeled Favor, Against, or None toward social and political targets (Mohammad et al., [2016](https://arxiv.org/html/2606.13189#bib.bib1)). Our main SemEval analysis uses 1,249 test instances across five targets. VAST is a zero-shot stance dataset designed for unseen-topic generalization (Allaway and McKeown, [2020](https://arxiv.org/html/2606.13189#bib.bib9)); we use the unseen-topic test split with 1,460 instances. We also report supporting cross-dataset model scores on MTSD and P-Stance where available, but the central cross-dataset regime analysis uses SemEval and VAST.

| Dataset | $N$ | Targets | Labels | Mean SICI |
| --- | --- | --- | --- | --- |
| SemEval | 1,249 | 5 | F/A/N | $\sim 0.35$ |
| VAST unseen | 1,460 | many | F/A/N | $\sim 0.38$ |
| MTSD | 500 | 2 | F/A/N | $\sim 0.33$ |
| P-Stance | 777 | 1 | F/A | 0.267 |

Table 2: Datasets used in the analysis. F/A/N denotes Favor, Against, and None.

### Models and Metrics.

We evaluate GPT-3.5-turbo, GPT-4o-mini, DeepSeek-V3, and GPT-4o under zero-shot prompting. SemEval and P-Stance use $F_{\mathrm{avg}} = (F1_{\text{Favor}} + F1_{\text{Against}})/2$, the official SemEval-style metric that excludes None from the average. MTSD and VAST use macro-F1 or accuracy depending on the analysis; for cross-dataset phase analysis we focus on accuracy within SICI regions and separately inspect None label composition to avoid inflated high-SICI scores.
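The $F_{\mathrm{avg}}$ definition above can be sketched in a few lines. The label strings (`"FAVOR"`, `"AGAINST"`, `"NONE"`) and function names are illustrative assumptions, and returning 0 for a class with no true positives is a convention chosen here.

```python
def f1_for_label(gold, pred, label):
    """Per-class F1; returns 0.0 when the class has no true positives."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f_avg(gold, pred):
    """Mean of the Favor and Against F1 scores; None is not averaged in."""
    return (f1_for_label(gold, pred, "FAVOR")
            + f1_for_label(gold, pred, "AGAINST")) / 2


gold = ["FAVOR", "AGAINST", "NONE", "AGAINST"]
pred = ["FAVOR", "NONE", "NONE", "AGAINST"]
print(round(f_avg(gold, pred), 4))
```

None still affects the score indirectly: an Against example predicted as None counts as a false negative for Against, so a model cannot raise $F_{\mathrm{avg}}$ simply by abstaining.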

We report both instance-level and binned statistics. Instance-level correlations test whether higher SICI predicts correctness at the sample level. Binned correlations test whether the aggregate trend is monotonic after grouping instances into SICI intervals. This distinction is important for stance detection because individual labels are noisy, while regime-level behavior can still be stable. For SemEval, we also report $F_{\mathrm{avg}}$ because it is the official metric and prevents a high None prior from artificially improving the score.
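A binned statistic of this kind can be sketched as follows. The choice of ten equal-width bins and the function name `binned_accuracy` are assumptions for illustration; the paper's exact binning (for example, the 18 bins used in the segmented regression) may differ.

```python
def binned_accuracy(scores, correct, n_bins=10):
    """Group instances into equal-width SICI bins; return {bin: (count, accuracy)}."""
    counts = [0] * n_bins
    hits = [0] * n_bins
    for score, is_correct in zip(scores, correct):
        # The small epsilon guards against float error at bin edges;
        # a score of exactly 1.0 is placed in the last bin.
        idx = min(int(score * n_bins + 1e-9), n_bins - 1)
        counts[idx] += 1
        hits[idx] += int(bool(is_correct))
    return {i: (counts[i], hits[i] / counts[i])
            for i in range(n_bins) if counts[i]}


scores = [0.05, 0.12, 0.18, 0.95, 1.0]
correct = [1, 1, 0, 0, 1]
print(binned_accuracy(scores, correct))
```

The per-bin accuracies can then be rank-correlated with the bin index to obtain a binned trend statistic.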

We also compute two diagnostic quantities. The first is the systematic-bias rate: the proportion of examples where the model predicts Against when the gold label is not Against, plus the proportion where it predicts None when the gold label is not None. The second is BII, a bias–SICI interaction index that compares confidence and correctness within SICI bins. BII is intended to capture cases where the model is not merely wrong, but wrong with high confidence.
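The systematic-bias rate follows directly from its verbal definition. The sketch below is one reading of it, with illustrative label strings and function name. BII is not sketched because the text does not give its formula.

```python
def systematic_bias_rate(gold, pred):
    """Share of examples with a false Against or a false None prediction."""
    n = len(gold)
    if n == 0:
        raise ValueError("need at least one example")
    false_against = sum(p == "AGAINST" and g != "AGAINST"
                        for g, p in zip(gold, pred))
    false_none = sum(p == "NONE" and g != "NONE"
                     for g, p in zip(gold, pred))
    # The two error types are disjoint, so their proportions simply add.
    return (false_against + false_none) / n


gold = ["FAVOR", "AGAINST", "NONE", "FAVOR"]
pred = ["AGAINST", "NONE", "NONE", "FAVOR"]
print(systematic_bias_rate(gold, pred))
```

Tracking the two components separately within each SICI bin would show which of the two fallback behaviors dominates in that bin.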

### Surface Baselines\.

To test whether SICI adds information beyond simple proxies, we compare against text length, target visibility as lexical target coverage, type-token ratio, and negation density. These baselines represent common surface-level difficulty hypotheses: longer texts may contain more evidence, low target mention rate may make inference harder, lexical diversity may increase complexity, and negation may confuse polarity-based decisions.
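The four surface proxies could be computed roughly as below. The tokenizer, the negation word list, and the definition of target coverage as the fraction of target tokens found in the text are all assumptions made for this sketch; the paper does not specify them.

```python
import re

# Illustrative negation cues; the paper does not list the ones it uses.
NEGATION_WORDS = {"not", "no", "never", "none", "nobody", "nothing",
                  "neither", "nor", "cannot"}


def tokenize(text):
    """Lowercase and split into alphanumeric tokens, keeping apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def surface_features(text, target):
    """Return length, lexical target coverage, type-token ratio, negation density."""
    tokens = tokenize(text)
    target_tokens = set(tokenize(target))
    n = len(tokens)
    if n == 0:
        return {"length": 0, "target_coverage": 0.0,
                "type_token_ratio": 0.0, "negation_density": 0.0}
    present = set(tokens)
    covered = sum(t in present for t in target_tokens)
    negations = sum(t in NEGATION_WORDS or t.endswith("n't") for t in tokens)
    return {
        "length": n,
        "target_coverage": covered / len(target_tokens) if target_tokens else 0.0,
        "type_token_ratio": len(present) / n,
        "negation_density": negations / n,
    }


print(surface_features("I do not support the climate bill, not at all",
                       "climate bill"))
```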

### Interventions\.

For the high-complexity SemEval subset (SICI $\geq 0.70$, $N=187$), we evaluate 15 inference-time interventions: self-consistency, few-shot prompting, self-reflection, counterfactual reasoning, target decomposition, generated knowledge, multi-agent debate, evidence chaining, SICI-neighbor few-shot retrieval, dimension-targeted routing, Wikipedia RAG, RAG plus scope routing, cultural-camp debate, and cultural-camp debate with RAG. This set spans prompting, retrieval, routing, and debate-style methods.

The intervention suite is designed to distinguish three possible explanations for high-SICI failure. If the bottleneck is unstable decoding, self-consistency should help. If it is missing reasoning, chain-of-thought, reflection, counterfactual analysis, or target decomposition should help. If it is missing information, few-shot retrieval or Wikipedia RAG should help. Failure across all three families would instead support the stronger interpretation that many Phase-3 examples are underdetermined by the available text-target pair.

## 5 Results

### SICI Predicts Accuracy Better Than Surface Proxies.

On the merged non-None SemEval+VAST set ($N=1{,}960$), SICI has the strongest and directionally correct relationship with accuracy. At the sample level, the point-biserial correlation between SICI and correctness is $r=-0.2405$ ($p=3.4\times 10^{-27}$). At the binned level, the Spearman correlation is $r=-0.9515$ ($p=2.3\times 10^{-5}$). Table [3](https://arxiv.org/html/2606.13189#S5.T3) shows that surface baselines are either weaker or directionally misleading. Target visibility has a high positive binned correlation because explicit targets are easier, but it captures only one dimension and cannot explain high-SICI collapse caused by scope and pragmatic ambiguity.
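The sample-level statistic can be illustrated with the textbook point-biserial formula, which is equivalent to Pearson's $r$ between a continuous variable and a 0/1 variable. The function name and toy data below are hypothetical, and the sketch does not compute p-values.

```python
from statistics import mean, pstdev


def point_biserial(scores, correct):
    """Point-biserial correlation between continuous scores and 0/1 correctness."""
    ones = [s for s, c in zip(scores, correct) if c]
    zeros = [s for s, c in zip(scores, correct) if not c]
    if not ones or not zeros:
        raise ValueError("both outcome groups must be non-empty")
    n = len(scores)
    spread = pstdev(scores)  # population standard deviation of all scores
    # r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2)
    return (mean(ones) - mean(zeros)) / spread * (len(ones) * len(zeros) / n ** 2) ** 0.5


# Toy data: correct answers cluster at low SICI, so the correlation is negative.
toy_scores = [0.1, 0.2, 0.8, 0.9]
toy_correct = [1, 1, 0, 0]
print(round(point_biserial(toy_scores, toy_correct), 4))
```

A negative value means that higher SICI goes with lower correctness, matching the sign convention of Table 3.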

| Metric | Sample $r$ | Binned $\rho$ |
| --- | --- | --- |
| SICI | -0.2405 | -0.9515 |
| Target visibility | +0.1328 | +0.9245 |
| Text length | +0.0648 | +0.6991 |
| Type-token ratio | -0.0508 | -0.4877 |
| Negation density | -0.0174 | +0.0813 |

Table 3: Correlation with correctness on merged non-None SemEval+VAST instances. Negative values mean that higher difficulty predicts lower accuracy.

### A Dual-Fixation Regime Shift.

Figure [1](https://arxiv.org/html/2606.13189#S5.F1) summarizes the core phenomenon. As SICI increases, model performance decreases, but the more diagnostic signal is the systematic prediction-bias rate. Lower-complexity errors are dominated by Against over-prediction. Around SICI $\approx 0.45$, accuracy reaches a trough: Against fixation begins to fail, while None escape has not yet become dominant. Above roughly 0.70, None predictions rise sharply. A segmented-regression comparison confirms that the three-regime structure fits better than a linear trend ($F(2,14)=16.82$, $p=1.89\times 10^{-4}$).

The fit improvement is large in absolute terms. On 18 SICI bins, the residual sum of squares drops from 0.2141 for a single linear model to 0.0629 for the two-breakpoint segmented model, a 70.6% reduction. The first breakpoint is stable around 0.45 across GPT-3.5-turbo, GPT-4o-mini, and DeepSeek-V3 (0.425–0.450). The second breakpoint varies more (0.700–0.800), suggesting that models differ mainly in when they begin to rely on None escape.
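The reported statistic can be checked from the two residual sums of squares. The sketch below assumes a standard nested-model F-test in which the segmented model has two more parameters than the linear one and leaves 14 residual degrees of freedom on 18 bins, which is consistent with the reported $F(2,14)$. The function name is illustrative.

```python
def nested_f_statistic(rss_reduced, rss_full, extra_params, resid_df):
    """F statistic comparing a reduced model against a fuller nested model."""
    return ((rss_reduced - rss_full) / extra_params) / (rss_full / resid_df)


rss_linear = 0.2141     # single linear model, as reported
rss_segmented = 0.0629  # two-breakpoint segmented model, as reported

f_stat = nested_f_statistic(rss_linear, rss_segmented,
                            extra_params=2, resid_df=14)
reduction = (rss_linear - rss_segmented) / rss_linear

print(round(f_stat, 2), round(100 * reduction, 1))
```

Under these assumptions the rounded residual sums give $F \approx 16.83$ and a 70.6% reduction. The small gap to the reported 16.82 is what one would expect from rounding the residual sums to four decimals.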

![Refer to caption](https://arxiv.org/html/2606.13189v1/figures/fig1a_favg_cross_models.png) (a) Official SemEval $F_{\mathrm{avg}}$ across four models.
![Refer to caption](https://arxiv.org/html/2606.13189v1/figures/fig1b_systematic_bias.png) (b) Systematic prediction-bias rate for GPT-4o-mini.

Figure 1: User-provided main regime-shift figures. As SICI increases, stance prediction quality declines across models, while systematic bias reveals two critical jumps: Against fixation near $b_1=0.45$ and None escape near $b_2=0.70$.

This pattern motivates an attribution–abstention account of the failure. The model is not simply uncertain. It falls back on different heuristics in different regions: first a tendency to treat stance-bearing social media text as opposition, then a tendency to abstain with None when the target–text relation becomes indirect.

Table [4](https://arxiv.org/html/2606.13189#S5.T4) makes the shift concrete. The low-SICI bins contain many Against predictions and high accuracy. The boundary bin at 0.4–0.5 is the weakest point: accuracy drops to 42.3%, while Against remains common and None has not yet become the dominant output. After 0.7, the model’s output distribution changes qualitatively. None predictions rise from 54.5% in the 0.7–0.8 bin to 75.0% above 0.8, while Favor nearly disappears. This is why a single monotone “harder means lower accuracy” story is incomplete: the same increasing complexity first produces over-commitment to opposition and then over-abstention.

| SICI bin | $N$ | % Against | % None | Acc. |
| --- | --- | --- | --- | --- |
| 0.0–0.1 | 24 | 4.2 | 0.0 | 87.5 |
| 0.1–0.2 | 599 | 61.1 | 1.2 | 85.8 |
| 0.2–0.3 | 158 | 61.4 | 1.9 | 71.5 |
| 0.5–0.6 | 250 | 70.0 | 16.8 | 57.2 |
| 0.6–0.7 | 337 | 66.8 | 29.4 | 52.2 |
| 0.7–0.8 | 235 | 43.4 | 54.5 | 38.3 |
| 0.8–1.0 | 64 | 23.4 | 75.0 | 37.5 |

Table 4: SemEval GPT-4o-mini prediction trajectory by SICI bin. Percent columns are model prediction rates, not gold label rates.

Figure [2](https://arxiv.org/html/2606.13189#S5.F2) shows two complementary views. Raw accuracy exhibits a partial rebound after the first boundary because None becomes more frequent in some high-SICI bins. This rebound is misleading if interpreted as easier inference: a model can appear more accurate simply because the label prior has shifted toward None. The official $F_{\mathrm{avg}}$ view and the systematic-bias view avoid this artifact. The former ignores None when computing SemEval quality; the latter directly measures when the model predicts Against or None against the gold label. Together, the four views show why the regime-shift claim is not a visual artifact of one metric.

![Refer to caption](https://arxiv.org/html/2606.13189v1/figures/figA_accuracy_cross_models.png) (a) Accuracy by SICI bin across four models.
![Refer to caption](https://arxiv.org/html/2606.13189v1/figures/figA_favg_cross_models_alt.png) (b) Alternative $F_{\mathrm{avg}}$ view excluding None.

Figure 2: Additional user-provided visualizations of the same regime structure. Accuracy alone can obscure the transition because high-SICI bins often contain more None labels; $F_{\mathrm{avg}}$ and systematic-bias rate isolate the stance-inference failure more directly.

### Stronger Models Move Boundaries but Do Not Remove the Regime.

GPT-4o substantially improves overall performance compared with GPT-4o-mini, but the SICI relationship remains significant. Table [5](https://arxiv.org/html/2606.13189#S5.T5) shows that GPT-4o reduces the lower-complexity Against peak and improves the 0.4–0.5 boundary region, but its high-SICI None rate is even higher. Thus scale or capability changes the failure profile; it does not erase complexity-conditioned behavior.

The cross-model pattern also clarifies what “stronger” means in this setting. GPT-4o is better at recovering explicit and moderately implicit stances, but its improvement is not equivalent to solving the inference chain. The first breakpoint moves left, which means the model exits the Against-fixation regime earlier. The second breakpoint moves right, which means the model postpones its strongest None escape. Yet the correlation remains negative, and the high-SICI region is still governed by abstention behavior. Capability therefore stretches the phase boundaries rather than flattening the phase diagram.

| Measure | GPT-4o-mini | GPT-4o |
| --- | --- | --- |
| Overall macro-F1 | $\sim 0.61$ | 0.737 |
| Against peak (0.3–0.4) | 81.4 | 51.4 |
| Accuracy at 0.4–0.5 | 42.3 | 70.3 |
| None rate at 0.7–0.8 | 54.5 | 84.3 |
| First boundary $b_1$ | 0.450 | 0.250 |
| Second boundary $b_2$ | 0.700 | 0.825 |
| SICI–accuracy $\rho$ | -0.243 | -0.191 |

Table 5: Stronger models shift the regime boundaries but preserve a significant relationship between SICI and accuracy. Percent-valued rows omit percent signs for compactness.

### Cross-Dataset Validation Separates Two Kinds of None.

VAST provides an important contrast. Its high-SICI region has high overall accuracy because the gold labels are mostly None: 250 of 288 Phase-3 examples (86.8%). This could appear to contradict the SemEval high-SICI failure pattern. However, after separating non-None cases, the difficulty remains: VAST Phase-3 non-None accuracy is 0.579, and merged SemEval+VAST non-None accuracy drops monotonically from 0.755 to 0.605 to 0.323 across the three SICI phases.

| Phase | $N$ | Accuracy |
| --- | --- | --- |
| SICI $< 0.45$ | 1,069 | 0.755 |
| $0.45 \leq$ SICI $< 0.70$ | 760 | 0.605 |
| SICI $\geq 0.70$ | 131 | 0.323 |

Table 6: Merged SemEval+VAST non-None accuracy by SICI phase.

This comparison reveals a theoretical distinction. In SemEval Phase 3, many Against instances are incorrectly mapped to None: 68 of 87 Against cases become None. In VAST Phase 3, None is often a reasonable abstention because the target–text relation is genuinely underspecified. SICI therefore separates two superficially similar outputs: *failed inference fixation* and *information-insufficient abstention*.

Additional dimension-level analysis, reported in Appendix [B](https://arxiv.org/html/2606.13189#A2), shows that target visibility and scope alignment are the strongest individual drivers, while the remaining dimensions help distinguish target sparsity from genuinely complex stance inference.

### Inference\-Time Interventions Hit a Ceiling\.

Table [7](https://arxiv.org/html/2606.13189#S5.T7) shows the high-SICI intervention results. The best methods are still zero-shot and self-consistency at 58.3%. Wikipedia RAG comes closest at 57.8%, showing that real external knowledge is somewhat better than generated knowledge but still does not exceed the baseline. SICI-guided retrieval and dimension-targeted routing underperform zero-shot. Multi-agent debate is harmful, and cultural-camp debate collapses by over-predicting Favor.

| Method | Acc. | $\Delta$ vs. ZS |
| --- | --- | --- |
| Zero-shot | 58.3 | – |
| Self-consistency | 58.3 | 0.0 |
| Wikipedia RAG | 57.8 | -0.5 |
| Target decomposition | 55.1 | -3.2 |
| Dimension routing | 52.9 | -5.4 |
| Self-reflection | 50.8 | -7.5 |
| SICI retrieval | 49.2 | -9.1 |
| Few-shot | 47.1 | -11.2 |
| Multi-agent debate | 47.1 | -11.2 |
| Cultural-camp debate | 31.0 | -27.3 |

Table 7: Representative intervention results on SICI $\geq 0.70$ SemEval examples ($N=187$).

These failures clarify the nature of the ceiling. Additional reasoning does not help when the missing ingredient is not a reasoning step but an underdetermined pragmatic link between the text and target. Retrieval helps only marginally because the most difficult cases often require conversational, author-specific, or discourse-level context that generic Wikipedia knowledge cannot supply.

The failures also reveal a label\-prior pendulum\. Some interventions, such as indirect inference and scope clarification, reduce Against errors but over\-correct toward None\. In the original local analysis, indirect stance inference predicts None for 232 of 299 triggered cases \(77\.6%\), although the gold None rate is only 56%\. Scope clarification is more extreme: 142 of 154 triggered cases are predicted as None \(92\.2%\), while the gold None rate is 45\.5%\. Debate\-style prompting moves in the opposite direction in some variants\. Cultural\-camp debate predicts Favor 105 times even though only five gold examples are Favor, and its RAG\-augmented variant preserves almost the same skew\. These shifts can improve global macro\-F1 slightly when they counteract a dataset\-level bias, but they do not solve the local high\-SICI decision\. A useful intervention for stance detection must therefore do more than change a model’s prior over labels; it must calibrate the strength of the target–text link\.
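The overshoot toward None can be checked directly from the counts quoted above. The following is a minimal sketch: the helper name `rate` and the variable names are illustrative, and only the counts and gold rates come from the text.

```python
def rate(count: int, total: int) -> float:
    """Return count / total as a percentage, rounded to one decimal place."""
    return round(100.0 * count / total, 1)

# Predicted-None counts on triggered cases, as reported in the text.
isi_pred_none = rate(232, 299)  # indirect stance inference
sc_pred_none = rate(142, 154)   # scope clarification

# Gold None rates (percent), as reported in the text.
isi_gold_none = 56.0
sc_gold_none = 45.5

# Overshoot toward None, in percentage points.
isi_overshoot = round(isi_pred_none - isi_gold_none, 1)
sc_overshoot = round(sc_pred_none - sc_gold_none, 1)

print(f"ISI: predicted None {isi_pred_none}%, overshoot {isi_overshoot} pp")
print(f"SC:  predicted None {sc_pred_none}%, overshoot {sc_overshoot} pp")
```

Comparing the predicted rate of a label against its gold rate on the same subset is one simple way to see whether an intervention has moved the label prior rather than the decision quality.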

Appendix [B](https://arxiv.org/html/2606.13189#A2) further reports routing and confidence analyses: SICI\-guided routes can improve global macro\-F1 by shifting label priors, but they do not solve the local high\-SICI subset, and high complexity is not reducible to low confidence\.

## 6 Discussion

### SICI as a diagnostic rather than a leaderboard metric\.

SICI should not replace task metrics\. Instead, it complements them by identifying where a model’s aggregate score comes from\. Two models with similar macro\-F1 may differ in whether they fail through Against fixation, None escape, or boundary\-region confusion\. Reporting performance stratified by SICI would make stance evaluation more informative\.
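Stratified reporting of this kind can be sketched in a few lines. The example below assumes each record is a `(sici_score, gold_label, predicted_label)` tuple and uses the thresholds 0.45 and 0.70 given in Appendix A; the function names and the toy records are hypothetical, not the authors' evaluation code.

```python
from collections import defaultdict

def sici_phase(score: float, b1: float = 0.45, b2: float = 0.70) -> int:
    """Map a SICI score to Phase 1, 2, or 3 using the Appendix A thresholds."""
    if score < b1:
        return 1
    if score < b2:
        return 2
    return 3

def accuracy_by_phase(records):
    """Return {phase: accuracy} for (sici_score, gold, pred) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for score, gold, pred in records:
        phase = sici_phase(score)
        totals[phase] += 1
        hits[phase] += int(gold == pred)
    return {phase: hits[phase] / totals[phase] for phase in sorted(totals)}

# Toy records for illustration only; these are not data from the paper.
toy_records = [
    (0.20, "Against", "Against"),
    (0.30, "Favor", "Against"),
    (0.50, "None", "None"),
    (0.90, "Against", "None"),
]
phase_accuracy = accuracy_by_phase(toy_records)
print(phase_accuracy)
```

Reporting the per-phase dictionary next to the aggregate score shows whether two systems with similar totals fail in the same region.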

This diagnostic role is especially important for prompt\-based evaluation\. A low score can arise because the model lacks task competence, because the prompt induces the wrong label prior, or because the instance itself provides too little target\-conditioned evidence\. SICI helps separate these cases by making the source of difficulty inspectable: the target may be invisible, the text may be off\-scope, the stance may be pragmatic, external knowledge may be required, or the gold label may be intrinsically ambiguous\. This turns error analysis from a post\-hoc list of mistakes into a structured account of where the target–text inference chain breaks\.

### Why high\-complexity prompting fails\.

The intervention results suggest a practical warning for LLM\-based stance systems\. Prompting strategies often shift label priors rather than improve fine\-grained calibration\. Indirect\-inference and scope\-clarification prompts reduce one kind of error but overshoot toward None; debate prompts can introduce new hallucinated stances\. This explains the “pendulum” effect observed in our experiments: interventions swing the model from one fixation mode to another\.

The failure is therefore not simply that the tested prompts are weak\. Many interventions add exactly the resources that prompt engineering usually assumes to be helpful: explicit reasoning, multiple samples, retrieved knowledge, or disagreement among agents\. Their limited effect suggests that the hard cases often require information that is not recoverable from the visible text\-target pair, or require pragmatic commitments that the model cannot license without over\-interpreting the author\. In such cases, a better prompt may only choose a different trade\-off between false attribution and false abstention\.

### Toward model\-level and context\-level solutions\.

If high\-SICI failures are caused by missing pragmatic context or ambiguous target–text relations, inference\-time prompting may be insufficient\. More promising directions include SICI\-stratified training, calibrated abstention policies, retrieval of conversational context rather than encyclopedic background, and human\-in\-the\-loop treatment of intrinsically ambiguous cases\.

### Implications for benchmark construction\.

The analysis also suggests that stance benchmarks should report their complexity distribution\. A dataset dominated by low\-SICI examples primarily tests direct target matching and sentiment\-to\-stance mapping\. A dataset with many high\-SICI examples tests pragmatic inference, context recovery, and abstention calibration\. Without this distribution, two datasets with the same label set can evaluate very different skills\. Reporting SICI histograms and phase\-stratified scores would make cross\-dataset comparisons less dependent on hidden differences in target visibility and topic alignment\.

The per\-target results from the original analysis illustrate this point\. In VAST, the *election* target is especially difficult: all evaluated models are near 26–34% macro\-F1, close to random\-level behavior for a three\-way task\. By contrast, P\-Stance removes the None label and produces much higher scores for stronger models\. These differences are not just dataset names; they reflect distinct inference regimes\. A hard benchmark for stance detection should therefore be constructed from high\-SICI non\-None examples rather than from examples that are merely label\-balanced\.

### Implications for LLM evaluation\.

For LLMs, the most concerning failures are not always low\-confidence errors\. In our analysis, high\-SICI examples often elicit confident but wrong predictions, especially around the boundary where Against fixation and None escape compete\. This behavior matters for downstream use: a stance system that returns a single label without exposing instance complexity may be least reliable exactly when its output looks decisive\. SICI can therefore serve as a triage signal: low\-complexity predictions may be used normally, intermediate cases may require calibration checks, and high\-complexity cases should trigger abstention, context retrieval, or human review\.
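The triage idea can be expressed as a small policy function. This is a sketch under assumptions: it reuses the phase thresholds 0.45 and 0.70 from Appendix A as the triage cut points, and the action names are hypothetical labels for the three behaviours described above.

```python
def triage(sici_score: float, b1: float = 0.45, b2: float = 0.70) -> str:
    """Return a handling action for a prediction given its SICI score.

    The cut points default to the phase thresholds from Appendix A; the
    action strings are illustrative names, not part of the paper.
    """
    if sici_score < b1:
        return "use_prediction"        # low complexity: use the label normally
    if sici_score < b2:
        return "calibration_check"     # transition band: verify calibration
    return "abstain_or_review"         # high complexity: abstain, retrieve context, or escalate

for score in (0.20, 0.55, 0.85):
    print(score, triage(score))
```

A deployed system would likely tune the cut points on its own data, since the paper notes that stronger models move the phase boundaries.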

## 7 Conclusion

We introduced SICI, a semantic\-pragmatic complexity index for diagnosing LLM stance detection\. Across datasets and models, SICI predicts where LLMs fail and reveals an attribution–abstention regime shift from low\-complexity over\-prediction to high\-complexity None escape\. Stronger models move the transition boundaries but do not eliminate the phenomenon, and prompting, retrieval, debate, and voting interventions often shift the model’s operating point rather than remove the high\-complexity bottleneck\. The main implication is that stance detection errors should be analyzed as structured failures of target\-conditioned inference rather than as undifferentiated classification mistakes\. Future stance systems should therefore combine model improvements with complexity\-aware evaluation, calibrated abstention, and richer context acquisition\.

## Limitations

First, the main SICI scores are LLM\-generated\. Although three independent LLM scorers show substantial agreement, human annotation is needed to fully validate the scale and rule out shared model biases\. Second, the experiments focus on English social\-media datasets; long\-form, multilingual, and multimodal stance settings may exhibit different dimension weights\. Third, the high\-SICI non\-None subset is relatively small, especially in VAST, so future work should construct larger hard\-instance benchmarks\. Finally, our intervention study is inference\-time only; it does not test whether supervised fine\-tuning on SICI\-stratified data can reduce the high\-complexity ceiling\.

## Ethical Considerations

Stance detection can be used for beneficial analysis of public discourse, but also for political profiling, surveillance, and manipulation\. Our work is diagnostic and does not release a new classifier intended for deployment\. We use publicly available benchmark datasets and API\-accessible LLMs under their respective terms of use, and we do not redistribute dataset contents or model weights\. Because high\-SICI instances are often ambiguous or context\-dependent, automated decisions on such cases should not be treated as ground truth\. Systems using stance predictions in sensitive settings should report uncertainty, allow human review, and avoid inferring personal beliefs from sparse or indirect text\.

## References

- Stance detection on social media: state of the art and trends\.Information Processing & Management58\(4\),pp\. 102597\.External Links:[Document](https://dx.doi.org/10.1016/j.ipm.2021.102597),[Link](https://www.sciencedirect.com/science/article/pii/S0306457321000960)Cited by:[§1](https://arxiv.org/html/2606.13189#S1.p1.1)\.
- E\. Allaway and K\. R\. McKeown \(2020\)Zero\-shot stance detection: A dataset and model using generalized topic representations\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16\-20, 2020,pp\. 8913–8931\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1),[§4](https://arxiv.org/html/2606.13189#S4.SS0.SSS0.Px1.p1.1)\.
- E\. Allaway, M\. Srikanth, and K\. Mckeown \(2021\)Adversarial learning for zero\-shot stance detection on social media\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 4756–4767\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),pp\. 4171–4186\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Ding, G\. Dai, C\. Peng, X\. Peng, B\. Zhang, and H\. Huang \(2024a\)Distantly supervised explainable stance detection via chain\-of\-thought supervision\.Mathematics12\(7\)\.External Links:[Link](https://www.mdpi.com/2227-7390/12/7/1119),ISSN 2227\-7390,[Document](https://dx.doi.org/10.3390/math12071119)Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Ding, X\. Fu, X\. Peng, X\. Fan, H\. Huang, and B\. Zhang \(2024b\)Leveraging chain\-of\-thought to enhance stance detection with prompt\-tuning\.Mathematics12\(4\)\.External Links:[Link](https://www.mdpi.com/2227-7390/12/4/568),ISSN 2227\-7390,[Document](https://dx.doi.org/10.3390/math12040568)Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Du, R\. Xu, Y\. He, and L\. Gui \(2017\)Stance classification with target\-specific neural attention\.InProceedings of the Twenty\-Sixth International Joint Conference on Artificial Intelligence, IJCAI\-17,pp\. 3988–3994\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Dubreuil, A\. Gourru, C\. Largeron, and A\. Trabelsi \(2025\)Are stereotypes leading LLMs’ zero\-shot stance detection ?\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,Suzhou, China,pp\. 31517–31530\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1605/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1605),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.13189#S1.p2.1),[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Gatto, O\. Sharif, and S\. Preum \(2023\)Chain\-of\-thought embeddings for stance detection on social media\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 4154–4161\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.273/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.273)Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Ji, J\. Ning, Y\. Zhang, Z\. Liu, and H\. Lin \(2025\)LLM\-driven implicit target augmentation and fine\-grained contextual modeling for zero\-shot and few\-shot stance detection\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,Suzhou, China,pp\. 5872–5884\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.299/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.299),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.13189#S1.p2.1),[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Li, B\. Liang, J\. Zhao, B\. Zhang, M\. Yang, and R\. Xu \(2023a\)Stance detection on social media with background knowledge\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 15703–15717\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.972),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.972)Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Li, T\. Sosea, A\. Sawant, A\. J\. Nair, D\. Inkpen, and C\. Caragea \(2021\)P\-stance: a large dataset for stance detection in political domain\.InFindings of the Association for Computational Linguistics: ACL\-IJCNLP 2021,pp\. 2355–2365\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Li, H\. He, S\. Wang, F\. C\. Lau, and Y\. Song \(2023b\)Improved target\-specific stance detection on social media platforms by delving into conversation threads\.IEEE Transactions on Computational Social Systems\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Li, D\. Wen, H\. He, J\. Guo, X\. Ning, and F\. C\. M\. Lau \(2023c\)Contextual target\-specific stance detection on twitter: dataset and method\.In2023 IEEE International Conference on Data Mining \(ICDM\),Vol\.,pp\. 359–367\.External Links:[Document](https://dx.doi.org/10.1109/ICDM58522.2023.00045)Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Liang, Z\. Chen, L\. Gui, Y\. He, M\. Yang, and R\. Xu \(2022\)Zero\-shot stance detection via contrastive learning\.InProceedings of the ACM Web Conference 2022,pp\. 2738–2747\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Liang, Y\. Fu, L\. Gui, M\. Yang, J\. Du, Y\. He, and R\. Xu \(2021\)Target\-adaptive graph for cross\-target stance detection\.InProceedings of the Web Conference 2021,pp\. 3453–3464\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Liu, Z\. Lin, Y\. Tan, and W\. Wang \(2021\)Enhancing zero\-shot and few\-shot stance detection with commonsense knowledge graph\.InFindings of the Association for Computational Linguistics: ACL\-IJCNLP 2021,pp\. 3152–3157\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Mohammad, S\. Kiritchenko, P\. Sobhani, X\. Zhu, and C\. Cherry \(2016\)Semeval\-2016 task 6: detecting stance in tweets\.InProceedings of the 10th international workshop on semantic evaluation \(SemEval\-2016\),pp\. 31–41\.Cited by:[§1](https://arxiv.org/html/2606.13189#S1.p1.1),[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1),[§4](https://arxiv.org/html/2606.13189#S4.SS0.SSS0.Px1.p1.1)\.
- D\. Q\. Nguyen, T\. Vu, and A\. T\. Nguyen \(2020\)BERTweet: A pre\-trained language model for english tweets\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 \- Demos, Online, November 16\-20, 2020,pp\. 9–14\.External Links:[Link](https://doi.org/10.18653/v1/2020.emnlp-demos.2),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-demos.2)Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Taranukhin, V\. Shwartz, and E\. Milios \(2024\)Stance reasoner: zero\-shot stance detection on social media with explicit reasoning\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 15257–15272\.External Links:[Link](https://aclanthology.org/2024.lrec-main.1326/)Cited by:[§1](https://arxiv.org/html/2606.13189#S1.p2.1),[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Wei, Y\. Tay, R\. Bommasani, C\. Raffel, B\. Zoph, S\. Borgeaud, D\. Yogatama, M\. Bosma, D\. Zhou, D\. Metzler,et al\.\(2022a\)Emergent abilities of large language models\.arXiv preprint arXiv:2206\.07682\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022b\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in Neural Information Processing Systems35,pp\. 24824–24837\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Wei, J\. Lin, and W\. Mao \(2018\)Multi\-target stance detection via a dynamic memory\-augmented network\.InThe 41st International ACM SIGIR Conference on Research & Development in Information Retrieval,pp\. 1229–1232\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Wei and W\. Mao \(2019\)Modeling transferable topics for cross\-target stance detection\.InProceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 1173–1176\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Weinzierl and S\. Harabagiu \(2024\)Tree\-of\-counterfactual prompting for zero\-shot stance detection\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 861–880\.External Links:[Link](https://aclanthology.org/2024.acl-long.49/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.49)Cited by:[§1](https://arxiv.org/html/2606.13189#S1.p2.1)\.
- B\. Zhang, M\. Yang, X\. Li, Y\. Ye, X\. Xu, and K\. Dai \(2020\)Enhancing cross\-target stance detection with transferable semantic\-emotion knowledge\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 3188–3197\.Cited by:[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Zhang, Y\. Li, J\. Zhang, and H\. Xu \(2024\)LLM\-driven knowledge injection advances zero\-shot and cross\-target stance detection\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 2: Short Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 371–378\.External Links:[Link](https://aclanthology.org/2024.naacl-short.32/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-short.32)Cited by:[§1](https://arxiv.org/html/2606.13189#S1.p2.1),[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Zhang, J\. Zhang, H\. Xu, J\. Guo, and X\. Cheng \(2025\)MPRF: interpretable stance detection through multi\-path reasoning framework\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,Suzhou, China,pp\. 454–470\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.24/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.24),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.13189#S1.p2.1),[§2](https://arxiv.org/html/2606.13189#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix A Additional Experimental Details

### Inference\-chain interpretation\.

The seven SICI dimensions can be read as a multi\-stage stance inference chain\. A model must identify the target \($V$\), decide whether the text’s scope bears on that target \($S$\), decode pragmatic cues \($P$\), retrieve needed knowledge \($K$\), integrate missing context \($C$\), resolve label ambiguity \($A$\), and bridge sentiment polarity to stance \($G$\)\. A severe failure at any stage can derail the final label\.
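As an illustration, the seven dimensions can be held in one record. Two things in this sketch are assumptions rather than the paper's confirmed definition: the 0 to 4 range (suggested by the $V=0$ to $V=4$ levels reported in Appendix B) and the aggregation rule, a normalized equal-weight mean suggested by the "equal\-weight setting" mentioned there. The class and method names are hypothetical.

```python
from dataclasses import dataclass, astuple

MAX_LEVEL = 4  # ASSUMPTION: each dimension is scored on a 0..4 scale

@dataclass(frozen=True)
class SiciDimensions:
    V: int  # target visibility
    S: int  # scope alignment
    P: int  # pragmatic implicitness
    K: int  # knowledge requirement
    C: int  # context dependence
    A: int  # label ambiguity
    G: int  # polarity-stance gap

    def equal_weight_score(self) -> float:
        """ASSUMPTION: mean of the seven levels, rescaled to [0, 1]."""
        levels = astuple(self)
        if any(not 0 <= level <= MAX_LEVEL for level in levels):
            raise ValueError("each dimension must lie in 0..MAX_LEVEL")
        return sum(levels) / (MAX_LEVEL * len(levels))

# Hypothetical example: target invisible and text off-scope, other dimensions easy.
example = SiciDimensions(V=4, S=4, P=0, K=0, C=0, A=0, G=0)
print(round(example.equal_weight_score(), 3))
```

Under this assumed rule, a severe break in only two dimensions still yields a fairly low aggregate score, which is one reason the per\-dimension values are worth inspecting alongside the index.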

### Phase definitions\.

We use $b_1 = 0.45$ and $b_2 = 0.70$ as empirical transition points, yielding Phase 1 \(SICI < 0\.45\), Phase 2 \(0\.45 ≤ SICI < 0\.70\), and Phase 3 \(SICI ≥ 0\.70\)\. These thresholds are selected from the SemEval segmented\-regression analysis and then reused for VAST\.
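Written as a single piecewise map, the phase assignment for a score $x = \mathrm{SICI}$ reads as follows. The function name $\mathrm{phase}$ is introduced here for illustration only.

```latex
\mathrm{phase}(x) =
\begin{cases}
  1, & x < b_1, \\
  2, & b_1 \le x < b_2, \\
  3, & x \ge b_2,
\end{cases}
\qquad b_1 = 0.45,\quad b_2 = 0.70 .
```

Both boundaries are closed on the left, so a score of exactly 0.45 falls in Phase 2 and a score of exactly 0.70 falls in Phase 3.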

### Full intervention set\.

The full Phase\-3 intervention suite includes zero\-shot, self\-consistency, few\-shot, self\-reflection, counterfactual reasoning, target decomposition, generated knowledge, standard multi\-agent debate, evidence chaining, SICI\-nearest\-neighbor few\-shot retrieval, dimension\-targeted adaptive prompting, Wikipedia RAG, RAG plus scope routing, cultural\-camp debate, and cultural\-camp debate plus RAG\. Evidence chaining is excluded from substantive comparisons because a formatting failure caused label parsing to collapse\.

## Appendix B Additional Dimension, Routing, and Confidence Analyses

### Dimension\-level gradients\.

Single\-dimension analysis on SemEval indicates that target visibility and scope alignment are the strongest drivers\. Accuracy drops from 0\.909 to 0\.607 between $V=0$ and $V=4$, and from 0\.886 to 0\.627 between $S=0$ and $S=4$\. Their interaction is especially diagnostic: in the SemEval $V=4, S=4$ cell \($N=247$\), the model predicts None for 85\.8% of examples while the gold None rate is only 50\.2%\. In VAST, the same $V=4, S=4$ pattern corresponds to a mostly valid None abstention: None\-prediction rate 95\.5% and gold None rate 96\.6%\. This suggests that the same inference\-chain break can either expose a dataset’s legitimate underspecification or reveal a model’s erroneous escape behavior\.

The remaining dimensions help explain why target visibility alone is insufficient\. Pragmatic implicitness captures cases where the target is visible but the stance is conveyed through irony, metaphor, or indirect evaluation\. Knowledge requirement captures cases where a stance depends on knowing an event, policy, or social relation\. Label ambiguity captures instances where the text may support multiple defensible interpretations\. Polarity–stance gap captures the familiar failure of treating sentiment as stance\. These dimensions are individually weaker than $V$ and $S$ in our SemEval analysis, but they are essential for distinguishing a merely target\-sparse example from a genuinely complex stance inference problem\.

Ablation results add a second perspective\. Removing $S$ or $P$ changes the binned correlation most strongly, indicating that scope alignment and pragmatic implicitness are central to the monotonic regime signal\. Removing $G$ slightly improves the correlation in our current equal\-weight setting, which suggests that polarity–stance gap is noisier than the other dimensions on these datasets\. We keep $G$ because the construct is theoretically important for stance detection, but the result indicates that future versions of SICI should refine how affective polarity is separated from target\-conditioned stance\.

| Dimension | Low acc\. | High acc\. | Drop |
| --- | --- | --- | --- |
| $V$: target visibility | 0\.909 | 0\.607 | \-0\.302 |
| $S$: scope alignment | 0\.886 | 0\.627 | \-0\.258 |
| $K$: knowledge need | 0\.769 | 0\.549 | \-0\.220 |
| $C$: context dependence | 0\.746 | 0\.600 | \-0\.146 |

Table 8: Representative SemEval single\-dimension gradients\. “Low” and “high” refer to low\- versus high\-complexity levels for that dimension; exact sample sizes vary by dimension\.

### SICI\-guided routing\.

The original experiments also tested staged routing on the full 2,899\-instance pool\. A simple CoT route triggered on SICI > 0\.6 improves global macro\-F1 from 0\.5916 to 0\.6132 \(\+2\.16 pp\)\. A more targeted ISI\+SC route, which applies indirect stance inference to high\-$V$ cases and scope clarification to high\-$S$ cases, reaches 0\.6209 \(\+2\.93 pp\)\. The best full routing policy combines ISI, SC, and CoT on 790 triggered examples \(27\.2% of the pool\), reaching 0\.6296 \(\+3\.80 pp\)\.
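A staged router of this shape could look like the sketch below. Only the SICI > 0.6 trigger for CoT is taken from the text. The cut\-off for what counts as "high" $V$ or $S$ (`high_level=3`), the priority order among the routes, and all function and route names are assumptions made for illustration.

```python
def route(sici: float, v: int, s: int,
          high_level: int = 3, cot_threshold: float = 0.6) -> str:
    """Pick a prompting route for one example.

    ASSUMPTIONS: `high_level` and the ISI > SC > CoT priority order are
    hypothetical; the text only states which cases each route targets.
    """
    if v >= high_level:
        return "ISI"        # indirect stance inference for high-V cases
    if s >= high_level:
        return "SC"         # scope clarification for high-S cases
    if sici > cot_threshold:
        return "CoT"        # chain-of-thought for remaining complex cases
    return "zero_shot"      # default route for untriggered examples

# Hypothetical examples: (sici, V, S)
for example in [(0.30, 4, 0), (0.30, 0, 4), (0.65, 1, 1), (0.20, 0, 0)]:
    print(example, route(*example))
```

Because the router only selects a prompt, any gain it produces still depends on how each route shifts the label prior, which is the limitation the following paragraphs discuss.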

This result prevents an overly pessimistic reading of the intervention study\. SICI can guide useful global routing: when the model’s dominant error is an overactive Against prior, shifting some examples toward more cautious reasoning can improve macro\-F1\. However, the same mechanism does not solve the high\-SICI subset\. Local ISI and SC scores fall below their local baselines because they overshoot toward None\. Thus SICI is valuable for diagnosis and routing, but routing is not equivalent to resolving the semantic\-pragmatic ceiling\.

This global\-local gap is useful diagnostically\. A routing policy can improve the dataset\-level score by correcting the most common prior error, but its gains depend on the distribution of phases in the evaluation set\. In a deployment setting where high\-SICI non\-None cases are the main concern, the same routing rule could be harmful\. We therefore treat routing as an operational use of SICI, not as evidence that the underlying semantic\-pragmatic ambiguity has been removed\.

### Confidence failure\.

High\-SICI examples are not simply low\-confidence cases\. BII is negatively correlated with SICI \($\rho = -0.308$\), indicating that the gap between confidence and correctness worsens as complexity rises\. The most problematic region is the transition band, where Against fixation has become unreliable but None escape has not yet become a calibrated abstention strategy\. In this region, the model can select a wrong label with apparent confidence\.

This finding changes how we interpret None\. A well\-calibrated model should reserve None for cases where the text does not support a stance toward the target\. In the high\-SICI SemEval subset, None often functions instead as an uncalibrated failure mode: the model gives up on implicit Against examples even when the gold label is not None\. In VAST, by contrast, high\-SICI None is often a valid response\. Confidence alone cannot distinguish these two cases; the target–text complexity structure is needed\.
