Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches

arXiv cs.CL Papers

Summary

This paper proposes a label-light measurement diagnostic to evaluate whether popular text analysis methods (dictionaries, topic models, embeddings, LLMs) capture substantive stance versus symbolic rhetoric in entrepreneurial-discourse measurement, using a corpus of 80 Chinese SOE speeches and a natural experiment with same-company different-speaker pairs. The authors find that zero-shot LLMs show higher sensitivity but a significant portion of the effect may be due to speaker idiolect rather than substantive stance.

arXiv:2605.29188v1 Announce Type: new Abstract: Dictionary methods, topic models, and embedding-similarity scorers are widely used in CSS and management research to measure constructs such as "entrepreneurial spirit" in corporate speeches. We contribute a label-light measurement diagnostic for such instruments rather than a new extraction model. On a corpus of 80 speeches by leaders of centrally administered Chinese state-owned enterprises, we exploit a natural experiment of 24 same-company different-speaker pairs and 5 same-company same-speaker pairs to test whether a method's per-document indices vary with leader identity holding firm constant. LDA fails (Cohen d=0.20, 95% CI [-0.72, 1.20]); a dictionary scorer reaches d=0.81 and a Chinese sentence encoder d=0.65 on doc-vector distances of order 10^-3. A zero-shot 9B open-weight LLM (Qwen3.5:9b) raises paired-contrast d to 1.09 (exact permutation p1=0.034). We downgrade three claims accordingly: gold F1 measures consistency with the LLM's own prompt rule rather than external construct recovery; doc-level style residualisation cuts the LLM's d to 0.43 (p1=0.22), so roughly half of the effect is consistent with leader idiolect; and a confidence-weighted calibration trades Delta for variance with an auto-mined slogan lexicon near-inert in ablation. We release the 2,190-segment scored corpus, the 170-paragraph pilot, the slogan lexicon, two-family LLM scores, and the evaluation harness.

Cached at: 05/29/26, 09:16 AM

# Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches
Source: [https://arxiv.org/html/2605.29188](https://arxiv.org/html/2605.29188)
###### Abstract

Dictionary methods, topic models, and embedding-similarity scorers are widely used in CSS and management research to measure constructs such as “entrepreneurial spirit” in corporate speeches. We contribute a *label-light measurement diagnostic* for such instruments rather than a new extraction model. On a corpus of 80 speeches by leaders of centrally administered Chinese state-owned enterprises, we exploit a natural experiment of 24 same-company different-speaker pairs and 5 same-company same-speaker pairs to test whether a method’s per-document indices vary with leader identity holding firm constant. LDA fails (Cohen $d=0.20$, 95% CI $[-0.72, 1.20]$); a dictionary scorer reaches $d=0.81$ and a Chinese sentence encoder $d=0.65$ on doc-vector distances of order $10^{-3}$. A zero-shot 9B open-weight LLM (Qwen3.5:9b) raises paired-contrast $d$ to $1.09$ (exact permutation $p_1=0.034$). We downgrade three claims accordingly: gold $F_1$ measures consistency with the LLM’s own prompt rule rather than external construct recovery; doc-level style residualisation cuts the LLM’s $d$ to $0.43$ ($p_1=0.22$), so roughly half of the effect is consistent with leader idiolect; and a confidence-weighted calibration trades $\Delta$ for variance with an auto-mined slogan lexicon near-inert in ablation. We release the 2,190-segment scored corpus, the 170-paragraph pilot, the slogan lexicon, two-family LLM scores, and the evaluation harness.


Ting Gong, Shangquan Sun 🖂
Tsinghua University
gongting@umich.edu

## 1 Introduction

Empirical work on entrepreneurship and corporate governance relies on textual indices derived from corporate disclosures and leadership addresses. Such indices serve as inputs to studies of firm performance, innovation, and political affiliation, and they often rest on one of three method families: keyword dictionaries Loughran and McDonald ([2011](https://arxiv.org/html/2605.29188#bib.bib1)); Baker et al. ([2016](https://arxiv.org/html/2605.29188#bib.bib2)); Huang and Luk ([2020](https://arxiv.org/html/2605.29188#bib.bib3)), latent topic models Blei et al. ([2003](https://arxiv.org/html/2605.29188#bib.bib4)), and pre-trained word or sentence embeddings Mikolov et al. ([2013](https://arxiv.org/html/2605.29188#bib.bib90)); Devlin et al. ([2019](https://arxiv.org/html/2605.29188#bib.bib88)). All three measure topical *coverage*: how much a document discusses innovation, risk, or sustainability. None was designed to separate the speaker’s *substantive* stance on a construct from the speaker’s *symbolic-rhetorical* performance about it.

This separation matters in discourse genres where symbolic performance is part of the institutional contract. Speeches by leaders of centrally administered Chinese state-owned enterprises (SOEs) are a clear example. An SOE chairperson’s address is expected to include canonical political invocations, for example the policy slogan *“cultivate world-class enterprises with global competitiveness”* or the doctrinal formulation *“state-owned enterprises are an important material and political foundation of socialism with Chinese characteristics”* (see Table [2](https://arxiv.org/html/2605.29188#S4.T2) for further entries), alongside operational reporting such as a specific divestiture, a named joint venture, or a numerically anchored R&D programme. The two coexist in the same speech, often in the same paragraph. A dictionary-based “innovation index” computed over such a speech is largely driven by the former, because slogans contain most of the keywords that a hand-curated dictionary recognises and are interchangeable across speeches.

We therefore ask:*do standard entrepreneurial\-discourse extraction methods, as widely deployed on Chinese corporate corpora, measure leader\-level stance, or do they measure recurrent political symbolism?*

To answer this without commissioning a new annotation round we exploit a property of our corpus: 29 of 51 companies appear in two interview waves\. Twenty\-four of those wave pairs involve a*change of speaker*at the same company; five involve the*same speaker*reappearing\. If a method captures leader\-level stance, its per\-document vectors should differ more across leader\-change pairs than across same\-leader pairs\. If a method captures the company’s industry topic or shared political ritual, the two distributions should be indistinguishable\.

Our contributions are:

1. A *leader-change paired evaluation* for stance extraction in performative corporate discourse that requires no in-domain annotation (§[5](https://arxiv.org/html/2605.29188#S5)).
2. An audit of four method families on 80 central-SOE speeches that shows that dictionary, LDA, and BGE-encoder methods fail the evaluation: their paired-contrast effect either crosses zero in its 95% bootstrap CI or is computed on doc-vector distances of order $10^{-3}$ (§[6](https://arxiv.org/html/2605.29188#S6)). A complementary *paraphrase-robustness* experiment confirms that the dictionary baseline *increases* its score when substantive content is rewritten in slogan style (§[6.6](https://arxiv.org/html/2605.29188#S6.SS6)).
3. A *confidence-weighted calibration* on a zero-shot 9B open-weight LLM (Qwen3.5:9b) that trades two paired-contrast metrics: it raises absolute $\Delta$ by 27% but reduces Cohen’s $d$ from $1.09$ to $0.83$; ablation localises the lift to LLM self-confidence, not to the auto-mined slogan lexicon. Qualitative ordering is preserved under a within-family cross-LLM check (Qwen3.5:27b, §[6.5](https://arxiv.org/html/2605.29188#S6.SS5)).
4. Release of the 2,190-segment scored corpus, the 170-paragraph pilot gold set, the 53-entry auto-mined slogan lexicon, and the full evaluation harness.

## 2 Related Work

#### Entrepreneurial orientation and entrepreneurial leadership\.

The management and entrepreneurship literature has long treated entrepreneurial orientation (EO) as a structured construct of innovativeness, proactiveness, risk-taking, autonomy, and competitive aggressiveness Miller ([1983](https://arxiv.org/html/2605.29188#bib.bib73)); Covin and Slevin ([1989](https://arxiv.org/html/2605.29188#bib.bib74)); Lumpkin and Dess ([1996](https://arxiv.org/html/2605.29188#bib.bib75)), and subsequent reviews emphasise both its firm-level centrality and the instability of its operationalisation across contexts Rauch et al. ([2009](https://arxiv.org/html/2605.29188#bib.bib76)); Wales ([2016](https://arxiv.org/html/2605.29188#bib.bib77)). Parallel work on corporate entrepreneurship and entrepreneurial leadership argues that such constructs are enacted through strategic posture and leader discourse rather than through isolated keywords Dess and Lumpkin ([2005](https://arxiv.org/html/2605.29188#bib.bib78)); Kuratko and Hornsby ([1999](https://arxiv.org/html/2605.29188#bib.bib79)); Harrison et al. ([2016](https://arxiv.org/html/2605.29188#bib.bib80)); Bagheri and Harrison ([2020](https://arxiv.org/html/2605.29188#bib.bib81)). Our paper inherits this measurement target and asks whether standard text-as-data pipelines recover it from SOE speeches.

#### Dictionary\-based discourse measurement\.

The Loughran–McDonald financial sentiment dictionary Loughran and McDonald ([2011](https://arxiv.org/html/2605.29188#bib.bib1)), Tetlock’s media-sentiment work Tetlock ([2007](https://arxiv.org/html/2605.29188#bib.bib91)), and the Baker–Bloom–Davis economic-policy uncertainty index Baker et al. ([2016](https://arxiv.org/html/2605.29188#bib.bib2)), with its Chinese-language extension Huang and Luk ([2020](https://arxiv.org/html/2605.29188#bib.bib3)), are emblematic of a research programme that operationalises latent corporate-discourse constructs via hand-curated word lists. Within the Chinese-corporate literature, the same template has been applied to “entrepreneurial spirit”, innovation orientation, and political-mission framing, typically as document-level proportions of seed-word hits. The implicit assumption is that the distribution of construct-related vocabulary in a document is informative about the document’s underlying stance. Our paired-contrast evaluation makes this assumption testable: if the assumption holds, dictionary scores must vary more across leaders than within them. Our findings (§[6](https://arxiv.org/html/2605.29188#S6)) indicate that, for SOE leadership speeches at least, the assumption is empirically weak.

#### Topic models, hierarchical classification, and embedding similarity\.

Latent Dirichlet Allocation Blei et al. ([2003](https://arxiv.org/html/2605.29188#bib.bib4)) is ubiquitous in CSS analyses of long political and corporate text Grimmer and Stewart ([2013](https://arxiv.org/html/2605.29188#bib.bib12)); Roberts et al. ([2014](https://arxiv.org/html/2605.29188#bib.bib13)). Embedding-based topic models Angelov ([2020](https://arxiv.org/html/2605.29188#bib.bib5)) generalise the idea to a continuous space; hierarchical multi-label text classification Chalkidis et al. ([2020](https://arxiv.org/html/2605.29188#bib.bib44)); Shen et al. ([2021](https://arxiv.org/html/2605.29188#bib.bib23)); Xu et al. ([2021](https://arxiv.org/html/2605.29188#bib.bib28)); Falis et al. ([2021](https://arxiv.org/html/2605.29188#bib.bib45)) is a closer match to our L1/L2 taxonomy but requires labelled training data, which is unavailable here. Modern Chinese sentence encoders such as BGE Xiao et al. ([2024](https://arxiv.org/html/2605.29188#bib.bib6)), building on the Sentence-BERT line Reimers and Gurevych ([2019](https://arxiv.org/html/2605.29188#bib.bib89)), achieve strong zero-shot retrieval performance and are a common default for unsupervised dimension scoring in CSS. We benchmark all three families and analyse why each underperforms on our paired-contrast task: LDA recovers industry topics rather than stance dimensions, and a domain-monolithic sentence encoder collapses per-document distances on a homogeneous corpus (§[6](https://arxiv.org/html/2605.29188#S6), §[7](https://arxiv.org/html/2605.29188#S7)).

#### Stance, framing, and performative discourse\.

A long line of NLP work distinguishes stance from topic, mostly for short-form social-media data Mohammad et al. ([2016](https://arxiv.org/html/2605.29188#bib.bib7)); Allaway and McKeown ([2020](https://arxiv.org/html/2605.29188#bib.bib8)). Framing analysis in political communication Card et al. ([2015](https://arxiv.org/html/2605.29188#bib.bib9)); Field et al. ([2018](https://arxiv.org/html/2605.29188#bib.bib14)) treats discourse as a choice among competing emphases rather than a position on a single axis, and several recent papers extend it to corporate communication Ziems et al. ([2024](https://arxiv.org/html/2605.29188#bib.bib15)). A separate strand of work in political theory and linguistics treats institutional speech as *performative*: an utterance whose primary function is not to assert but to perform an institutional act Austin ([1962](https://arxiv.org/html/2605.29188#bib.bib16)); Searle ([1969](https://arxiv.org/html/2605.29188#bib.bib17)). SOE leadership addresses are performative in this sense: the political invocations they contain serve role enactment rather than information transfer. Our slogan/substance separation operationalises this distinction quantitatively for a corporate-discourse setting in which performative and substantive content coexist within the same paragraph. This move also connects to entrepreneurship research that treats entrepreneurial communication as rhetoric, discourse, and meaning-making rather than as transparent disclosure Holt and Macpherson ([2010](https://arxiv.org/html/2605.29188#bib.bib82)); Roundy and Asllani ([2018](https://arxiv.org/html/2605.29188#bib.bib84)); Riedy ([2022](https://arxiv.org/html/2605.29188#bib.bib86)); Salmivaara and Kibler ([2020](https://arxiv.org/html/2605.29188#bib.bib87)); Caliskan and Lounsbury ([2022](https://arxiv.org/html/2605.29188#bib.bib83)); Steyaert ([2005](https://arxiv.org/html/2605.29188#bib.bib85)). Those studies motivate our central concern: what appears as “entrepreneurial” language may partly reflect a discursive template or institutional genre rather than the leader’s substantive strategic stance.

#### Construct validity in NLP measurement\.

Jacobs and Wallach ([2021](https://arxiv.org/html/2605.29188#bib.bib18)) argue that NLP-based measurement instruments routinely conflate operational measurement with theoretical construct, and call for measurement-validity diagnostics analogous to those standard in psychometrics. Our leader-change paired evaluation contributes one such diagnostic for performative corporate discourse: because the design holds the firm constant while varying the leader, methods that score the firm’s industry rather than the leader’s stance are unmasked without requiring construct-level labels. Corporate-NLP benchmarks for financial NLI, risk text, and ESG disclosure Mathur et al. ([2022](https://arxiv.org/html/2605.29188#bib.bib66)); Magomere et al. ([2025](https://arxiv.org/html/2605.29188#bib.bib62)); Tang and Yang ([2025](https://arxiv.org/html/2605.29188#bib.bib57)); He et al. ([2025](https://arxiv.org/html/2605.29188#bib.bib71)); Padhi et al. ([2024](https://arxiv.org/html/2605.29188#bib.bib36)) mostly treat the corpus as declarative rather than performative.

#### LLMs as annotators and as measurement instruments\.

Frontier LLMs have been shown to match or exceed crowd-worker agreement on stance and framing tasks Gilardi et al. ([2023](https://arxiv.org/html/2605.29188#bib.bib10)), and a growing literature uses them as scalable replacements for human annotators or as judges and probes in CSS work Ziems et al. ([2024](https://arxiv.org/html/2605.29188#bib.bib15)); Heseltine and Clemm von Hohenberg ([2024](https://arxiv.org/html/2605.29188#bib.bib19)); Koval et al. ([2024](https://arxiv.org/html/2605.29188#bib.bib63)); Liscio et al. ([2022](https://arxiv.org/html/2605.29188#bib.bib40)); Chuang et al. ([2025](https://arxiv.org/html/2605.29188#bib.bib69)). The question that follows is whether downstream methods built on top of LLM scores contribute beyond re-packaging the LLM. Our ablation (§[6.4](https://arxiv.org/html/2605.29188#S6.SS4)) addresses this directly: the slogan-aware calibration term that contributes the most absolute lift is the multiplier on the LLM’s own substantive-confidence output, not the externally mined slogan lexicon. We interpret this as evidence that, on this corpus, much of the value added by post-hoc calibration is auditable re-weighting of confidence rather than orthogonal signal recovery.

## 3 Data

#### Corpus\.

We use 80 publicly released speeches delivered between 2018 and 2021 by leaders of centrally administered Chinese state-owned enterprises and by officers of the State-owned Assets Supervision and Administration Commission (SASAC). After normalising firm names, the 80 documents cover 51 unique organisations; 29 of these appear in both interview waves. Document-level statistics are summarised in Table [1](https://arxiv.org/html/2605.29188#S3.T1).

Table 1: Corpus summary.

#### Segment construction\.

Each speech is split into paragraphs by blank lines\. Any paragraph exceeding 600 characters is recursively split, first at enumeration markers \(e\.g\.*first*,*second*\) and contrastive connectives \(e\.g\.*however*,*moreover*\), and, if necessary, at sentence boundaries\. Fragments shorter than 10 characters \(likely headings\) are dropped\. The resulting 2,190 segments have a median length of 525 characters \(P90 = 589\)\.
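The segmentation rule above can be sketched as follows. This is a minimal illustration, not the released segmenter: the marker list `MARKERS` is a hypothetical stand-in (the paper gives only English glosses of its markers), and the sketch splits at every candidate point without re-merging short pieces, so it would produce shorter segments than the reported median.

```python
import re

MAX_LEN = 600  # paragraphs longer than this are split further
MIN_LEN = 10   # shorter fragments are treated as headings and dropped

# Hypothetical marker list (assumption): enumeration markers and connectives.
MARKERS = ["首先", "其次", "第一", "第二", "然而", "但是", "此外"]
MARKER_RE = re.compile("(?=" + "|".join(map(re.escape, MARKERS)) + ")")
SENTENCE_RE = re.compile(r"(?<=[。！？!?])")  # split just after sentence-final marks


def split_long(text, max_len=MAX_LEN):
    """Recursively split text: first at markers, then at sentence boundaries."""
    if len(text) <= max_len:
        return [text]
    for pattern in (MARKER_RE, SENTENCE_RE):
        parts = [p for p in pattern.split(text) if p]
        if len(parts) > 1:
            pieces = []
            for part in parts:
                pieces.extend(split_long(part, max_len))
            return pieces
    return [text]  # no split point found; keep the long piece as is


def segment(speech):
    """Split a speech into segments following the paragraph and length rules."""
    segments = []
    for paragraph in re.split(r"\n\s*\n", speech):
        for piece in split_long(paragraph.strip()):
            piece = piece.strip()
            if len(piece) >= MIN_LEN:
                segments.append(piece)
    return segments
```

Zero-width splitting with `re.split` requires Python 3.7 or later.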

#### Gold pilot\.

A single annotator coded all paragraphs of five documents under a written protocol. Documents were stratified-sampled across leader archetypes (one regulator, one operator-style chairperson, one technology-driven firm, one CSR-themed firm, one capital-management firm) to span the expected range of stance density. The L1 label is one of slogan, substantive, or irrelevant, with the operational rule *“if the paragraph could be transplanted verbatim into a different SOE speech and still read naturally, label it slogan.”* Substantive paragraphs additionally received a dimension label (cf. §[4](https://arxiv.org/html/2605.29188#S4)). The pilot yields 170 paragraph labels (74 slogan, 93 substantive, 3 irrelevant). No method is trained on the pilot; it is a small held-out test set for gold $F_1$ (§[5](https://arxiv.org/html/2605.29188#S5)). The operational rule given to the annotator matches the rule passed to the LLM in Appendix [A](https://arxiv.org/html/2605.29188#A1), so gold $F_1$ is a rule-consistency metric rather than a construct-validity check.

#### Pilot–segment alignment\.

The auto-segmenter is slightly more granular than the hand coding. We align each annotated paragraph to a unique auto-segment by (i) strict character-prefix match after whitespace removal, (ii) substring fallback, and (iii) character-4-gram Jaccard fallback; each segment is claimed by at most one paragraph. Of the 170 paragraphs, 86 align with match score $\geq 0.7$; gold $F_1$ is computed on the 83 of these whose L1 is slogan or substantive. Lower-confidence matches are excluded; §[Limitations](https://arxiv.org/html/2605.29188#Sx1) returns to this.
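A minimal sketch of the three-tier alignment may help make the matching rule concrete. The scores assigned to the prefix and substring tiers (1.0 and 0.9) and the greedy best-first assignment are assumptions made for illustration; the paper specifies only the tier order, the 0.7 cut-off, and the one-paragraph-per-segment constraint.

```python
def normalise(text):
    """Remove all whitespace before matching."""
    return "".join(text.split())


def char_ngrams(text, n=4):
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=4):
    """Jaccard overlap of character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def match_score(paragraph, segment):
    p, s = normalise(paragraph), normalise(segment)
    if not p or not s:
        return 0.0
    if p.startswith(s) or s.startswith(p):
        return 1.0  # tier (i): prefix match; score value is an assumption
    if p in s or s in p:
        return 0.9  # tier (ii): substring fallback; score value is an assumption
    return jaccard(p, s)  # tier (iii): character-4-gram Jaccard


def align(paragraphs, segments, threshold=0.7):
    """Greedy one-to-one alignment: best-scoring pairs are claimed first."""
    scored = sorted(
        ((match_score(p, s), i, j)
         for i, p in enumerate(paragraphs)
         for j, s in enumerate(segments)),
        key=lambda item: -item[0],
    )
    used_paragraphs, used_segments, result = set(), set(), {}
    for score, i, j in scored:
        if score < threshold:
            break
        if i in used_paragraphs or j in used_segments:
            continue
        result[i] = (j, score)
        used_paragraphs.add(i)
        used_segments.add(j)
    return result
```

`align` returns a mapping from paragraph index to `(segment index, score)`; paragraphs with no match at or above the threshold are simply absent from the mapping.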

## 4 Methods

### 4.1 The five dimensions

We follow the operationalisation of entrepreneurial spirit in Gong et al. ([2024](https://arxiv.org/html/2605.29188#bib.bib20)): five coarse-grained dimensions, each associated with 25 Chinese seed words. We refer to them as Innovation, Competition–Cooperation, Organisation–Market, Social Responsibility, and National Mission.

This is a localised adaptation of the entrepreneurial orientation (EO) construct Miller ([1983](https://arxiv.org/html/2605.29188#bib.bib73)); Covin and Slevin ([1989](https://arxiv.org/html/2605.29188#bib.bib74)); Lumpkin and Dess ([1996](https://arxiv.org/html/2605.29188#bib.bib75)): Innovation and Competition–Cooperation correspond to innovativeness and competitive aggressiveness; Organisation–Market broadly covers proactiveness and autonomy via organisational/market-mechanism reform; Social Responsibility and National Mission are SOE-specific additions without direct EO counterparts. Risk-taking is not scored separately because SOE addresses rarely articulate risk preferences, a known operationalisation gap Wales ([2016](https://arxiv.org/html/2605.29188#bib.bib77)).

For non-dictionary methods we additionally provide a one-sentence natural-language description per dimension that explicitly contrasts substantive realisation with slogan realisation (e.g. Innovation: “concrete R&D investment, patent results, key-technology breakthroughs, new product or new business launches; does not include the abstract slogan *adhere to innovation-driven development*”). These descriptions serve as anchor texts for the sentence-encoder baseline and as context for the LLM prompt.

### 4.2 Baselines

#### Dictionary \(Dict\)\.

Per segment, we tokenise with the jieba Chinese segmenter and report, for each dimension, the fraction of tokens that match the dimension’s seed\-word set\.
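As a minimal sketch of this scorer, the function below takes an already tokenised segment so that it stays self-contained. The seed-word sets in the usage note are illustrative placeholders, not entries from the paper's 25-word lists.

```python
def dict_scores(tokens, seed_words):
    """Fraction of tokens matching each dimension's seed-word set.

    tokens: list of word tokens for one segment.
    seed_words: dict mapping dimension name -> set of seed words.
    """
    n_tokens = len(tokens)
    if n_tokens == 0:
        return {dim: 0.0 for dim in seed_words}
    return {
        dim: sum(token in words for token in tokens) / n_tokens
        for dim, words in seed_words.items()
    }
```

In practice the tokens would come from the jieba segmenter, for example `jieba.lcut(segment_text)`. With tokens `["创新", "研发", "市场", "改革", "的"]` and hypothetical seed sets `{"Innovation": {"创新", "研发"}, "Organisation-Market": {"市场"}}`, the scores are 0.4 and 0.2.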

#### Latent Dirichlet allocation \(Lda\)\.

We train a gensim LDA with $K=5$ topics on the 80 documents and map each topic to a dimension by greedy seed-word argmax (ties broken by raw score). Per-segment scores are projected topic probabilities under this mapping. This vanilla configuration is intentionally a weak baseline; seeded LDA, larger $K$ with merging, or STM with leader covariates may do better and are left to future work.

#### Sentence\-encoder cosine \(Bge\)\.

We embed each segment and each dimension description with the BAAI/bge-small-zh-v1.5 encoder Xiao et al. ([2024](https://arxiv.org/html/2605.29188#bib.bib6)) and use the cosine similarity as the dimension score.

### 4.3 LLM extractor (Llm)

We prompt Qwen3.5:9b Yang et al. ([2025](https://arxiv.org/html/2605.29188#bib.bib11)) running locally under Ollama in its native generate endpoint with thinking disabled. The prompt asks for a JSON object with four fields, which we will refer to throughout by their short names: a stance-type label $L_1 \in \{\textsc{slogan}, \textsc{substantive}, \textsc{irrelevant}\}$ with the same operational rule given to the human annotator; a substantive confidence $c_{\text{sub}} \in [0,1]$ giving the probability that the segment is substantive; per-dimension stance scores $s_{\text{raw}}(d) \in [0,1]$, each *conditional on the segment being substantive*; and a slogan density $\rho_{\text{llm}} \in [0,1]$, the model’s estimate of the character fraction occupied by political symbolism. The compositional structure of the stance scores and the substantive confidence is deliberate: it lets us compute either the raw Llm reading (stance conditional on being substantive) or a substance-weighted reading (stance discounted by how likely the segment is to be substantive at all), which is the basis of the calibration below.

### 4.4 Confidence-weighted calibration (Calibrated)

We weight the LLM’s per-dimension stance score by its self-reported substantive confidence and by two slogan-density penalties; the ablation in §[6.4](https://arxiv.org/html/2605.29188#S6.SS4) attributes essentially all of the paired-contrast lift to the confidence multiplier and shows the slogan-density terms to be near-inert (we retain them for interpretability). The product is a single-source weak-supervision aggregator Pruthi et al. ([2020](https://arxiv.org/html/2605.29188#bib.bib21)); Mekala et al. ([2020](https://arxiv.org/html/2605.29188#bib.bib22)). In parallel we compute a corpus-derived *n-gram slogan density* per segment: a weighted character-overlap with a lexicon mined from cross-document n-gram repetition. The lexicon (53 entries) comprises every jieba 5-gram that appears in at least 15% of the 80 documents; each entry is weighted by $\log(\mathrm{df}+1)$, where $\mathrm{df}$ is its document frequency. The mining procedure recovers, without supervision, the recurring political set phrases catalogued in Table [2](https://arxiv.org/html/2605.29188#S4.T2).
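The lexicon-mining step can be sketched as below, assuming each document has already been tokenised (for example with jieba). The function name and its parameters are illustrative, and the sketch covers only the mining and weighting, not the per-segment density, whose exact overlap formula is not given here.

```python
import math
from collections import Counter


def mine_lexicon(tokenised_docs, n=5, min_share=0.15):
    """Return {n-gram text: log(df + 1)} for n-grams in >= min_share of documents.

    tokenised_docs: list of token lists, one per document.
    """
    doc_freq = Counter()
    for tokens in tokenised_docs:
        # A set, so that each n-gram counts once per document.
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(grams)
    n_docs = len(tokenised_docs)
    return {
        "".join(gram): math.log(df + 1)
        for gram, df in doc_freq.items()
        if df / n_docs >= min_share
    }
```

On an 80-document corpus the 15% threshold corresponds to a document frequency of at least 12.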

Table 2: Six representative entries from the 53-entry auto-mined slogan lexicon, ranked by document frequency ($\mathrm{df}$, out of 80 documents). English glosses are literal translations provided for non-Chinese readers and are not used by the model.

The calibrated per-dimension score is:

$$s_{\text{cal}}(d) = s_{\text{raw}}(d) \cdot c_{\text{sub}} \cdot m_{\text{llm}} \cdot m_{\text{ng}} \tag{1}$$

with $m_{\text{llm}} = \max(0, 1 - \lambda_{\text{llm}}\,\rho_{\text{llm}})$ and $m_{\text{ng}} = \max(0, 1 - \lambda_{\text{ng}}\,\rho_{\text{ng}})$, where $s_{\text{raw}}(d)$ is the LLM’s stance score for dimension $d$, $c_{\text{sub}}$ is the LLM’s substantive confidence, $\rho_{\text{llm}}$ is the LLM-reported slogan density, and $\rho_{\text{ng}}$ is the n-gram-derived slogan density. We report two configurations. The first sets $\lambda_{\text{llm}} = \lambda_{\text{ng}} = 0$ (i.e. confidence-only calibration), which the $\lambda$ grid search in §[6](https://arxiv.org/html/2605.29188#S6) identifies as the optimum on the paired-contrast Cohen $d$ landscape. The second sets $\lambda_{\text{llm}} = 1.0$, $\lambda_{\text{ng}} = 2.0$ (full calibration), which retains the slogan-density multipliers for interpretability at a small cost in $d$. We refer to the full calibration as Calibrated throughout.
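Equation (1) translates directly into a few lines of code. The sketch below uses the full-calibration values as defaults; the function and argument names are illustrative.

```python
def calibrated_score(s_raw, c_sub, rho_llm, rho_ng, lam_llm=1.0, lam_ng=2.0):
    """Calibrated per-dimension score following Eq. (1).

    s_raw:   LLM stance score for one dimension, in [0, 1]
    c_sub:   LLM substantive confidence, in [0, 1]
    rho_llm: LLM-reported slogan density, in [0, 1]
    rho_ng:  n-gram-derived slogan density
    """
    m_llm = max(0.0, 1.0 - lam_llm * rho_llm)
    m_ng = max(0.0, 1.0 - lam_ng * rho_ng)
    return s_raw * c_sub * m_llm * m_ng
```

For example, `calibrated_score(0.8, 0.5, 0.2, 0.1)` gives 0.8 × 0.5 × 0.8 × 0.8 = 0.256, while the confidence-only configuration `calibrated_score(0.8, 0.5, 0.2, 0.1, lam_llm=0.0, lam_ng=0.0)` gives 0.4. The `max(0, ...)` clamp means a segment with n-gram slogan density of 0.5 or above is zeroed out under the full calibration.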

For all five methods, the per\-document dimension vector is the arithmetic mean of segment scores per dimension over that document’s segments; no further normalisation is applied before evaluation\.

## 5 Evaluation

### 5.1 Channel 1: leader-change paired contrast

For each method we form two paired distance distributions over per\-document dimension vectors:

$D_{\text{lc}} = \{\cos\text{-dist}(v_A, v_B)\}$ for each of the 24 same-company, different-speaker A/B pairs.

$D_{\text{sl}} = \{\cos\text{-dist}(v_A, v_B)\}$ for each of the 5 same-company, same-speaker A/B pairs.

We report the means, the absolute difference $\Delta = \bar{D}_{\text{lc}} - \bar{D}_{\text{sl}}$, Cohen’s $d$ with pooled within-group standard deviation Cohen ([2013](https://arxiv.org/html/2605.29188#bib.bib92)), and a Hedges’ $g$ small-sample correction. Two complementary uncertainty channels are reported: bootstrap CIs with 2,000 resamples Tibshirani and Efron ([1993](https://arxiv.org/html/2605.29188#bib.bib93)) (subject to known small-$n$ limitations at $n_{\text{sl}} = 5$) and exact permutation $p$-values under the null that pair labels are exchangeable, enumerating all $\binom{29}{5} = 118{,}755$ partitions.
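The statistics above can be sketched with the standard library alone. Two choices here are assumptions rather than details confirmed by the text: the permutation test uses $\Delta$ as its test statistic, and Hedges’ $g$ uses the common approximate correction factor $1 - 3/(4\,\mathrm{df} - 1)$. The bootstrap CI is omitted for brevity.

```python
import math
from itertools import combinations
from statistics import mean, variance


def cosine_distance(u, v):
    """1 - cosine similarity between two per-document dimension vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cohens_d(lc, sl):
    """Cohen's d with pooled within-group standard deviation."""
    n1, n2 = len(lc), len(sl)
    pooled_var = ((n1 - 1) * variance(lc) + (n2 - 1) * variance(sl)) / (n1 + n2 - 2)
    return (mean(lc) - mean(sl)) / math.sqrt(pooled_var)


def hedges_g(lc, sl):
    """Small-sample correction of d (approximate correction factor, assumed)."""
    dof = len(lc) + len(sl) - 2
    return cohens_d(lc, sl) * (1.0 - 3.0 / (4.0 * dof - 1.0))


def exact_permutation_p(lc, sl, tol=1e-12):
    """One-sided exact p-value: share of relabelings with Delta >= observed."""
    pooled = list(lc) + list(sl)
    n, k = len(pooled), len(sl)
    total = sum(pooled)
    observed = mean(lc) - mean(sl)
    hits = count = 0
    # Enumerate every way of choosing which k pairs are labelled same-leader.
    for idx in combinations(range(n), k):
        sl_sum = sum(pooled[i] for i in idx)
        delta = (total - sl_sum) / (n - k) - sl_sum / k
        hits += delta >= observed - tol
        count += 1
    return hits / count
```

With 24 leader-change and 5 same-leader distances, the loop visits all 118,755 partitions, which is cheap enough to run exactly rather than by sampling.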

A method that captures leader-specific stance should produce $\bar{D}_{\text{lc}} \gg \bar{D}_{\text{sl}}$. A method that captures the company’s industry or topic vocabulary should produce $\bar{D}_{\text{lc}} \approx \bar{D}_{\text{sl}}$, because both pairs hold the company constant.

### 5.2 Channel 2: gold-pilot binary $F_1$

On the 83-paragraph gold subset, each method predicts an L1 label in $\{\textsc{slogan}, \textsc{substantive}\}$ and we report $F_1$ on the substantive class. For Llm and Calibrated we use the model’s predicted L1 directly. Dict, Lda, and Bge have no native verdict, so we threshold their maximum-dimension score at the median over the 2,190 segments; the ordering of methods is robust to threshold choice in our experiments. This channel should be read alongside paired-contrast (§[6](https://arxiv.org/html/2605.29188#S6)) and paraphrase-robustness (§[6.6](https://arxiv.org/html/2605.29188#S6.SS6)): the gold L1 rule matches the LLM prompt rule, so $F_1$ here is rule-consistency, not construct-validity, and the median threshold fixes baseline positive rate at 50% by construction against a $\approx 55\%$ gold base rate.
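For the baselines without a native verdict, the thresholding and scoring can be sketched as follows. Treating a score strictly above the median as substantive is an assumption; the text does not say how ties at the median are resolved.

```python
from statistics import median


def median_threshold(corpus_scores):
    """Threshold taken over the maximum-dimension scores of all segments."""
    return median(corpus_scores)


def predict_substantive(scores, threshold):
    """True means 'substantive'; strict inequality is an assumption."""
    return [score > threshold for score in scores]


def f1_substantive(gold, pred):
    """F1 on the substantive class; gold and pred are lists of booleans."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Because the threshold is the corpus median, about half of all segments are predicted substantive regardless of the method, which is the fixed 50% positive rate noted above.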

External behavioural validity \(per\-firm per\-year correlation with patent counts, R&D intensity, or ESG scores\) is the natural third evaluation channel for this construct\. We did not collect those signals for the present corpus; we discuss what completing that step would add in §[Limitations](https://arxiv.org/html/2605.29188#Sx1)\.

## 6 Results

### 6.1 Main results

![Refer to caption](https://arxiv.org/html/2605.29188v1/x1.png)

Figure 1: Method comparison on the two evaluation channels. Markers are point estimates; error bars are 2,000-resample bootstrap 95% CIs. Lda’s paired-contrast Cohen $d$ ($0.20$) has a CI that includes zero; Dict, Lda, and Bge cluster near $F_1 \approx 0.5$ on the gold channel; Llm and Calibrated lead on both axes. Bge’s $d$ of $0.65$ is computed on doc-vector distances of order $10^{-3}$; see §[6](https://arxiv.org/html/2605.29188#S6).

Table [3](https://arxiv.org/html/2605.29188#S6.T3) and Figure [1](https://arxiv.org/html/2605.29188#S6.F1) report the headline numbers. We summarise four observations.

Table 3: Main results. $\bar{D}_{\text{lc}}$ and $\bar{D}_{\text{sl}}$ are mean cosine distances over the 24 leader-change and 5 same-leader pairs (full $\bar{D}$ values: Dict 0.147/0.049, Lda 0.213/0.148, Bge 0.001/0.001, Llm 0.149/0.044, Calibrated 0.179/0.049); $\Delta = \bar{D}_{\text{lc}} - \bar{D}_{\text{sl}}$. Cohen’s $d$ uses pooled SD; Hedges’ $g$ (in parentheses) applies a small-sample correction; 95% CIs are 2,000-resample bootstrap. $p_1$ is the exact one-sided permutation $p$-value enumerating all $\binom{29}{5} = 118{,}755$ partitions under the null of exchangeable pair labels. ROC-AUC and PR-AUC are threshold-free measures on the 83-paragraph gold subset (substantive vs slogan).

Three signals corroborate. The exact permutation test gives one-sided $p < 0.05$ for Llm ($0.034$) and Calibrated ($0.042$); the three baselines fail this one-sided threshold. Threshold-free PR-AUC peaks at Calibrated ($0.74$); Dict and Bge ROC-AUC sit below $0.5$, indicating anti-correlation with the gold substantive label.

#### Lda’s paired-contrast effect is not distinguishable from zero.

The bootstrap 95% CI on Lda’s Cohen $d$ is $[-0.72, 1.20]$ and includes zero (permutation $p_1 = 0.35$): under the leader-change paradigm we cannot reject the null that the per-document dimension vectors of a same-leader pair are drawn from the same distribution as those of a leader-change pair. Inspection of the trained topics is consistent with this finding: the five topics correspond to industry clusters (oil and petrochemicals; electricity and grid; automotive and transport; construction and infrastructure; aerospace and military) rather than to the entrepreneurial-spirit dimensions. Projecting industry topics through a seed-word mapping yields a per-document vector that varies with the firm but not with the leader.

#### Bge’s paired-contrast effect is on doc-vector distances of order $10^{-3}$.

The Bge per-document vectors lie in a narrow cone of the embedding space: $\bar{D}_{\text{lc}}=0.0012$ and $\bar{D}_{\text{sl}}=0.0005$. Although the resulting Cohen $d=0.65$ would be regarded as a medium effect on its own, the gap on which it is computed is two orders of magnitude smaller than that of the other methods, leaving no operating margin for downstream use. We attribute this collapse to the homogeneity of SOE speech vocabulary, against which a domain-agnostic sentence encoder cannot resolve leader-level variation.
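The narrow-cone effect can be illustrated with a toy example. The vectors below are hypothetical (they are not Bge embeddings): each is a large shared component plus a small document-specific offset, which is one simple way cosine distances can end up at the $10^{-3}$ scale or below even when documents differ.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical vectors: a dominant shared direction plus small offsets.
shared = [1.0, 1.0, 1.0, 1.0]
doc_a = [s + e for s, e in zip(shared, [0.03, -0.02, 0.01, 0.00])]
doc_b = [s + e for s, e in zip(shared, [-0.02, 0.03, 0.00, 0.01])]
doc_c = [s + e for s, e in zip(shared, [0.02, -0.02, 0.01, 0.00])]

d_ab = cosine_distance(doc_a, doc_b)  # "different" documents
d_ac = cosine_distance(doc_a, doc_c)  # "similar" documents
```

Both distances are tiny in absolute terms, so a standardised contrast between them can look respectable while the raw gap leaves little margin for downstream thresholds.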

#### Llm leads on both channels.

Zero-shot Qwen3.5:9b reaches $F_{1}=0.78$ on the gold subset (CI $[0.68,0.86]$, non-overlapping with any baseline CI) and Cohen $d=1.09$ on the paired contrast (CI $[0.51,1.77]$). Both effects are large. A natural interpretation is that the model’s prompt-level operationalisation of stance is already most of what is needed; the ablation in §[6.4](https://arxiv.org/html/2605.29188#S6.SS4) sharpens this reading.

#### Calibration enlarges absolute paired-contrast difference but not Cohen $d$.

Slogan-aware calibration raises the absolute paired-contrast difference $\Delta$ from $0.105$ to $0.130$ ($+27\%$); the same-leader distance $\bar{D}_{\text{sl}}$ rises by only $0.005$ while the leader-change distance $\bar{D}_{\text{lc}}$ rises by $0.030$, so the gap grows. However, the within-group standard deviation rises more than proportionally, so Cohen $d$ drops from $1.09$ to $0.83$. We report both metrics explicitly: Cohen $d$ rewards methods whose paired distances are internally stable, while absolute $\Delta$ rewards methods that discriminate at usable magnitudes on the corpus. For downstream uses that compare per-document vectors at face value (e.g. leader trajectories or firm-year aggregates), $\Delta$ is the more relevant quantity.
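As a rough check on how the two metrics can move in opposite directions, the rounded $\bar{D}$ values from Table 3 can be combined with the reported Cohen $d$ values. The sketch assumes $d=\Delta/s_{\text{pooled}}$, following the pooled-SD definition in the Table 3 caption; the symbol $s_{\text{pooled}}$ is introduced here for illustration, and the implied SDs are approximate because the inputs are rounded.

```latex
\begin{aligned}
\Delta_{\text{Llm}} &= 0.149 - 0.044 = 0.105, &
\Delta_{\text{Calibrated}} &= 0.179 - 0.049 = 0.130,\\
s_{\text{pooled}}^{\text{Llm}} &\approx \frac{0.105}{1.09} \approx 0.096, &
s_{\text{pooled}}^{\text{Calibrated}} &\approx \frac{0.130}{0.83} \approx 0.157.
\end{aligned}
```

On these rounded figures the implied pooled SD grows faster than the gap itself, which is why the standardised effect falls even though the raw difference increases.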

### 6.2 Per-dimension breakdown

![Refer to caption](https://arxiv.org/html/2605.29188v1/x2.png)

Figure 2: Per-dimension Cohen $d$ for the leader-change vs same-leader paired contrast, by method. Larger and bluer cells indicate stronger discrimination on that dimension; numeric values are the per-dimension $d$. Llm is uniformly positive across dimensions; Lda is negative on *Innovation* (leader-change pairs are more similar than same-leader pairs in LDA’s mapped innovation distribution). Bge has high Cohen $d$, but on the doc-vector distances reported in Table [3](https://arxiv.org/html/2605.29188#S6.T3) the underlying gaps remain $\sim 0.02$.

Figure [2](https://arxiv.org/html/2605.29188#S6.F2) decomposes Table [3](https://arxiv.org/html/2605.29188#S6.T3)’s aggregate $d$ into per-dimension contributions. Three patterns are visible. First, *Organisation–Market* is the dimension on which every method discriminates best ($d\geq 0.66$ across all methods), a reflection of the fact that operational decisions (mixed-ownership reform, listings, executive incentives) are the most concretely realised stance content in our corpus. Second, Lda’s projection onto *Innovation* is negatively correlated with the leader contrast ($d=-0.39$): paragraphs about technical innovation fall under the industry topics LDA discovers (aerospace, automotive, etc.), so leader changes within a firm are barely visible. Third, *National Mission* is the hardest dimension for every method; Llm ($0.44$) leads, but the absolute differences are smaller than for the other dimensions because national-mission discourse is itself heavily formulaic.

### 6.3 L2 sub-type analysis

The LLM additionally classifies every substantive segment into one of three sub-types: firm-action (the leader’s own firm-specific decisions), policy-history (recitation of national policy and institutional history), or system-aggregate (industry- or system-level statistics, e.g. “central SOEs cut 1.6 Mt of steel capacity”). Across the 1,419 substantive segments, the distribution is 57% firm-action, 29% system-aggregate, and 14% policy-history. Per-document breakdowns are consistent with expectations: an operator-style chairperson’s speech is dominated by firm-action (e.g. 81% for the China National Building Materials chairperson in our pilot), while a regulator’s address contains a balanced mix.

Aggregating the L2 fractions to per-document vectors and re-running the paired contrast gives a finer-grained answer to where leader identity is visible: $\bar{D}_{\text{lc}}-\bar{D}_{\text{sl}}=+0.144$ on the firm-action fraction, but only $+0.060$ on system-aggregate and $+0.051$ on policy-history. The leader-change signal is concentrated in firm-action content, not in policy or aggregate recitation, a finding that is invisible if one collapses all substantive content into a single channel.
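The aggregation step can be sketched as below. The helper names are hypothetical, and the pair distance on a single fraction is assumed to be an absolute difference; the paper does not spell out the distance used for this per-fraction contrast.

```python
from collections import Counter
from statistics import mean

L2_TYPES = ("firm-action", "policy-history", "system-aggregate")


def l2_fractions(segment_labels):
    """Share of each L2 sub-type among a document's substantive segments."""
    # Segments without an L2 sub-type (e.g. slogan segments) are ignored.
    counts = Counter(label for label in segment_labels if label in L2_TYPES)
    total = sum(counts.values())
    return {t: (counts[t] / total if total else 0.0) for t in L2_TYPES}


def mean_pair_gap(pairs, l2_type):
    """Mean absolute difference in one L2 fraction across document pairs."""
    return mean(
        abs(l2_fractions(a)[l2_type] - l2_fractions(b)[l2_type])
        for a, b in pairs
    )
```

Under these assumptions the reported contrast for a sub-type would be `mean_pair_gap(lc_pairs, t) - mean_pair_gap(sl_pairs, t)`, where `lc_pairs` and `sl_pairs` are placeholder names for the leader-change and same-leader document pairs.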

### 6.4 Component ablation

Table 4: Ablation of calibration components on paired-contrast.

The substantive-confidence multiplier $c_{\text{sub}}$ is the only term that materially shifts $\Delta$; the two slogan-density multipliers are near-inert in isolation and slightly over-correct when combined with $c_{\text{sub}}$.

### 6.5 Cross-model robustness

We re-scored two 85-segment stratified subsamples under Qwen3.5:27b (within-family) and DeepSeek-r1:8b (cross-family). Within-family agreement is high: Cohen’s $\kappa$ (Cohen, [1960](https://arxiv.org/html/2605.29188#bib.bib94)) is $0.75$ on L1 and $0.51$ on L2; mean per-dimension Pearson $r=0.88$. Cross-family agreement is substantial: $\kappa_{L1}=0.70$, mean per-dimension $r=0.76$, substantive-confidence $r=0.84$ (L2 degrades to $\kappa=0.30$). These results mitigate but do not eliminate pretraining-overlap concerns (§[Limitations](https://arxiv.org/html/2605.29188#Sx1)); full numbers are in Appendix [E](https://arxiv.org/html/2605.29188#A5).
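For readers who want to reproduce the agreement statistic on the released two-model scores, a minimal sketch of Cohen’s $\kappa$ for two label sequences follows. The function name is illustrative, and the labels in the usage note are placeholders rather than corpus data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if the two raters labelled independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1.0 - expected)
```

For example, `cohen_kappa(["substantive", "slogan"], ["substantive", "slogan"])` returns `1.0`, while partial agreement on an imbalanced label set is discounted by the chance term.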

### 6.6 Paraphrase robustness

We test how each method responds to *surface-form change*. For 50 segments that Qwen3.5:9b labelled substantive with confidence $\geq 0.6$, the same LLM rewrites each in slogan style while preserving the claim; we then re-score. Mean retention ratio (rewrite/original) on the maximum-dimension score is $1.55$ for Dict, $0.75$ for raw Llm, and $\mathbf{0.69}$ for Calibrated (Appendix [D](https://arxiv.org/html/2605.29188#A4)). Dict *rises* 33% because rewrites are more keyword-dense; Calibrated drops most because slogan density rises ($0.16\to 0.40$) and substantive confidence falls ($0.95\to 0.74$), propagating through the product.
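One way to compute the retention ratio is sketched below. It assumes the statistic is a mean of per-segment ratios (the text could also be read as a ratio of means), and the function name and score layout (one list of dimension scores per segment) are illustrative.

```python
from statistics import mean


def retention_ratio(original_scores, rewrite_scores):
    """Mean over segments of max-dimension score after / before rewriting.

    Assumption: per-segment ratios are averaged; segments whose original
    maximum score is zero are skipped because the ratio is undefined.
    """
    ratios = []
    for orig, rew in zip(original_scores, rewrite_scores):
        base = max(orig)
        if base > 0:
            ratios.append(max(rew) / base)
    if not ratios:
        raise ValueError("no segment has a positive original score")
    return mean(ratios)
```

A value above 1 means the slogan-style rewrite scores higher than the original, which is the undesirable direction for a method meant to track substance rather than surface form.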

### 6.7 Sensitivity to small-$n$ SL and to leader style

Two sensitivity checks test how much of the LLM advantage survives worst-case framing. *Leave-one-SL-out:* Llm’s Cohen $d$ stays in $[0.98,1.25]$ (permutation $p_{1}\leq 0.071$) and Calibrated’s in $[0.76,0.94]$, while Lda swings between $0.08$ and $0.93$, confirming that the small-$n$ SL group is the dominant source of LDA’s wide CI. *Style residualisation:* regressing each method’s per-document score on five doc-level stylometric features (sentence-length mean/SD, numeric density, long-run density, type-token ratio) cuts Llm’s Cohen $d$ from $1.09$ to $0.43$ ($p_{1}=0.22$) and Calibrated’s from $0.83$ to $0.39$; absolute $\Delta$ stays positive ($0.22$ and $0.19$). Method ordering is preserved but no method is significant post-residualisation, so roughly half of the LLM’s measured leader-change effect is consistent with leader style rather than stance. Full numbers are in Appendix [F](https://arxiv.org/html/2605.29188#A6).
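Style residualisation of a single score dimension can be sketched with ordinary least squares. This is an illustrative implementation, not the released code: it solves the normal equations directly in pure Python, assumes one score per document, and would need to be applied per dimension before recomputing pair distances.

```python
def ols_residuals(y, features):
    """Residuals of y after least-squares regression on features + intercept.

    y: one score per document; features: one list of style features per
    document. Raises ZeroDivisionError if the design matrix is singular.
    """
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(p):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [x - factor * c for x, c in zip(a[k], a[col])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    return [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(rows, y)]
```

The residuals are the part of each document’s score that the style features cannot explain linearly; recomputing the paired contrast on them gives the post-residualisation effect size.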

### 6.8 Where the methods diverge

A representative paragraph pair (Appendix [G](https://arxiv.org/html/2605.29188#A7)) illustrates the failure: a SASAC-chairperson political invocation receives a higher dictionary *Innovation* score than a CNBM operational narrative receives for *Organisation–Market*. This is systematic: across the gold pilot, slogan paragraphs have mean dictionary max-score $0.071$ vs $0.054$ for substantive ones. Dictionary scoring is therefore anti-correlated with the gold L1 label, which is the failure the calibration corrects.

A two-dimensional grid over $\lambda_{\text{llm}}\in\{0,0.5,1.0,1.5,2.0\}$ and $\lambda_{\text{ng}}\in\{0,1,2,3\}$ produces Cohen $d$ in the narrow range $[0.83,0.86]$ (full grid in Appendix [B](https://arxiv.org/html/2605.29188#A2)). The maximum is attained at $\lambda_{\text{llm}}=0$, consistent with the ablation finding that the slogan-density multipliers contribute little once $c_{\text{sub}}$ is active.

## 7 Discussion

#### Topic models and encoders recover industry, not stance\.

LDA’s discovered topics align with industrial sectors rather than with entrepreneurial-spirit dimensions; post-hoc re-labelling inherits the original distributional semantics, not the new label. Bge’s per-document vectors lie in a narrow cone of the embedding space because every SOE speech shares the same corporate and political vocabulary; dimension-aware anchor texts only mildly mitigate this.

#### LLM advantage, with caveats\.

A 9B LLM picks up signal the three baselines miss (Gilardi et al., [2023](https://arxiv.org/html/2605.29188#bib.bib10)), but three caveats apply: gold L1 matches the LLM prompt rule (§[Limitations](https://arxiv.org/html/2605.29188#Sx1)); Qwen3.5’s pretraining likely overlaps the corpus; and roughly half the effect is consistent with leader style (§[6.7](https://arxiv.org/html/2605.29188#S6.SS7)). Calibration trades $+27\%$ on $\Delta$ for $-0.26$ on Cohen $d$. The diagnostic transfers to corpora with performative-register discourse and adequate speaker-rotation coverage (parliamentary speeches, earnings calls).

## 8 Conclusion

On 80 Chinese SOE speeches, dictionary, LDA, and BGE produce weak leader-specific signal; a zero-shot 9B LLM gives a larger contrast ($p_{1}=0.034$; cross-family $\kappa=0.70$), though roughly half the effect is consistent with leader style. We frame the contribution as a measurement-validity diagnostic and release the full evaluation harness.

## Limitations

#### Corpus scope\.

80 speeches across 51 firms is small\. All speeches are from central SOEs in the 2018–2021 window; we do not claim our findings transfer to provincial SOEs, private firms, or non\-Chinese corporate discourse\. In particular our claim that BGE collapses is corpus\-specific: on more topically diverse corpora a domain\-agnostic Chinese sentence encoder might not collapse and the picture could improve\.

#### Single\-annotator gold pilot; no human\-human IAA\.

Gold L1 verdicts come from one annotator on five documents; we do not report inter-annotator agreement, and the coding rule was written in tandem with the LLM prompt (Appendix [A](https://arxiv.org/html/2605.29188#A1)), so gold $F_{1}$ measures rule consistency, not construct recovery. The reported $F_{1}$ values are a ranking signal rather than precise point estimates; a noise ceiling on this channel requires a second independent annotator and would tighten the CIs. Of 170 coded paragraphs, 86 align to a unique auto-segment with match score $\geq 0.7$; the rest are excluded from $F_{1}$.

#### Discriminative vs construct validity\.

Our paired contrast establishes that the LLM’s per-document vectors vary with leader identity holding firm constant; it does *not* establish that what varies corresponds to the theoretical EO construct or to externally observable strategic posture. A pure stylometric baseline (six features: mean paragraph length, first-person rate, hedging, certainty, recurring-cliché frequency, 2-gram lexical diversity) yields paired contrast $d=0.73$, permutation $p_{1}=0.099$ on the same 24/5 pair structure. This is weaker than Llm’s $d=1.09$ and $p_{1}=0.034$, but large enough that style is a non-trivial partial confound, not fully ruled out. Throughout we use “leader-attributable strategic stance” for the measured quantity and reserve “entrepreneurial orientation” for the theoretical construct; bridging the two requires an external behavioural channel (next paragraph).

#### Cross\-LLM coverage and residual pretraining concerns\.

We checked robustness against Qwen3.5:27b (within-family) and DeepSeek-r1:8b (cross-family). Cross-family agreement is weaker than within-family ($\kappa_{L1}=0.70$ vs $0.75$; mean per-dimension $r=0.76$ vs $0.88$; L2 $\kappa=0.30$ vs $0.51$), and both models still share a heavy Chinese-pretraining exposure plausibly including our corpus. Replication with a non-Chinese-native or older-cutoff model would tighten the contamination argument further. We also observed one anomaly, a single speech with all 28 segments labelled slogan, which is consistent with model-specific bias against the military-industry register.

#### No external behavioural validity channel\.

The natural third channel for this construct is correlation between per-firm per-year aggregate dimension scores and external behavioural indicators (patent counts and R&D intensity for *Innovation*; ESG scores for *Social Responsibility*). Completing this channel would require aligning the 51 firms to public IP-office and financial-disclosure databases over the 2018–2021 window. The present paper claims only discriminative validity (a method’s scores distinguish leaders) and not construct validity (the scores correspond to externally observable firm behaviour).

#### Sample sizes and group imbalance\.

The paired contrast rests on $n_{\text{lc}}=24$ leader-change pairs versus $n_{\text{sl}}=5$ same-leader pairs. The two groups are also imbalanced on covariates: the SL group contains the SASAC regulator (20%; LC contains 0%) and only two industry tags vs the LC group’s sixteen, which compresses the SL distribution toward a formulaic register and likely inflates the LC-vs-SL contrast. The exact permutation test partially addresses this by relabelling pairs under exchangeability, but the underlying imbalance is intrinsic to the natural experiment in this corpus and cannot be substantially corrected without a larger SL pool.

#### Cohen $d$ versus absolute $\Delta$.

Cohen $d$ is variance-controlled and conservative; absolute $\Delta$ is not. We report both, but the paraphrase-robustness experiment (§[6.6](https://arxiv.org/html/2605.29188#S6.SS6)) provides additional evidence that complements both metrics: it identifies the method whose scores actually respond to the substantive-vs-symbolic distinction at the level of an individual paragraph rewrite.

## Ethics statement

The corpus consists of publicly released speeches and lectures by senior officers of state\-owned enterprises and a state regulator, all of whom spoke in their official capacities\. We do not analyse private communications or attribute material to any natural person beyond their public professional role\.

#### Terminology\.

“Slogan” is a technical label for the performative\-political register following the operational rule given in §[3](https://arxiv.org/html/2605.29188#S3); it is neither pejorative nor a normative judgement on Chinese political discourse\. Equivalent terms in the literature include*performative\-political register*and*symbolic invocation*; both refer to the same phenomenon and could replace our label without loss\. Both performative and substantive content serve legitimate communicative functions and our analysis does not endorse a preference for either\.

#### Name handling in released artefacts\.

Speaker names appear in our case\-study \(Appendix[G](https://arxiv.org/html/2605.29188#A7)\) and in segment metadata because the speeches are public and attributable to public officials; we provide a name\-redacted variant of the corpus alongside the original for researchers who prefer it\. We discourage repurposing the released scores for individual\-leader political\-loyalty scoring or other ranking applications outside the benchmark’s intended measurement use\.

The released benchmark may be of use to researchers in computational social science studying performative corporate discourse, and to management\-science researchers seeking to validate or replace existing dictionary\-based measurement instruments\. We see no foreseeable risk of harmful application beyond the routine risks of any benchmark whose labels reflect a specific researcher’s coding protocol\.

## Reproducibility

#### Resources released\.

Source code, the 2,190\-segment scored corpus, the 170\-paragraph hand\-coded pilot, the 53\-entry auto\-mined slogan lexicon, the LLM\-produced scores under both models, and all evaluation and visualisation scripts are released at the URL given on the camera\-ready version\.

#### Hardware and runtime\.

All experiments were run on a 48 GB Apple M4 Pro workstation. The LLM extractor uses the Qwen3.5:9b open-weight model (Q4\_K\_M quantisation) served by the Ollama runtime via its native generate endpoint with thinking disabled, temperature 0, and num\_predict $=320$. The full 2,190-segment scoring run completes in approximately 30 minutes of wall-clock time; the cross-LLM subsample under Qwen3.5:27b runs in approximately 3 minutes per segment. No fine-tuning is performed; no hyperparameter is tuned on the gold pilot.
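The decoding settings above map onto a request to Ollama’s generate endpoint roughly as follows. This is a sketch under stated assumptions: the server is taken to listen on Ollama’s default local address, the lower-case model tag `qwen3.5:9b` is a guess at how the model is named locally, the `think` field requires a recent Ollama version, and the prompt itself (Appendix A of the paper) is not reproduced here.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="qwen3.5:9b"):
    """Request body mirroring the settings reported in the text."""
    return {
        "model": model,          # assumed local tag for Qwen3.5:9b
        "prompt": prompt,
        "stream": False,         # return one JSON object, not a stream
        "think": False,          # thinking disabled (recent Ollama versions)
        "options": {"temperature": 0, "num_predict": 320},
    }


def score_segment(prompt):
    """Send one prompt and return the raw text of the model's reply."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Keeping the payload construction separate from the network call makes it easy to log the exact settings used for each scoring run.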

#### Determinism and seeds\.

Bootstrap CIs use 2,000 resamples with a fixed Python random seed \(42\)\. The LLM is invoked at temperature 0 and is deterministic modulo Ollama\-internal scheduling; we re\-ran the full corpus once and verified identical L1 verdicts on a 100\-segment spot\-check\.
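A seeded bootstrap of this kind can be sketched as follows. The resampling scheme is an assumption: the sketch resamples within each pair group and takes percentile bounds, whereas the paper states only the number of resamples and the seed. Function names are illustrative.

```python
import random
from math import sqrt
from statistics import mean, variance


def cohen_d(lc, sl):
    """Cohen's d for mean(lc) - mean(sl), using the pooled sample SD."""
    n1, n2 = len(lc), len(sl)
    pooled_var = ((n1 - 1) * variance(lc) + (n2 - 1) * variance(sl)) / (n1 + n2 - 2)
    return (mean(lc) - mean(sl)) / sqrt(pooled_var)


def bootstrap_ci(lc, sl, n_resamples=2000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for Cohen's d with a fixed random seed."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    stats = []
    for _ in range(n_resamples):
        lc_star = rng.choices(lc, k=len(lc))  # resample within each group
        sl_star = rng.choices(sl, k=len(sl))
        try:
            stats.append(cohen_d(lc_star, sl_star))
        except ZeroDivisionError:
            continue  # degenerate resample with zero pooled variance
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper
```

Because the generator is seeded locally, repeated calls with the same inputs return identical bounds, which is the property a fixed seed is meant to guarantee.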

#### Baseline implementations\.

The sentence-encoder baseline uses BAAI/bge-small-zh-v1.5 (Apache-2.0 licence; $\sim 10^{8}$ parameters); LDA uses the gensim implementation with $K=5$ topics, 10 passes, and 200 iterations; dictionary scoring uses the jieba Chinese segmenter and the 25-keyword-per-dimension seed list of Gong et al. ([2024](https://arxiv.org/html/2605.29188#bib.bib20)). Dictionary scoring, LDA inference, the n-gram lexicon mining, and the calibration step all run in seconds on CPU and require no GPU.

#### Data licence\.

All speeches in the corpus were released by SASAC or the speakers’ own firms as part of the public “SOE Open Lecture \(100 Lectures\)” initiative and were collected from publicly available sources\. Speakers delivered the speeches in their official capacities\. We redistribute only material that the original sources have made public, with attribution\.

## References

- E\. Allaway and K\. McKeown \(2020\)Zero\-shot stance detection: a dataset and model using generalized topic representations\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 8913–8931\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Angelov \(2020\)Top2vec: distributed representations of topics\.arXiv preprint arXiv:2008\.09470\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px3.p1.1)\.
- J\. L\. Austin \(1962\)How to do things with words\.Oxford University Press\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Bagheri and C\. Harrison \(2020\)Entrepreneurial leadership measurement: a multi\-dimensional construct\.Journal of Small Business and Enterprise Development27\(4\),pp\. 659–679\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px1.p1.1)\.
- S\. R\. Baker, N\. Bloom, and S\. J\. Davis \(2016\)Measuring economic policy uncertainty\.The quarterly journal of economics131\(4\),pp\. 1593–1636\.Cited by:[§1](https://arxiv.org/html/2605.29188#S1.p1.1),[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px2.p1.1)\.
- D\. M\. Blei, A\. Y\. Ng, and M\. I\. Jordan \(2003\)Latent dirichlet allocation\.Journal of machine Learning research3\(Jan\),pp\. 993–1022\.Cited by:[§1](https://arxiv.org/html/2605.29188#S1.p1.1),[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Caliskan and M\. Lounsbury \(2022\)Entrepreneurialism as discourse: toward a critical research agenda\.InEntrepreneurialism and Society: New Theoretical Perspectives,pp\. 43–53\.External Links:[Document](https://dx.doi.org/10.1108/S0733-558X20220000081003)Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Card, A\. Boydstun, J\. H\. Gross, P\. Resnik, and N\. A\. Smith \(2015\)The media frames corpus: annotations of frames across issues\.InProceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing \(Volume 2: Short Papers\),pp\. 438–444\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- I\. Chalkidis, M\. Fergadiotis, S\. Kotitsas, P\. Malakasiotis, N\. Aletras, and I\. Androutsopoulos \(2020\)An empirical study on large\-scale multi\-label text classification including few and zero\-shot labels\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 7503–7515\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Chuang, G\. Chuang, C\. Chuang, and J\. Chuang \(2025\)Judging it, washing it: scoring and greenwashing corporate climate disclosures using large language models\.InProceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change \(ClimateNLP 2025\),pp\. 17–31\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px6.p1.1)\.
- J\. Cohen \(1960\)A coefficient of agreement for nominal scales\.Educational and psychological measurement20\(1\),pp\. 37–46\.Cited by:[§6\.5](https://arxiv.org/html/2605.29188#S6.SS5.p1.8)\.
- J\. Cohen \(2013\)Statistical power analysis for the behavioral sciences\.routledge\.Cited by:[§5\.1](https://arxiv.org/html/2605.29188#S5.SS1.p3.7)\.
- J\. G\. Covin and D\. P\. Slevin \(1989\)Strategic management of small firms in hostile and benign environments\.Strategic management journal10\(1\),pp\. 75–87\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.29188#S4.SS1.p2.1)\.
- G\. G\. Dess and G\. T\. Lumpkin \(2005\)The role of entrepreneurial orientation in stimulating effective corporate entrepreneurship\.Academy of Management Perspectives19\(1\),pp\. 147–156\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[§1](https://arxiv.org/html/2605.29188#S1.p1.1)\.
- M\. Falis, H\. Dong, A\. Birch, and B\. Alex \(2021\)CoPHE: a count\-preserving hierarchical evaluation metric in large\-scale multi\-label text classification\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 907–912\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Field, D\. Kliger, S\. Wintner, J\. Pan, D\. Jurafsky, and Y\. Tsvetkov \(2018\)Framing and agenda\-setting in Russian news: a computational analysis of intricate political strategies\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,pp\. 3570–3580\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- F\. Gilardi, M\. Alizadeh, and M\. Kubli \(2023\)ChatGPT outperforms crowd workers for text\-annotation tasks\.Proceedings of the National Academy of Sciences120\(30\),pp\. e2305016120\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px6.p1.1),[§7](https://arxiv.org/html/2605.29188#S7.SS0.SSS0.Px2.p1.4)\.
- T\. Gong, B\. Dou, and Y\. Wang \(2024\)Unveiling entrepreneurship in chinese state\-owned enterprises: a computational linguistic analysis\.InAcademy of Management Proceedings,Vol\.2024,pp\. 16271\.Cited by:[§4\.1](https://arxiv.org/html/2605.29188#S4.SS1.p1.1),[Baseline implementations\.](https://arxiv.org/html/2605.29188#Sx3.SS0.SSS0.Px4.p1.2)\.
- J\. Grimmer and B\. M\. Stewart \(2013\)Text as data: the promise and pitfalls of automatic content analysis methods for political texts\.Political analysis21\(3\),pp\. 267–297\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Harrison, S\. Paul, and K\. Burnard \(2016\)Entrepreneurial leadership: a systematic literature review\.\.International Review of Entrepreneurship14\(2\)\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px1.p1.1)\.
- C\. He, X\. Zhou, Y\. Wu, X\. Yu, Y\. Zhang, L\. Zhang, D\. Wang, S\. Lyu, H\. Xu, W\. Xiaoqiao,et al\.\(2025\)Esgenius: benchmarking llms on environmental, social, and governance \(esg\) and sustainability knowledge\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 14623–14664\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px5.p1.1)\.
- M\. Heseltine and B\. Clemm von Hohenberg \(2024\)Large language models as a substitute for human experts in annotating political text\.Research & Politics11\(1\)\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px6.p1.1)\.
- R\. Holt and A\. Macpherson \(2010\)Sensemaking, rhetoric and the socially competent entrepreneur\.International small business journal28\(1\),pp\. 20–42\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- Y\. Huang and P\. Luk \(2020\)Measuring economic policy uncertainty in china\.China economic review59,pp\. 101367\.Cited by:[§1](https://arxiv.org/html/2605.29188#S1.p1.1),[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Z\. Jacobs and H\. Wallach \(2021\)Measurement and fairness\.Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency,pp\. 375–385\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px5.p1.1)\.
- R\. Koval, N\. Andrews, and X\. Yan \(2024\)Learning to compare financial reports for financial forecasting\.InFindings of the Association for Computational Linguistics: EACL 2024,pp\. 500–512\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px6.p1.1)\.
- D\. F\. Kuratko and J\. S\. Hornsby \(1999\)Corporate entrepreneurial leadership for the 21st century\.Journal of Leadership Studies5\(2\),pp\. 27–39\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px1.p1.1)\.
- E\. Liscio, A\. E\. Dondera, A\. Geadău, C\. M\. Jonker, and P\. K\. Murukannaiah \(2022\)Cross\-domain classification of moral values\.InFindings of the association for computational linguistics: NAACL 2022,pp\. 2727–2745\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px6.p1.1)\.
- T\. Loughran and B\. McDonald \(2011\)When is a liability not a liability? textual analysis, dictionaries, and 10\-ks\.The Journal of finance66\(1\),pp\. 35–65\.Cited by:[§1](https://arxiv.org/html/2605.29188#S1.p1.1),[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px2.p1.1)\.
- G\. T\. Lumpkin and G\. G\. Dess \(1996\)Clarifying the entrepreneurial orientation construct and linking it to performance\.Academy of management Review21\(1\),pp\. 135–172\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.29188#S4.SS1.p2.1)\.
- J\. Magomere, E\. Kochkina, S\. Mensah, S\. Kaur, and C\. Smiley \(2025\)FinNLI: novel dataset for multi\-genre financial natural language inference benchmarking\.InFindings of the Association for Computational Linguistics: NAACL 2025,pp\. 4545–4568\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px5.p1.1)\.
- P\. Mathur, M\. Goyal, R\. Sawhney, R\. Mathur, J\. L\. Leidner, F\. Dernoncourt, and D\. Manocha \(2022\)Docfin: multimodal financial prediction and bias mitigation using semi\-structured documents\.InFindings of the Association for Computational Linguistics: EMNLP 2022,pp\. 1933–1940\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px5.p1.1)\.
- D\. Mekala, X\. Zhang, and J\. Shang \(2020\)Meta: metadata\-empowered weak supervision for text classification\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 8351–8361\.Cited by:[§4\.4](https://arxiv.org/html/2605.29188#S4.SS4.p1.2)\.
- T\. Mikolov, I\. Sutskever, K\. Chen, G\. S\. Corrado, and J\. Dean \(2013\)Distributed representations of words and phrases and their compositionality\.Advances in neural information processing systems26\.Cited by:[§1](https://arxiv.org/html/2605.29188#S1.p1.1)\.
- D\. Miller \(1983\)The correlates of entrepreneurship in three types of firms\.Management science29\(7\),pp\. 770–791\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.29188#S4.SS1.p2.1)\.
- S\. Mohammad, S\. Kiritchenko, P\. Sobhani, X\. Zhu, and C\. Cherry \(2016\)Semeval\-2016 task 6: detecting stance in tweets\.InProceedings of the 10th international workshop on semantic evaluation \(SemEval\-2016\),pp\. 31–41\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- I\. Padhi, K\. N\. Ramamurthy, P\. Sattigeri, M\. Nagireddy, P\. Dognin, and K\. R\. Varshney \(2024\)Value alignment from unstructured text\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track,pp\. 1083–1095\.Cited by:[§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px5.p1.1)\.
- D\. Pruthi, B\. Dhingra, G\. Neubig, and Z\. C\. Lipton \(2020\) Weakly\- and semi\-supervised evidence extraction\. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp\. 3965–3970\. Cited by: [§4\.4](https://arxiv.org/html/2605.29188#S4.SS4.p1.2)\.
- A\. Rauch, J\. Wiklund, G\. T\. Lumpkin, and M\. Frese \(2009\) Entrepreneurial orientation and business performance: an assessment of past research and suggestions for the future\. Entrepreneurship Theory and Practice 33\(3\), pp\. 761–787\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Reimers and I\. Gurevych \(2019\) Sentence\-BERT: sentence embeddings using siamese BERT\-networks\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), pp\. 3982–3992\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Riedy \(2022\) Discursive entrepreneurship: ethical meaning\-making as a transformative practice for sustainable futures\. Sustainability Science 17\(2\), pp\. 541–554\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- M\. E\. Roberts, B\. M\. Stewart, D\. Tingley, C\. Lucas, J\. Leder\-Luis, S\. K\. Gadarian, B\. Albertson, and D\. G\. Rand \(2014\) Structural topic models for open\-ended survey responses\. American Journal of Political Science 58\(4\), pp\. 1064–1082\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px3.p1.1)\.
- P\. T\. Roundy and A\. Asllani \(2018\) The themes of entrepreneurship discourse: a data analytics approach\. Journal of Entrepreneurship, Management and Innovation 14\(3\), pp\. 127–158\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- V\. Salmivaara and E\. Kibler \(2020\) “Rhetoric mix” of argumentations: how policy rhetoric conveys meaning of entrepreneurship for sustainable development\. Entrepreneurship Theory and Practice 44\(4\), pp\. 700–732\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- J\. R\. Searle \(1969\) Speech acts: an essay in the philosophy of language\. Cambridge University Press\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Shen, W\. Qiu, Y\. Meng, J\. Shang, X\. Ren, and J\. Han \(2021\) TaxoClass: hierarchical multi\-label text classification using only class names\. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp\. 4239–4249\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Steyaert \(2005\) Narrative and discursive approaches in entrepreneurship: a second movements in entrepreneurship book\. Edward Elgar Publishing\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1)\.
- Y\. Tang and Y\. Yang \(2025\) FinMTEB: finance massive text embedding benchmark\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 3620–3638\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px5.p1.1)\.
- P\. C\. Tetlock \(2007\) Giving content to investor sentiment: the role of media in the stock market\. The Journal of Finance 62\(3\), pp\. 1139–1168\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px2.p1.1)\.
- R\. J\. Tibshirani and B\. Efron \(1993\) An introduction to the bootstrap\. Monographs on Statistics and Applied Probability 57\(1\), pp\. 1–436\. Cited by: [§5\.1](https://arxiv.org/html/2605.29188#S5.SS1.p3.7)\.
- W\. J\. Wales \(2016\) Entrepreneurial orientation: a review and synthesis of promising research directions\. International Small Business Journal 34\(1\), pp\. 3–15\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px1.p1.1), [§4\.1](https://arxiv.org/html/2605.29188#S4.SS1.p2.1)\.
- S\. Xiao, Z\. Liu, P\. Zhang, N\. Muennighoff, D\. Lian, and J\. Nie \(2024\) C\-Pack: packed resources for general Chinese embeddings\. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp\. 641–649\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px3.p1.1), [§4\.2](https://arxiv.org/html/2605.29188#S4.SS2.SSS0.Px3.p1.1)\.
- L\. Xu, S\. Teng, R\. Zhao, J\. Guo, C\. Xiao, D\. Jiang, and B\. Ren \(2021\) Hierarchical multi\-label text classification with horizontal and vertical category correlations\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp\. 2459–2468\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [§4\.3](https://arxiv.org/html/2605.29188#S4.SS3.p1.4)\.
- C\. Ziems, W\. Held, O\. Shaikh, J\. Chen, Z\. Zhang, and D\. Yang \(2024\) Can large language models transform computational social science?\. In Computational Linguistics, Vol\. 50, pp\. 237–291\. Cited by: [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px4.p1.1), [§2](https://arxiv.org/html/2605.29188#S2.SS0.SSS0.Px6.p1.1)\.

## Appendix A LLM extraction prompt \(original Chinese\)

The prompt below is sent verbatim to Qwen3\.5:9b for each segment\. `{dim_descriptions}` is replaced by the five short\-form dimension descriptions defined in §[4](https://arxiv.org/html/2605.29188#S4); `{segment_text}` is replaced by the segment text\. The model is invoked through the Ollama native `/api/generate` endpoint with `think: false`, `temperature: 0`, and `num_predict: 320`\. We provide a literal English gloss beneath each Chinese line for readers unfamiliar with the language\.

> 分析下面这段中国国企领导讲话片段。仅输出 JSON。
> \(Analyse the following segment of a Chinese SOE leadership speech\. Output JSON only\.\)
>
> 五个维度:\{dim\_descriptions\}
> \(Five dimensions: \[substituted at run time\]\)
>
> 判定 l1:
> \(Decide L1:\)
>
> \- slogan:抽象口号 / 政治表态 / 领导人引文 / 四字格 / 无具体主体\-动词\-数字\-取舍。判别:可原封不动搬到任意国企讲话 → slogan
> \(slogan: abstract slogans / political statements / leader quotations / four\-character formulas / no specific subject\-verb\-number\-tradeoff\. Heuristic: if the paragraph could be transplanted verbatim into any other SOE speech, label slogan\.\)
>
> \- substantive:具体动词\+客体 / 资源配置数字 / 可观测后果 / 对比取舍 / 时地特异性,满足 ≥2 项
> \(substantive: at least two of \{specific verb\+object, resource allocation in numbers, observable outcome, contrast or tradeoff, specificity of time/place/object\}\.\)
>
> \- irrelevant:寒暄、过渡、与 5 维度均无关
> \(irrelevant: greetings, transitions, not related to any of the five dimensions\.\)
>
> l2(仅 substantive 填):firm\_action(本企业 leader 决策)/ policy\_history(政策制度叙事)/ system\_aggregate(系统聚合统计)
> \(L2, only when L1 = substantive: firm\_action / policy\_history / system\_aggregate; defined in §[6\.3](https://arxiv.org/html/2605.29188#S6.SS3)\.\)
>
> 输出严格 JSON:
> \(Strict JSON output:\)
>
> \{ "l1": "slogan\|substantive\|irrelevant", "l2": "", "confidence\_substantive": 0\.0, "stance\_scores": \{\.\.\.five keys\.\.\.\}, "slogan\_density": 0\.0 \}
>
> confidence\_substantive:是 substantive 的概率(0\-1)
> stance\_scores\[dim\]:假设这段是 substantive 时该维度的 stance 强度(0\-1)。维度词出现 ≠ 高分,要看具体行动/数字/取舍
> slogan\_density:套话密度(Xi 引文\+四字格\+抽象口号 占段比例,0\-1)
> \(Field semantics: confidence\_substantive is P\(segment is substantive\); stance\_scores\[dim\] is stance strength on the dimension *conditional on substantive*; the presence of dimension keywords does not entail a high score, what matters is concrete actions, numbers, and tradeoffs; slogan\_density is the character fraction occupied by political set phrases, in \[0,1\]\.\)
>
> 段落:\{segment\_text\}
> \(Segment: \[substituted at run time\]\)
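The invocation described above can be sketched as a request body for Ollama's `/api/generate` endpoint plus a tolerant parser for the reply. This is a minimal illustration, not the authors' harness: the shortened `PROMPT_TEMPLATE`, the helper names, the lowercase model tag `qwen3.5:9b`, and `stream: False` are assumptions; only `think`, `temperature`, and `num_predict` come from the text.

```python
import json

# Hypothetical, shortened stand-in for the full Chinese prompt shown above.
PROMPT_TEMPLATE = (
    "五个维度:{dim_descriptions}\n"
    '输出严格 JSON: { "l1": "slogan|substantive|irrelevant" }\n'
    "段落:{segment_text}"
)

VALID_L1 = {"slogan", "substantive", "irrelevant"}


def build_request(segment_text, dim_descriptions, model="qwen3.5:9b"):
    """Build the JSON body for a POST to Ollama's /api/generate endpoint."""
    # str.replace instead of str.format: the template contains literal
    # JSON braces that str.format would try to interpret as fields.
    prompt = (
        PROMPT_TEMPLATE
        .replace("{dim_descriptions}", dim_descriptions)
        .replace("{segment_text}", segment_text)
    )
    return {
        "model": model,            # assumed tag for Qwen3.5:9b
        "prompt": prompt,
        "stream": False,           # assumption: one complete reply per call
        "think": False,            # from the text
        "options": {"temperature": 0, "num_predict": 320},  # from the text
    }


def parse_reply(reply_text):
    """Return the parsed JSON object, or None if the reply is unusable."""
    try:
        obj = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or obj.get("l1") not in VALID_L1:
        return None
    return obj
```

With `num_predict` capped at 320 tokens a reply can be truncated mid-object, which is why the parser returns `None` rather than raising; how the released harness handles such cases is not stated in the paper.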

## Appendix B Full $\lambda$ grid

Table 5: Selected entries from the $\lambda$ grid search \(§[6](https://arxiv.org/html/2605.29188#S6)\)\. The landscape is flat: across all 20 grid points, Cohen $d$ remains within $[0.83, 0.86]$\.

## Appendix C Slogan lexicon \(full\)

The full 53\-entry auto\-mined slogan lexicon, along with the six high\-frequency entries shown in Table [2](https://arxiv.org/html/2605.29188#S4.T2), is released with the corpus\. Entries are jieba 5\-grams that appear in at least 15% of the 80 documents\. Examples not shown in the main text include the four\-character formulations *“two unwaverings”* \(the dual commitment to public\-sector and non\-public\-sector economic development\), *“new development concept”*, and recurring references to “the 18th Party Congress” and onwards\.
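The document-frequency rule stated above can be sketched as follows. This is an illustration under assumptions: documents are taken to be already segmented into token lists (for example by jieba), "5\-gram" is read as five consecutive tokens, and `mine_ngrams` is a hypothetical helper name. Any further filtering the authors apply is not described here.

```python
from collections import Counter


def mine_ngrams(tokenised_docs, n=5, min_doc_frac=0.15):
    """Return token n-grams that occur in at least min_doc_frac of documents."""
    doc_freq = Counter()
    for tokens in tokenised_docs:
        # Use a set so each document counts a given n-gram at most once.
        seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(seen)
    threshold = min_doc_frac * len(tokenised_docs)
    return sorted(g for g, df in doc_freq.items() if df >= threshold)
```

Counting documents rather than raw occurrences keeps a single repetitive speech from promoting its own phrasing into the lexicon.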

## Appendix D Paraphrase robustness table

Table 6: Paraphrase robustness on 50 substantive\-to\-slogan rewrites \(§[6\.6](https://arxiv.org/html/2605.29188#S6.SS6)\)\. $\text{mean}_{\text{orig}}$ and $\text{mean}_{\text{rew}}$ are mean per\-segment maximum\-dimension scores; retention is the mean per\-pair ratio of rewrite to original\. Stance\-aware methods should retain less than 1; surface methods retain near 1 \(or above, if the rewrite is more keyword\-dense than the original\)\.
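The retention statistic in the caption can be illustrated with a short sketch. It assumes each segment's scores are held in a dict keyed by dimension, and it skips pairs whose original maximum score is zero; both choices, and the function name `retention`, are assumptions rather than details taken from the paper.

```python
from statistics import mean


def retention(orig_scores, rew_scores):
    """Mean per-pair ratio of rewrite to original maximum-dimension score."""
    ratios = []
    for orig, rew in zip(orig_scores, rew_scores):
        o = max(orig.values())   # strongest dimension in the original
        r = max(rew.values())    # strongest dimension in the rewrite
        if o > 0:                # assumption: drop undefined ratios
            ratios.append(r / o)
    return mean(ratios)
```

Because the ratio is averaged per pair, it need not equal the ratio of the two column means reported alongside it.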

## Appendix E Cross\-LLM agreement details

We report two cross\-LLM checks against Qwen3\.5:9b \(the model used in the main run\), each on an 85\-segment stratified subsample \(see §[6\.5](https://arxiv.org/html/2605.29188#S6.SS5)\)\.

#### Within\-family \(Qwen3\.5:27b\)\.

L1 agreement 86%; $\kappa_{L1} = 0.746$; $\kappa_{L2} = 0.510$ on the 45 mutually substantive segments\. Per\-dimension Pearson $r$ on raw stance scores: Innovation 0\.900, Competition–Cooperation 0\.816, Organisation–Market 0\.921, Social Responsibility 0\.929, National Mission 0\.844; mean 0\.882\. Slogan density correlates at $r = 0.729$ and substantive confidence at $r = 0.890$\.

#### Cross\-family \(DeepSeek\-r1:8b\)\.

L1 agreement 83%; $\kappa_{L1} = 0.695$; $\kappa_{L2} = 0.297$ on the 50 mutually substantive segments\. Per\-dimension Pearson $r$: Innovation 0\.846, Competition–Cooperation 0\.727, Organisation–Market 0\.714, Social Responsibility 0\.779, National Mission 0\.752; mean 0\.764\. Slogan density correlates at $r = 0.719$ and substantive confidence at $r = 0.844$\. Cross\-family agreement is weaker than within\-family but still substantial on L1 and the confidence channel, suggesting that the slogan\-vs\-substance distinction is recovered consistently across architecture families\.
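For readers who want to reproduce the agreement figures from the released two\-family scores, Cohen's $\kappa$ on two label sequences can be computed as in the sketch below. It is the standard unweighted two\-rater formula; the function name and the list\-of\-labels input format are assumptions about how the scores would be loaded.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both raters always use one identical label.
    return (p_obs - p_exp) / (1 - p_exp)
```

Raw agreement and $\kappa$ can diverge sharply when one label dominates, which is one reason both are reported above.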

## Appendix F Sensitivity analyses \(full\)

#### Leave\-one\-SL\-out\.

Dropping each of the 5 same\-leader pairs in turn and recomputing \($n_{\text{lc}} = 24$, $n_{\text{sl}} = 4$\): Llm Cohen $d \in [0.98, 1.25]$, permutation $p_1 \in [0.024, 0.071]$; Calibrated $d \in [0.76, 0.94]$, $p_1 \in [0.029, 0.100]$; Dict $d \in [0.73, 0.98]$, $p_1 \in [0.072, 0.142]$; Lda $d \in [0.08, 0.93]$, $p_1 \in [0.048, 0.457]$; Bge $d \in [0.55, 0.95]$, $p_1 \in [0.020, 0.262]$\.

#### Placebo \(random SL/LC reassignment\)\.

We draw 2000 random partitions of the 29 document pairs into a size\-5 “SL” group and a size\-24 “LC” group, and report the fraction of placebo trials with Cohen $d \geq$ the observed value: Llm 3\.9%, Calibrated 3\.5%, Dict 7\.4%, Bge 14\.4%, Lda 35\.7%\. Only the two LLM\-based methods \(Llm and Calibrated\) reject the null at $\alpha = 0.05$\.
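The placebo procedure can be sketched in a few lines. The pooled\-SD form of Cohen's $d$ used here is an assumption, since this appendix does not restate the exact definition, and `placebo_fraction` with its arguments is a hypothetical helper operating on one per\-pair distance per document pair.

```python
import random
from statistics import mean, variance


def cohen_d(lc, sl):
    """Cohen's d with pooled SD (assumed definition): LC mean minus SL mean."""
    n1, n2 = len(lc), len(sl)
    pooled = ((n1 - 1) * variance(lc) + (n2 - 1) * variance(sl)) / (n1 + n2 - 2)
    return (mean(lc) - mean(sl)) / pooled ** 0.5


def placebo_fraction(lc, sl, trials=2000, seed=0):
    """Share of random SL/LC relabellings whose d is at least the observed d."""
    rng = random.Random(seed)
    observed = cohen_d(lc, sl)
    pool = list(lc) + list(sl)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pool)
        # Keep the original group sizes (5 "SL", 24 "LC" in the paper).
        fake_sl, fake_lc = pool[:len(sl)], pool[len(sl):]
        if cohen_d(fake_lc, fake_sl) >= observed:
            hits += 1
    return hits / trials
```

With only 5 pairs in the smaller group there are 118,755 distinct partitions, so 2000 draws sample this space rather than enumerate it.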

#### Style\-residualised paired contrast\.

Per\-document dimension scores are residualised against five stylometric features \(sentence\-length mean and SD, numeric density, long\-run density as a named\-entity proxy, character type\-token ratio\)\. Paired contrast on residuals: Llm $d = 0.43$, permutation $p_1 = 0.22$; Calibrated $d = 0.39$, $p_1 = 0.23$; Bge $d = 0.71$, $p_1 = 0.12$; Dict $d = 0.51$, $p_1 = 0.16$; Lda $d = 0.35$, $p_1 = 0.26$\. Absolute $\Delta$ remains positive for Llm \(0\.22\) and Calibrated \(0\.19\); the rank ordering of methods is preserved but no method is significant at $\alpha = 0.05$ after this aggressive residualisation\.
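One way to read "residualised against" is ordinary least squares with an intercept, keeping the residuals as the style\-adjusted scores. The sketch below assumes that reading, which the text does not confirm, and computes the residuals by Gram\-Schmidt orthogonalisation so that it needs only the standard library. `residualise` and its column\-wise `features` argument are hypothetical.

```python
def residualise(y, features):
    """OLS residuals of y on an intercept plus the given feature columns."""
    n = len(y)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Build an orthogonal basis for the intercept and feature columns.
    basis = []
    for col in [[1.0] * n] + [list(c) for c in features]:
        v = list(col)
        for b in basis:
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        if dot(v, v) > 1e-12:        # skip numerically collinear columns
            basis.append(v)

    # Subtract the projection of y onto each basis vector.
    resid = [float(v) for v in y]
    for b in basis:
        coef = dot(resid, b) / dot(b, b)
        resid = [ri - coef * bi for ri, bi in zip(resid, b)]
    return resid
```

In the setting above, `y` would hold one dimension's 80 per\-document scores and `features` the five stylometric columns; the paired contrast is then recomputed on the returned residuals.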

## Appendix G Case\-study paragraph pair

Table 7: The qualitative case\-study paragraph pair referenced in §[6\.8](https://arxiv.org/html/2605.29188#S6.SS8)\. The dictionary baseline scores the slogan paragraph as more strongly indicative of *Innovation* than the substantive paragraph is of *Organisation–Market*\. The Llm’s substantive\-confidence multiplier collapses the slogan paragraph’s stance vector while preserving the substantive paragraph’s\.
