Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data

arXiv cs.CL 04/22/26, 04:00 AM Papers
Summary
The paper introduces an LLM-based topic modeling method and evaluation framework that simultaneously achieves interpretability, topic specificity, and polarity stance consistency, demonstrating superior explanatory power for external outcomes like employee morale using large-scale Japanese corporate review data.
arXiv:2604.18919v1 Announce Type: new Abstract: Analyzing topics extracted from text data in relation to external outcomes is important across fields such as computational social science and organizational research. However, existing topic modeling methods struggle to simultaneously achieve interpretability, topic specificity (alignment with concrete actions or characteristics), and polarity stance consistency (absence of mixed positive and negative evaluations within a topic). Focusing on leadership analysis using corporate review data, this study proposes a method leveraging large language models to generate topics that satisfy these properties, along with an evaluation framework tailored to external outcome analysis. The framework explicitly incorporates topic specificity and polarity stance consistency as evaluation criteria and examines automated evaluation methods based on existing metrics. Using employee reviews from OpenWork, a major corporate review platform in Japan, the proposed method achieves improved interpretability, specificity, and polarity consistency compared to existing approaches. In analyses of external outcomes such as employee morale, it also produces topics with higher explanatory power. These results suggest that the proposed method and evaluation framework provide a generalized approach for topic analysis in applications involving external outcomes.
Original Article
View Cached Full Text
Cached at: 04/22/26, 08:29 AM
# Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data
Source: [https://arxiv.org/html/2604.18919](https://arxiv.org/html/2604.18919)
\\\_\_setCJKmainfont

\[\]ipaexm\.ttf\\\_\_setCJKsansfont\[\]ipaexg\.ttf

,Masato KanaiAccenture JapanJapan,Masataka NakayamaKyoto UniversityJapan,Haruki OhsawaOpenworkJapan,Yukiko UchidaKyoto UniversityJapan,Arata Yuminaga,Gakuse HoshinaAccenture JapanJapanandNobuo SayamaIntegralJapan

###### Abstract\.

Analyzing topics extracted from text data in relation to external outcomes is important across a wide range of fields, including computational social science, organizational research, and marketing\. However, in existing topic modeling methods, it is difficult to simultaneously achieve topic interpretability, which is important for interpreting relationships with external outcomes; topic specificity, defined as alignment with specific and concrete actions or characteristics; and polarity stance consistency, defined as the absence of mixed positive and negative evaluations within the same topic\. Focusing on leadership analysis based on corporate review data, this study proposes a method that leverages large language models to generate topics with interpretability, topic specificity, and polarity stance consistency, while simultaneously introducing an evaluation framework suitable for external outcome analysis\. The proposed evaluation framework explicitly positions topic specificity and polarity stance consistency, which have not been sufficiently addressed in the existing literature, as evaluation criteria, and examines the validity of automated evaluation methods applied to existing metrics\. This framework enables a multidimensional examination of the characteristics of generated topics\. Analyses using reviews posted by current and former employees on OpenWork, one of the largest corporate review platforms in Japan, showed that the proposed method simultaneously achieves interpretability, topic specificity, and polarity stance consistency\. In analyses of external outcomes such as employee morale, the method generated topics with consistently higher explanatory power than existing methods\. This study newly proposes extended methodological approaches and evaluation criteria for topic analysis aimed at understanding relationships with external outcomes, and demonstrates their potential to generalize to a wide range of application domains with similar analytical requirements\.

Topic models and Evaluation metrics

††copyright:none††conference:; ;## 1\.Introduction

### 1\.1\.Leadership, Performance, and Employee Engagement

Improving organizational performance has long been a central managerial objective, and in recent years, psychological aspects of employees, such as vitality and work engagement, have also been recognized as playing a key role in organizational performance\(Harter2002;Judge2001\)\. Based on these findings, it is important to simultaneously examine factors that promote both performance outcomes and employees’ psychological well\-being\.

In this context, leadership has been widely recognized as a key factor influencing both organizational performance and employee engagement\. Leadership at multiple organizational levels, ranging from top executives to middle managers and direct supervisors, is associated with employees’ high performance and favorable psychological states\(Judge2004;Montano2017\)\.

Since the 1930s, leadership has been a central topic in management and psychology, and diverse theoretical frameworks have been developed, primarily in Western contexts\(House1997;Solansky2017\)\. Early trait and behavioral theories later evolved into more complex frameworks, such as transformational and transactional leadership\(DeRue2011\)\. Building on these foundations, various measurement approaches have been developed, including leadership scales and multidimensional frameworks\(Avolio1999;WarnerSoderholm2020\)\. As a result, a substantial body of empirical research has accumulated, and meta\-analytic studies have synthesized findings from individual studies \(e\.g\.,\(DeRue2011;Judge2004;Montano2017\)\)\.

However, although leadership research has produced a substantial body of knowledge, several challenges remain\. Existing evidence is heavily concentrated in Western cultural contexts, analyses often rely on specific theoretical frameworks, and cross\-firm studies tend to abstract leadership characteristics into broad categories\. These limitations motivate the need for data\-driven, cross\-company approaches that enable fine\-grained analysis of leadership behaviors across diverse cultural contexts\.

### 1\.2\.Approach of This Study

This study aims to address these three challenges\. To this end, we employ large\-scale text data accumulated on the corporate review platform "OpenWork" and applies topic modeling to reviews posted by current and former employees, primarily focusing on Japanese companies\(openwork\)\. This approach enables an examination of the relationships between leadership characteristics reflected in the reviews and corporate performance as well as employee morale\. It is made possible by the combination of user\-generated, objective cross\-company data such as OpenWork reviews and recent advances in natural language processing techniques, including large language models \(LLMs\)\. The review data analyzed in this study reflect employees’ spontaneous evaluations of their companies, enabling a detailed and realistic understanding of leadership practices in Japanese companies from the employees’ perspective\. Advances in LLMs make it possible to conduct highly accurate and flexible analyses of large\-scale unstructured data efficiently\.

## 2\.Related Work

### 2\.1\.Limitations of Existing Leadership Research

Despite the accumulation of empirical evidence reviewed above, at least three important challenges remain in leadership research\.

First, leadership research has predominantly relied on studies conducted in Western organizations, as reflected in major theoretical frameworks and many meta\-analyses\(House1997;Solansky2017;DeRue2011;Judge2004;Montano2017;Schimmelpfennig2025\)\. Although some theories have discussed leadership characteristics expected in Japanese organizations \(e\.g\.,\(Misumi1985\)\), and cultural differences in other\-orientation among high\-ranking individuals between Western and East Asian societies have been documented\(GobelMiyamoto2023\), systematic empirical evidence from non\-Western contexts remains limited\.

Second, leadership research has largely been structured around established theoretical frameworks \(e\.g\.,\(DeRue2011\)\)\. While theory\-driven approaches have contributed to the validation and refinement of existing theories, they are prone to mismatches between theoretical frameworks and empirical methods\(Dinh2014\), which may obscure influence processes that are not fully considered within a given theory\. Although a small number of recent studies have begun to explore inductive approaches using free\-text data \(e\.g\.,\(tonidandel2022leadership\)\), research that comprehensively identifies leadership characteristics through data\-driven analyses of large\-scale text remains scarce\.

Third, insights derived from cross\-firm leadership research tend to remain at an abstract level\. This is partly because many cross\-company studies rely primarily on meta\-analytic approaches to integrate heterogeneous findings, which has been pointed out to encourage the abstraction of leadership characteristics into broad meta\-categories, such as task\-oriented and relationship\-oriented leadership\(Yukl2019\)\. While such abstraction facilitates cross\-study comparison, it constrains detailed examination of specific leadership behaviors across firms\.

### 2\.2\.Technical Requirements for Outcome\-Oriented Topic Modeling

In applied settings considered in this study, technical requirements include topic representations that ensure interpretability, specificity and polarity stances as well as an evaluation framework to assess these characteristics\. To use topic modeling for analyzing relationships with external variables and to connect the findings to practical discussions, the extracted topics must be interpretable to humans and meaningfully linked to imaginable and specific behaviors, attributes, or actions\. Moreover, when evaluative orientations—such as positive and negative stances—are mixed within a topic, interpretation of its relationship with external variables becomes difficult\. In such cases, effects on external outcomes may cancel out, making consistency in polarity stance within topics a critical requirement\. For example, even if a topic named "decision making" shows a positive correlation with firm performance, the topic may contain a mixture of positive documents praising rapid decision making and negative documents criticizing slow decision making\. In such cases, opposing polarity stance may offset one another in their associations with external variables\. As a result, even when a relationship between the topic and external variables is detected, it remains difficult to translate the result into actionable practical implications\. Accordingly, in applied settings considered in this study, technical requirements include topic representations that ensure interpretability, specificity and polarity stances as well as an evaluation framework to assess these characteristics\.

### 2\.3\.Limitations of Existing Topic Modelings

Topic modeling aims to extract latent thematic structures embedded in collections of text and to summarize document corpora in a low\-dimensional topic space, and has developed as a representative approach for large\-scale text analysis\. Early studies proposed probabilistic generative models based on the bag\-of\-words assumption, such as Latent Dirichlet Allocation \(LDA\)\(blei2003lda\), which infer topic distributions from word co\-occurrence patterns\. However, when these topic modeling approaches are used as text\-mining technologies, several technical challenges remain\. To address these challenges, some extensions of topic modeling have been proposed\.

Probabilistic models based on the bag\-of\-words assumption, such as the Structural Topic Model \(STM\), have been widely used for estimating relationships between topics and external variables, as they provide a framework well suited for statistical inference that explicitly accounts for uncertainty\(roberts2014stm\)\. The STM was developed to incorporate document\-level covariates into topic models, allowing researchers to statistically estimate how topic prevalence and content vary as a function of external information\. As a result, STM can capture positive and negative nuances within topics while allowing uncertainty\-aware statistical evaluation based on variational posterior inference\. However, because topics are represented as probabilistic distributions over word co\-occurrences, STM cannot directly capture semantic relationships that depend on word order or contextual information\. Consequently, such models may generate incoherent topics\(lau2014interpretable\)\.

To address the challenge of producing semantically intuitive topic representations, interpretation\-oriented approaches such as BERTopic, which leverages document embeddings and distributed representations, and TopicGPT, which extracts topics in natural language using large language models, have been developed\(grootendorst2022bertopic;pham2024topicgpt\)\. However, because these methods are not primarily designed to explicitly separate polarity stance, positive and negative descriptions may coexist within the same topic\.

### 2\.4\.Limitations of Exisiting Evaluation Metrics for Topic Modeling

A wide range of metrics have been used to automatically evaluate topic models, including statistical goodness\-of\-fit measures and topic coherence metrics\. However, recent studies have pointed out that these automatic evaluation metrics do not necessarily align well with human interpretations, leading to renewed scrutiny of how interpretability should be evaluated\.

In particular, topic model evaluation has traditionally relied on automatic metrics that assess interpretability based on word co\-occurrence patterns within topics\. While such metrics offer the advantage of quantitatively assessing lexical coherence, prior studies have raised concerns about the validity of automated coherence metrics in capturing human interpretations\(lau2014interpretable\)\. In contrast, human evaluation of semantic validity and interpretability can achieve high reliability, it is difficult to automate, which limits its applicability to large\-scale datasets or comparisons across many experimental conditions\. Moreover, in analyses examining relationships between topics and external variables, it is important that topics can be meaningfully linked to specific behaviors or attributes and that their polarity orientations are consistent\. To the best of our knowledge, however, no prior studies have explicitly defined these properties as evaluation criteria for topic models\. Taken together, evaluating topic modeling approaches for analyses involving external variables requires not only reliable automation of interpretability assessment but also the incorporation of topic specificity and polarity stance consistency as explicit evaluation criteria\.

## 3\.Objectives and Contributions

This study first develops a topic modeling methodology and an automated evaluation framework for analyzing relationships with external variables, explicitly incorporating not only interpretability but also topic specificity and polarity stance consistency\. This enables topics derived from text data to be interpreted in a more practically meaningful manner\. Furthermore, as an applied demonstration, we examine leadership characteristics that may contribute to firm performance and employee morale using large\-scale employee experience review data\. Through this empirical analysis, we demonstrate the practical usefulness of the proposed approach for social science research, while also providing insights that are practically valuable for companies\.

## 4\.Method

### 4\.1\.Proposed Topic Modeling Framework

#### 4\.1\.1\.Overview of the Proposed Topic Modeling Framework

In this study, we propose a topic modeling approach that simultaneously achieves interpretability as well as topic specificity and polarity stance consistency\. Figure[1](https://arxiv.org/html/2604.18919#S4.F1)provides an overview of the proposed framework\. Specifically, the method proceeds as follows: \(1\) a collection of documents is provided as the initial data; \(2\) initial topics are generated using BERTopic; \(3\) An LLM is used to assign each document to zero or more initial topics, allowing documents to be associated with multiple topics or with none\. \(4\) topics are split by polarity using LLMs; and \(5\) semantically related topics are integrated\.

![Refer to caption](https://arxiv.org/html/2604.18919v1/x1.png)Figure 1\.Overview of the proposed topic modeling framework\.
#### 4\.1\.2\.Initial Topic Modeling Using BERTopic

We apply document embeddings to the input documents to obtain continuous vector representations\. To improve computational efficiency, principal component analysis \(PCA\) is applied to the embedded document vectors to reduce their dimensionality\.

Using the dimension\-reduced document embeddings as input, we generate an initial set of topics with BERTopic\. For each initial topic, we provide an LLM with representative keywords and exemplar documents, and the model outputs a topic name and a topic description\.

#### 4\.1\.3\.Topic Assignment Refinement with LLM

Next, the collection of generated topic names and topic descriptions is provided to an LLM together with all documents, and the model is used to reassign each document to topics\.

BERTopic uses HDBSCAN for topic clustering, which is known to produce a substantial number of outlier documents \(topic−1\-1\)\(grootendorst2022bertopic;McInnes2017;BERTopicOutlierReduction\)\. We therefore use an LLM to reassign these documents to topics instead of discarding them as noise\. Moreover, because a single document may be relevant to multiple topics, topic assignment is performed in a soft\-clustering manner\.

#### 4\.1\.4\.Splitting Topics by polarity with LLM

A randomly sampled set of documents is drawn from each topic and evaluated using an LLM to determine whether the topic contains documents with differing polarity stances; if at least one such document is identified, the topic is split into polarity\-specific topics accordingly\. Subsequently, documents associated with the pre\-split topic are assigned, based on their content, to one of the polarity\-specific topics or to neither of them\.

#### 4\.1\.5\.Topic Integration

After splitting topics by polarity, we integrate semantically similar topics to obtain a more compact set of topic representations\. During this process, to prevent the erroneous merging of topics with opposing polarity stances, we define topic similarity by jointly considering semantic proximity and polarity stance similarity\. First, for each topic, we construct a multi\-view topic representation by taking a weighted average of the embeddings of the topic name, topic description, and the documents assigned to the topic, following prior work on LLM\-guided multi\-view cluster representations\(pattnaik2024improving\)\. The semantic distance between topicsiiandjj, denoted asdi,jmeaningd\_\{i,j\}^\{\\text\{meaning\}\}, is defined as the cosine distance between their topic representation vectors\.

Next, we define the polarity stance similarity between topics, denoted assi,jstances\_\{i,j\}^\{\\text\{stance\}\}, which quantifies the degree to which the polarity stances of topicsiiandjjare aligned\. This score is automatically evaluated using the G\-Eval framework, which employs an LLM as an evaluator and takes the topic name and topic description as input\(liu2023geval\)\. Larger values ofsi,jstances\_\{i,j\}^\{\\text\{stance\}\}indicate greater similarity in polarity stance between the two topics\. Polarity stance information is intended solely as secondary information for determining whether topics should be integrated\. It is therefore important to avoid cases in which topics with differing polarity stances are nevertheless considered similar solely because they are semantically related\. To this end, we introduce a semantic distance thresholdτmeaning\\tau^\{\\text\{meaning\}\}and incorporate stance information into the distance measure only for topic pairs that are sufficiently close in semantic space\. We define the Heaviside functionHi,jH\_\{i,j\}as

Hi,j=\{1ifdi,jmeaning≤τmeaning,0otherwise\.H\_\{i,j\}=\\begin\{cases\}1&\\text\{if \}d\_\{i,j\}^\{\\text\{meaning\}\}\\leq\\tau^\{\\text\{meaning\}\},\\\\ 0&\\text\{otherwise\}\.\\end\{cases\}Using this function, the overall distance between topicsiiandjjis defined as

di,j=di,jmeaning−\(1−si,jstance\)Hi,j\.d\_\{i,j\}=d\_\{i,j\}^\{\\text\{meaning\}\}\-\\left\(1\-s\_\{i,j\}^\{\\text\{stance\}\}\\right\)H\_\{i,j\}\.As a result, topic pairs are ordered by distance, from semantically close and stance\-aligned pairs, through semantically close but stance\-misaligned pairs, to semantically distant pairs\.

All prompts used in this study are provided in Appendix[C](https://arxiv.org/html/2604.18919#A3)\.

### 4\.2\.Proposed Evaluation Metrics

In this study, we introduce additional evaluation metrics to complement conventional measures used in topic model evaluation, such as coherence and topic diversity metrics based on word co\-occurrence and frequency statistics\. Specifically, we incorporate metrics for topic label alignment and semantic–based topic diversity\. More importantly, we introduce evaluation metrics that focus on topic specificity and polarity stance consistency, which constitute the core criteria of the proposed evaluation framework\.

#### 4\.2\.1\.Topic Label Alignment

Topic Label Alignment is a metric that evaluates the semantic alignment between the content of a topic label and the content of the document collection assigned to that topic\. As discussed above, coherence metrics based on lexical consistency have been shown to be insufficient for fully capturing topic interpretability\(lau2014interpretable\)\. Doogan et al\.\(doogan\-2021\-topic\)further demonstrate that such metrics do not necessarily align with human evaluations, and argue that interpretability should instead be evaluated in terms of how descriptively a "Topic Word\-set" represents its corresponding "Topic Document\-collection"\. In this study, we operationalize this notion of alignment between topic\-representative information and topic documents by automatically evaluating, using an LLM, the semantic alignment between a topic name, a topic description, and the documents assigned to that topic\.

#### 4\.2\.2\.Semantic\-based Topic Diversity

We define Semantic\-based Topic Diversity as a metric for evaluating the overall semantic diversity of a set of topics\. Specifically, for all pairs of topics, an LLM is used to automatically assess their semantic similarity based on the topic names and topic descriptions\. Topic pairs with similarity scores of 9 or higher on a 10\-point scale are regarded as semantically equivalent\. Based on these semantically equivalent topic pairs, topics are grouped into clusters of semantically similar topics, while topics with no identified similarities form singleton clusters\. Finally, the Semantic\-Based Topic Diversity score is computed as the ratio of the number of unique topic clusters to the total number of topics\.

#### 4\.2\.3\.Specificity

We define Specificity as an evaluation metric that assesses whether a topic is expressed in a manner that enables readers to clearly imagine a specific situation or action\. This metric is informed by prior discussions ofimaginabilityin Altarriba et al\.\(altarriba1999imaginability\)and the notion ofspecific situationsproposed by Mischel\(mischel1968personality\)\. Based on these perspectives, we define topic specificity along two aspects\. First, we assess whether the topic representation allows readers to clearly envision the relevant actor and the specific state or behavioral change involved\. For example, "a supervisor interrupting subordinates during weekly meetings" represents a high level of specificity, whereas "problems in organizational communication" is insufficiently specific\. Second, we assess whether the topic describes a narrowly defined situation, condition, or phenomenon\. For instance, "workflow delays caused by slow approval processes" exhibits higher specificity than a general description such as "declining work efficiency\." Based on these two aspects, we use an LLM to automatically evaluate the extent to which a topic name and topic description are expressed with sufficient specificity\.

#### 4\.2\.4\.Polarity Stance Consistency

We define Polarity Stance Consistency as an evaluation metric that assesses whether a single topic gives rise to mutually opposing polarity interpretations\. Specifically, this metric reflects whether a topic name and description are expressed in an ambiguous way that can simultaneously refer to states with opposing polarity, such as presence versus absence, high versus low degrees, or strong versus weak levels\. For example, expressions such as "rapid decision making" exhibit clear polarity stance consistency, whereas more ambiguous topic names such as "decision making" may simultaneously imply both positive and negative stance\. In this study, polarity stance consistency is automatically evaluated using an LLM based on the topic name and topic description\.

## 5\.Experiments

### 5\.1\.Input Data and Dataset Construction

This study used two types of data are required: employee experience reviews that contains corporate leader characteristics, and firm\-level financial indicators used to measure organizational outcomes\.

First, we analyzed employee experience review data posted on OpenWork, one of the largest corporate review platforms in Japan\(openwork\)\. OpenWork contains freely written reviews submitted by current and former employees\. The reviews provide rich accounts of organizational realities, including characteristics of leaders such as CEOs and managers\. In addition to the textual content, each review includes five\-point employee ratings \(e\.g\., employee morale\) as well as associated metadata, such as company identifiers, posting dates, and contributor attributes\. Reviews posted between 2017 and 2024 were included in the analysis\. According to OpenWork’s privacy policy, users consent to the provision of their data for third\-party use after appropriate processing and for academic research purposes\.The data used in this study were anonymized by OpenWork prior to being provided to the authors\. From each review text, we used an LLM to extract multiple passages referring to leaders, allowing multiple extractions per review\. At the same time, the extracted passages were classified along two dimensions: \(i\) leader type, distinguishing between top executives \(e\.g\., CEOs\) and non\-top leaders below the CEO level, and \(ii\) leader characteristics, categorizing each passage as referring to behaviors, attitudes, or abilities\.

Second, we obtained corporate financial data through Japan’s Electronic Disclosure for Investors’ NETwork \(EDINET\), a disclosure system provided by Japan’s Financial Services Agency\(edinet\)\. In this study, we used return on assets \(ROA\) as an external outcome variable representing firm performance\. ROA was calculated based on annual securities reports disclosed on EDINET, defined as net income divided by total assets\.

Based on these steps, final set of input documents was constructed from reviews posted for 1356 Japanese publicly listed firms\. Table[1](https://arxiv.org/html/2604.18919#S5.T1)reports the document counts of the extracted leadership\-related texts across the six groups\. To ensure consistent temporal coverage between the review data and corporate financial data, the reviews were restricted to those posted for Japanese publicly listed firms for which financial statement data can be obtained for all fiscal years from 2017 to 2024\. The LLM\-based extraction was evaluated along category alignment for leader type and leader characteristic, and semantic consistency between the original and extracted texts, and was found to be largely valid \(see Appendix\)\.

Table 1\.Counts of extracted leadership\-related documents
### 5\.2\.Application of the Proposed Topic Modeling Method

In this analysis, we applied the proposed topic modeling approach under the following settings\. For Step \(1\),Input Documents, each of the six text groups was treated as a separate set of input documents for the topic modeling \(Table[1](https://arxiv.org/html/2604.18919#S5.T1)\)\. For Step \(2\),Initial Topic Modeling Using BERTopic, we generated document embeddings for the leader\-related text corpus described above usingtext\-embedding\-3\-largeprovided by OpenAI, resulting in 3,072\-dimensional vector representations for each document\. To improve computational efficiency, PCA was applied to the embedding vectors, reducing their dimensionality to 450 while preserving 90% of the explained variance\. Using the dimension\-reduced document embeddings as input, we performed clustering with HDBSCAN to generate an initial set of topics\. In this process, the minimum cluster size was set to 100, as each topic is required to contain a sufficient number of samples to allow for subsequent aggregation of topic frequencies at the firm\-year level and for outcome regression analyses\. All other hyperparameters were set to the default values of the bertopic library\.\(grootendorst2022bertopic\)For Step \(3\),Topic Assignment Refinement with LLM, we extracted top\-10 most probable words, and 30 representative documents for each topic using Maximal Marginal Relevance \(MMR;\(Carbonell1998\)\) withλ=0\.5\\lambda=0\.5\. These top\-10 words and representative documents were provided as input to an LLM to generate topic names and a topic descriptions\. For Step \(4\),Spliting Topics by Polarity with LLM, we randomly sampled 50 documents from each topic and used an LLM to assess whether the topic contained descriptions with differing polarity stances, and topics exhibiting such polarity differences were then split accordingly\. For Step \(5\),Topic Integration, polarity stance similarity between topics was computed using the G\-Eval framework, and topic integration was performed based on the resulting overall distance measure\. The threshold for semantic distanceτmeaning\\tau^\{\\text\{meaning\}\}was set to the bottom 1% of the empirical distribution, and this threshold was used to determine whether stance\-based adjustment should be applied when constructing the overall distance measure\. Then, using the resulting topic–topic distance matrix, hierarchical clustering based on Ward’s method was applied to merge topics\.\(Ward1963;Johnson1967\)\. After topic integration, topic names and topic descriptions were regenerated using the same LLM\-based naming procedure employed in Step \(4\)\. When selecting representative documents for this renaming process, we setλ=0\\lambda=0in MMR and sampled 30 documents to ensure that the resulting topic name and description comprehensively capture the full range of documents assigned to each integrated topic\. Throughout the entire pipeline, we used GPT\-4\.1\-mini as the LLM\.

### 5\.3\.Results: Evaluation of Topic Models

#### 5\.3\.1\.Validation of the Proposed Evaluation Metrics

To verify whether the evaluation metrics proposed in Section 4\.2 provide judgments aligned with human assessments, we examined the degree of agreement between LLM\-based automatic evaluations and human ratings\. Specifically, for Topic Label Alignment, Specificity, Polarity Stance Consistency, and Semantic\-based Topic Diversity, we computed the intraclass correlation coefficient, ICC\(2,2\) between LLM\-based ratings and human ratings\. Since GPT\-4\.1\-mini was used in the topic modeling stage, we employed Gemini\-2\.5\-Flash for LLM\-based automatic evaluation in order to avoid potential bias and ensure fairness\. Both the LLM and human annotators were provided with identical evaluation criteria, and each metric was assessed on 50 test cases\. The resulting agreement scores are reported in Table[2](https://arxiv.org/html/2604.18919#S5.T2)\. The results indicate that Polarity Stance Consistency, Topic Label Alignment, and semantic\-based Topic Diversity exhibit high levels of agreement between LLM and human evaluations, suggesting that these metrics are sufficiently reliable\. Although Specificity shows relatively lower agreement compared to the other three metrics, its reliability remains within an acceptable range for practical use\.

Table 2\.Inter\-rater reliability between LLM and human evaluatorsNotes:Inter\-rater reliability was assessed using ICC\(2,2\)\. Values in brackets indicate 95% confidence intervals\.

#### 5\.3\.2\.benchmarking proposed topic model

In this section, we conducted a benchmark evaluation of the proposed method\. As evaluation metrics, we employed Coherence \(NPMI\) and Bag\-of\-Words–based Topic Diversity, which have been conventionally used in topic modeling research, together with the previously introduced metrics: Topic Label Alignment, semantic\-based topic diversity, specificity, and polarity stance consistency\. Finally, we also assessed the extent to which the proposed method improves the explanatory power with respect to external outcome variables\.

#### 5\.3\.3\.Competitors

We compared the effectiveness of our method with the following baselines\. As a bag\-of\-words–based method, we used Non\-negative Matrix Factorization \(NMF\)\. As an embedding\-based method, we used BERTopic, corresponding to the pipeline up to Step \(2\) in Figure[1](https://arxiv.org/html/2604.18919#S4.F1)\. In addition, we included an LLM\-based method that applies topic relabeling, corresponding to the pipeline up to Step \(3\) in Figure[1](https://arxiv.org/html/2604.18919#S4.F1)\. For NMF, we used the default parameter settings\. For the other methods, we adopted the same experimental settings as described above\.

#### 5\.3\.4\.Evaluation metric settings

For topic coherence, we adopted Normalized Pointwise Mutual Information \(NPMI\)\. For both NPMI and topic diversity, the topic representations were constructed using the top 10 words for each topic\. For LLM\-based automatic evaluation, we used Gemini\-2\.5\-flash\.

#### 5\.3\.5\.Topic coherence evaluation

Across both metrics, the proposed method generally achieved higher Topic Label Alignment and Coherence than the benchmark models \(Table[3](https://arxiv.org/html/2604.18919#S5.T3)\)\. These results suggest that the generated topics are not only lexically coherent, but that the corresponding topic names and descriptions also descriptively represent the associated document collections\.

Table 3\.Topic Coherence Evaluation ResultsCommon notes for Tables 3–6:This table reports results by model and topic granularity\. Type\. denotes leader type\. Char\. denotes leader characteristic\. The number of topics prior to integration was divided into five equally sized bins\. The first, third, and fifth bins were selected as representative topic counts and correspond to Tmin, Tmid, and Tmax, respectively, with Tmax representing the original \(non\-integrated\) number of topics\. Scores are reported for Tmin, Tmid, and Tmax\. "–" indicates that the corresponding value is not available\. The largest values in each metric and settings arebolded\.

#### 5\.3\.6\.Topic diversity evaluation

While our method underperformed NMF and BERTopic in terms of Bag\-of\-Words–based Topic Diversity, it achieved comparable performance in Semantic\-based Topic Diversity \(Table[4](https://arxiv.org/html/2604.18919#S5.T4)\)\. This suggests that our method is capable of capturing subtle semantic distinctions between topics, even when they share similar high\-frequency words\.

#### 5\.3\.7\.Specificity and Polarity Stance Consistency evaluation

The proposed method consistently achieved higher Specificity than NMF and BERTopic, and slightly higher Specificity than Relabeling with LLM \(Table[5](https://arxiv.org/html/2604.18919#S5.T5)\)\. Moreover, the proposed method consistently attained the highest Polarity Stance Consistency across all comparison models\.

Table 4\.Topic Diversity Evaluation ResultsTable 5\.Specificity and Polarity Stance Consistency Evaluation Results

### 5\.4\.Aggregation of Topic Frequencies

To analyze the relationship between the extracted topics and outcome variables in subsequent sections, we aggregated topic frequencies at the firm\-year level for the final set of topics\. In this study, the topic frequency is defined, for each firm\-year combination, as the proportion of posts associated with a given topic relative to the total number of posts\.

### 5\.5\.Results: Benchmarking the Explanatory Power of Leadership Topics for Outcomes

To assess whether the topics constructed by the proposed method are more effective for outcome\-related analyses than those generated by conventional methods, we benchmarked their explanatory power for external outcomes in terms of external outcomes\. Specifically, we compared the explanatory power of topic frequencies across different topic modeling approaches for ROA and employee morale\. For all topic modeling approaches, topic frequencies were aggregated using the procedure described above\. We modeled ROA and employee morale according to Equation[1](https://arxiv.org/html/2604.18919#S5.E1)and computed the explanatory power of each topic modeling approach\. To control for firm size, we included the logarithm of the number of employees as a control variable\. In addition, to account for the influence of year\-specific and industry\-specific macro factors, the dependent variables for ROA and employee morale were defined as deviations from their respective year\-industry means\. The model is given by

\(1\)Yi,t=∑kβkfi,t,k\+αlog⁡\(Employeesi,t\)\+εi,t,Y\_\{i,t\}=\\sum\_\{k\}\\beta\_\{k\}f\_\{i,t,k\}\+\\alpha\\log\(\\text\{Employees\}\_\{i,t\}\)\+\\varepsilon\_\{i,t\},whereYi,tY\_\{i,t\}denoted the outcome variable for firmiiin yeartt, defined as the deviation of ROA or employee morale from the corresponding year–industry average, andfi,t,kf\_\{i,t,k\}represented the frequency of posts assigned to topickk\. The regression model was estimated separately for each topic using subsamples of firms that meet minimum thresholds for the annual number of 10 posts, resulting in a sample of 373 firms\. The model was estimated using Elastic Net regression\(ZouHastie2005\)\. To quantify explanatory power, we compared a full model that included all topic frequencies and control variables with a baseline model containing only the control variables, and computed the incremental contribution of topic frequencies\. This incremental explanatory power was measured using the partial coefficient of determination \(partialR2R^\{2\}\)\.

Table 6\.Explanation Power for Outcomes MeasuresTable[6](https://arxiv.org/html/2604.18919#S5.T6)shows that the proposed method achieved consistently higher explanatory power for employee morale compared with the baseline models\. By contrast, for ROA, the proposed method didn’t exhibit a consistent advantage across datasets\. This may be because leadership\-related topics exhibit weaker contemporaneous associations with ROA than with employee morale\. In particular, changes in leadership\-related perceptions may not be immediately reflected in financial indicators and can affect firm performance with a time lag\. As a result, the corresponding effect sizes for ROA may be relatively small, which can make differences across topic modeling methods more difficult to detect\.

### 5\.6\.Results: Relationship Between Topics and External Outcomes

To examine the relationship between leadership characteristics and firm performance as well as employee morale, we estimated regression models relating topic frequencies to ROA and employee morale\. As discussed, analyses of explanatory power indicate that, in this dataset, models without topic integration generally achieved higher explanatory performance for both ROA and employee morale\. Accordingly, the outcome analyses in this section were conducted using topic frequencies without applying topic integration\. To identify which types of leadership characteristics are associated with higher ROA and employee morale, we conducted analyses relationship between each topic frequency and outcomes\. for firmii, yeartt, and topickk, we estimated the following regression model:

\(2\)Yi,t=βkfi,t,k\+αlog⁡\(Employeesi,t\)\+εi,t,Y\_\{i,t\}=\\beta\_\{k\}f\_\{i,t,k\}\+\\alpha\\log\(\\text\{Employees\}\_\{i,t\}\)\+\\varepsilon\_\{i,t\},whereYi,tY\_\{i,t\}denoted the outcome variable for firmiiin yeartt, defined as the deviation of ROA or employee morale from the corresponding year\-industry average\. The variablefi,t,kf\_\{i,t,k\}represents the frequency of posts associated with topickkfor firmiiin yeartt, andlog⁡\(Employeesi,t\)\\log\(\\text\{Employees\}\_\{i,t\}\)denotes the logarithm of the number of employees\. The coefficientβk\\beta\_\{k\}captures the association between topic frequency and the outcome,α\\alpharepresents the effect of firm size, andεi,t\\varepsilon\_\{i,t\}is the error term\. The regression model was estimated separately for each topic using subsamples of firms that meet minimum thresholds for the annual number of posts \(5, 10, and 15 posts per firm–year\), resulting in sample sizes of 628, 373, and 258 firms, respectively\. Table[19](https://arxiv.org/html/2604.18919#A4.T19)reports only topics that are robust, defined as being statistically significant at the 5% level across at least two threshold specifications, and based on at least 100 posts assigned to the topic\. For all but one topic—namely, "lack of managerial capability" \(10 out of 11 topics in Table[19](https://arxiv.org/html/2604.18919#A4.T19)\), the results are broadly consistent with prior leadership research reporting associations between leadership characteristics and firm performance or employee psychological outcomes \(e\.g\., seeDeRue2011;Judge2004;Montano2017\), while providing more specific behavioral interpretations\.

Table 7\.Relationship between leadership topics and firm outcomesNotes:For each entry, the reported value is the estimated coefficient, with the standard error shown in parentheses\.∗,∗∗, and∗∗∗indicate statistical significance at the 10%, 5%, and 1% levels, respectively\. Reported coefficients correspond to the reviewer\-count threshold of 10 posts per firm\-year\.NNdenotes the number of posts assigned to each topic\. The column "Sig\. thr\. \(5%\)" reports the reviewer\-count thresholds \(5, 10, 15\) at which the coefficient is statistically significant at the 5% level\. Topics are included only if they are statistically significant at the 5% level for both outcomes in at least 2 different thresholds\. Topic names may be manually renamed based on the topic descriptions for interpretability\.

For example, it is shown that executive extraversion is positively associated with firm performance\(Judge2004\)\. Consistent with this finding, the positive associations identified in this study between the leader behavior topic "active dialogue and communication" and both firm performance \(ROA\) and employee morale not only support prior evidence but also provide complementary insights by illustrating how an abstract personality trait such as extraversion manifests as specific leadership behaviors within organizations that are linked to performance outcomes\. In contrast, the topic "lack of managerial capability" yields results that are not necessarily consistent with prior studies \(see, e\.g\.,BloomVanReenen2007;BloomSadunVanReenen2016MAT\)\. However, since the present analysis is based on observational data and is limited to examining correlations rather than estimating causal relationships, these findings can be interpreted as reflecting the possibility of reverse causality\.

## 6\.Conclusion

This study demonstrated that it is possible to construct topic representations that simultaneously satisfy interpretability, specificity, and polarity stance consistency, and that the resulting topics can explain external outcomes as well as or better than those generated by conventional methods\. These findings indicate that designing topics with interpretability, specificity, and polarity stance consistency helps prevent the dilution of relationships between topics and outcomes, thereby enabling the derivation of more actionable insights\.

In addition to the methodological contribution described above, this study further contributes by proposing evaluation metrics for specificity and polarity stance consistency, as well as by operationalizing automatic evaluation methods for Specificity, Polarity Stance Consistency, Topic Label Alignment, and Semantic\-Based Topic Diversity\.

From the perspective of leadership research that motivated this study, these findings lead to the following three points\. First, core findings from prior leadership research largely accumulated in Western cultural contexts were broadly replicated in Japanese firms\. For example, leadership behaviors typically associated with transactional leadership, such as "proactive project execution and task management", were identified as beneficial\. Second, by adopting a data\-driven analysis of employee experience review data, this study presents patterns of leadership without being constrained by any theoretical frameworks\. Third, whereas prior cross\-firm studies—often centered on meta\-analyses—have tended to treat leadership characteristics as abstract meta\-categories, the proposed approach enables these characteristics to be decomposed into specific behaviors, attitudes, and abilities at the level of individual topics\. Taken together, the proposed topic modeling method and evaluation framework are not limited to leadership research and are applicable to a wide range of domains that analyze relationships between topics and external outcomes\.

## 7\.Limitations and Future Work

First, as the analysis in this study is primarily based on Japanese data, the generalizability of the findings is limited, and it is unclear whether similar patterns hold in other cultural contexts\. Future work should incorporate multilingual and multicultural review data from other countries to enable cross\-cultural comparisons\.

Second, methodological refinement remains possible in linking topic representations to external outcomes\. While this study adopts deterministic topic assignment, future work could employ probabilistic frameworks, such as STM, to represent topic–document associations as continuous values and enable finer\-grained analyses\.

Third, the evaluation of robustness is limited\. Due to constraints in computational cost and execution time, validation was limited to a small set of model configurations and datasets\. Future research should conduct more comprehensive evaluations across a wider range of LLMs and experimental conditions\.

## References

## Appendix AHuman Evaluation of Extraction Precision

Table 8\.Precision of leadership\-related document extraction evaluated by human, by categoryNotes:Precision values are based on human evaluation of LLM\-extracted documents along two dimensions: \(1\)Pre/Post Semantic Consistency, which measures semantic consistency between the original review text and the extracted passages before and after processing, and \(2\)Category Alignment, which measures the alignment between the assigned categories and human judgments\.

## Appendix BLibraries and Parameter Settings for the Proposed Topic Modeling

This appendix lists the main Python libraries used in this study, along with their official URLs\.

- •
- •
- •
- •
- •
- •

## Appendix CPrompts Used in this Study

Table 9\.Prompt used for extracting leadership documents \(see Section[5\.1](https://arxiv.org/html/2604.18919#S5.SS1)\)English Prompt \(Translation\)Japanese Prompt \(Original\)\# Task
From the given text, extract instances of the specified extraction target at the minimum granularity, and classify them according to the provided classification guidelines\.
\# Requirements
\- If multiple instances of the extraction target are present, extract all of them\.
\- If no instances of the extraction target are found, return an empty list\.
\- Avoid speculative or overly interpretive reasoning, and base the extraction strictly on information explicitly stated in the text\.
\# Metadata for the given text
\{input\_text\_metadata\}
\# Given text
\{input\_text\}
\# Extraction target
\{extraction\_target\}
\# Supplementary definition of the extraction target
\{extraction\_target\_supplement\}
\# Classification guidelines
\{classification\_guideline\}
\# Output format
Output the results in JSON using the following schema\.
\{output\_json\_schema\}\# タスク
与えられた文章から\#抽出対象を最小粒度で抽出し、\#分類仕様に従って分類してください。
\# 要件
・抽出対象の記述が複数ある場合は、全て抽出してください。
・抽出対象の記述がない場合は、空リストを返してください。
・飛躍した解釈や過度な推測を避け、文章に明確に記載されている内容に基づいて抽出してください。
\# 与えられた文章に関するメタ情報
\{input\_text\_metadata\}
\# 与えられた文章
\{input\_text\}
\# 抽出対象
\{extraction\_target\}
\# 抽出対象の補足定義
\{extraction\_target\_supplement\}
\# 分類仕様
\{classification\_guideline\}
\# 出力形式
以下の形式でJSONを出力してください。
\{output\_json\_schema\}
Notes:\{extraction\_target\} denotes leader\-related attributes that are explicitly described in the employee experienced review text \(not wishes/ideals\) and attributable to an individual leader \(not general employees, policies, or organizational features; passive statements are excluded\)\. \{extraction\_target\_supplement\} provides extraction constraints \(no speculative inference; preserve meaning; split into minimal concise units\)\. \{classification\_guideline\} specifies labels fortarget\_leader\_layer\(top/non\_top/unknown\) andelement\_type\(behavior/attitude/ability/other\); useunknown/otherwhen evidence is insufficient\. \{input\_text\_metadata\} includes the firm name and review category and notes that the reviewer is not necessarily a leader\. The model outputs JSON following \{output\_json\_schema\}, optionally setting flags such asimplicit\_extraction,change\_meaning, andis\_past\.

Table 10\.Prompt used in Step 2 of Figure[1](https://arxiv.org/html/2604.18919#S4.F1)English Prompt \(Translation\)Japanese Prompt \(Original\)\# Task
For a single topic generated by the topic model, determine an appropriate topic name by referring to the topic’s top words and representative documents\.
\# Requirements
\- In addition to the topic name \(topic\_name\), provide a short description of the topic \(topic\_short\_description\)\.
\- Output the result in JSON format\.
\# Supplementary naming guidelines
\- The topic name should be a noun phrase\.
\- The topic name should be concise; avoid redundant expressions such as ‘‘A and B’’ or ‘‘A and B\-related,’’ and keep the number of words to a minimum\.
\- The topic name should comprehensively reflect the content of the representative documents\.
\- The topic name should be as specific as possible\.
\- The topic name should be consistent with the overall context inferred from the metadata of the topic modeling corpus\.
\- The topic description should consist of approximately one sentence and serve as a supplementary explanation of the topic name\.
\# Metadata of the topic modeling corpus
\{document\_metadata\}
\# Top words of the topic
\{topic\_top\_words\}
\# Representative documents of the topic
\{topic\_representative\_documents\}
\# Output schema
topic\_name: string
topic\_short\_description: string
\# Output format
\{
"topic\_name": "Topic name",
"topic\_short\_description": "Short description of the topic"
\}\# タスク
トピックモデルによって作成された1つのトピックについて、\#トピックの上位単語および\#トピックの代表文章を参考に、適切なトピック名を決定してください。
\# 要件
・トピック名（topic\_name）に加えて、トピック名についての短い説明（topic\_short\_description）を付与してください。
・JSON形式で出力してください。
\# 命名規則の補足定義
・トピック名は名詞句としてください。
・トピック名は簡潔な表現とし、「AとB」「AおよびBに関する〜」のような冗長な表現は避け、単語数はできるだけ少なくしてください。
・トピック名は、\#トピックの代表文章の内容を網羅する表現としてください。
・トピック名は、可能な限り具体的な表現としてください。
・トピック名は、\#トピックモデリング対象の文章全体のメタ情報から読み取れる文脈やニュアンスに沿った表現としてください。
・トピック説明は1文程度とし、トピック名の補足説明となる内容としてください。
\# トピックモデリング対象の文章全体のメタ情報
\{document\_metadata\}
\# トピックの上位単語
\{topic\_top\_words\}
\# トピックの代表文章
\{topic\_representative\_documents\}
\# 出力型
topic\_name: str
topic\_short\_description: str
\# 出力形式
\{
"topic\_name": "トピック名",
"topic\_short\_description": "トピック名についての短い説明"
\}Table 11\.Prompt used in Step 3 of Figure[1](https://arxiv.org/html/2604.18919#S4.F1)English Prompt \(Translation\)Japanese Prompt \(Original\)\# Task
Given the input text, select the topic or topics from the list of candidate topics to which the text corresponds\.
\# Requirements
\- Judge whether the text corresponds to each topic by considering both the topic name and its description\.
\- If the text corresponds to multiple topics, select all applicable topics\.
\- For each selected topic, output the topic ID \(topic\_id\), topic name \(topic\_name\), and the reason for selection \(reason\)\.
\- Output the results in JSON format\.
\# Supplementary guidelines for topic assignment
\- Avoid judgments based on speculative or inferential reasoning\.
\- Select only topics that clearly apply to the text; apply a conservative judgment criterion\.
\- Make the judgment in accordance with the nuance inferred from the metadata of the input text\.
\# Notes on output
\- If the text does not correspond to any topic, set the topic ID to \-1 and the topic name to Other\.
\- Except for Other, do not output topics that are not included in the candidate topic list\.
\# Metadata of the text
\{document\_metadata\}
\# Candidate topics \(legend: topic\_id, topic name \(topic description\)\)
\{topic\_definitions\}
\# Input text
\{input\_text\}
\# Output format
\{
"topic\_list": \[
\{
"topic\_id": int,
"topic\_name": str,
"reason": str
\},
…
\]
\}\# タスク
与えられた文章が、候補となるトピックのうちどのトピックに該当するかを選出してください。
\# 要件
・トピック名およびトピックの説明の両方を確認したうえで、文章がトピックに該当するかを判断してください。
・複数のトピックに該当する場合は、複数選択してください。
・選択したトピックについて、トピックID（topic\_id）、トピック名（topic\_name）、および選択理由（reason）を出力してください。
・JSON形式で出力してください。
\# トピック判定における補足定義
・飛躍した推測による判断は避けてください。
・明確に該当すると判断できるトピックのみを選出してください（厳しめの判断基準としてください）。
・文章のメタ情報を踏まえたニュアンスに沿って判断してください。
\# 出力に関する注意点
・どのトピックにも該当しない場合は、トピックIDは \-1、トピック名は その他 を選択してください。
・その他 を除き、候補となるトピックに記載されていないトピックは出力しないでください。
\# 文章のメタ情報
\{document\_metadata\}
\# 候補となるトピック（凡例：トピックID，トピック名（トピック説明））
\{topic\_definitions\}
\# 文章
\{input\_text\}
\# 出力形式
\{
"topic\_list": \[
\{
"topic\_id": int,
"topic\_name": str,
"reason": str
\},
…
\]
\}Table 12\.Prompt used for splitting topics by polarity in Step 4 of Figure[1](https://arxiv.org/html/2604.18919#S4.F1)English Prompt \(Translation\)Japanese Prompt \(Original\)\# Task
For a single topic generated by the topic model, determine whether documents with opposing stances are mixed within the topic\. If opposing stances are present, split the topic accordingly\.
\# Definition of terms
\- ‘‘Opposing stances’’ refer to cases in which documents classified under the same topic convey conflicting meanings\.
\- Examples of opposing stances include contrasts such as ‘‘present vs\. absent,’’ ‘‘many vs\. few,’’ and ‘‘strong vs\. weak\.’’
\# Requirements
\- Output the judgment result indicating whether opposing stances are present \(contain\_opposing\_stance\)\.
\- If topic splitting is required, output a list of child topics \(child\_topics\)\.
\- Each element of child\_topics should be a dictionary containing the child topic name \(child\_topic\_name\), a short description \(child\_topic\_short\_description\), up to three example documents \(document\_examples\), and the reason for interpreting the stance as opposing \(opposing\_stance\_reason\)\.
\- If splitting is not required, output an empty list for child\_topics\.
\- Output the results in JSON format\.
\# Supplementary guidelines for judgment
\- Topic information is provided in the topic name, description, and the set of documents assigned to the topic\.
\- Judge opposing stances only when documents can be clearly interpreted as conveying conflicting stances\.
\# Supplementary guidelines for naming child topics
\- Child topic names should be noun phrases\.
\- Child topic names should be concise; avoid redundant expressions such as ‘‘A and B\.’’
\- Child topic names should be as specific as possible\.
\- Child topic names should be interpretable on their own without reference to the parent topic\.
\- Child topic names should reflect the overall context inferred from the metadata of the topic modeling corpus\.
\# Topic name and description
Topic name: \{topic\_name\}
Topic description: \{topic\_short\_description\}
\# Metadata of the topic modeling corpus
\{document\_metadata\}
\# Documents assigned to the topic
\{topic\_documents\}
\# Output format \(if opposing stances are present\)
\{
"contain\_opposing\_stance": true,
"child\_topics": \[
\{
"child\_topic\_name": "Child topic name 1",
"child\_topic\_short\_description": "Description of child topic 1",
"document\_examples": "Up to three example documents",
"opposing\_stance\_reason": "Reason for interpreting the stance as opposing"
\},
\{
"child\_topic\_name": "Child topic name 2",
"child\_topic\_short\_description": "Description of child topic 2",
"document\_examples": "Up to three example documents",
"opposing\_stance\_reason": "Reason for interpreting the stance as opposing"
\}
\]
\}
\# Output format \(if no opposing stances are present\)
\{
"contain\_opposing\_stance": false,
"child\_topics": \[\]
\}\# タスク
トピックモデルによって作成された1つのトピックに、スタンスが対立する文章が混在しているかどうかを判定してください。混在している場合は、そのトピックを分割してください。
\# 用語の定義
・スタンスが対立するとは、同一トピックに分類されるものの、対立する意味合いを持つ文章が含まれている場合を指します。
・スタンスが対立する例として、「有る vs 無い」「多い vs 少ない」「強い vs 弱い」などがあります。
\# 要件
・スタンスが対立する文章が混在しているかの判定結果（contain\_opposing\_stance）を出力してください。
・トピックの分割が必要な場合は、分割された子トピック（child\_topics）のリストを出力してください。
・child\_topics の各要素には、子トピック名（child\_topic\_name）、子トピックの説明（child\_topic\_short\_description）、該当する文章例（最大3件、document\_examples）、およびスタンスが対立すると解釈した理由（opposing\_stance\_reason）を含めてください。
・分割が不要な場合は、child\_topics は空リストとしてください。
・JSON形式で出力してください。
\# 判定における補足定義
・該当トピックの情報は、トピック名、トピック説明、およびトピックに該当する文章群に記載されています。
・明らかにスタンスが対立していると解釈できる文章が混在している場合のみ、スタンスが対立すると判定してください。
\# 子トピック命名規則の補足定義
・名詞句としてください。
・簡潔な表現とし、「AとB」「AおよびBに関する〜」のような冗長な表現は避け、単語数はできるだけ少なくしてください。
・可能な限り具体的な表現としてください。
・親トピック名がなくても、子トピック名のみで意味が明確に解釈できる表現としてください。
・トピックモデリング対象の文章全体のメタ情報を考慮した表現としてください。
\# トピック名と説明
トピック名: \{topic\_name\}
トピック説明: \{topic\_short\_description\}
\# トピックモデリング対象の文章全体のメタ情報
\{document\_metadata\}
\# トピックに該当する文章群
\{topic\_documents\}
\# 出力形式（スタンスが対立する文章が含まれている場合）
\{
"contain\_opposing\_stance": true,
"child\_topics": \[
\{
"child\_topic\_name": "子トピック名1",
"child\_topic\_short\_description": "子トピック名1の説明",
"document\_examples": "子トピック名1の文章例（最大3件）",
"opposing\_stance\_reason": "子トピック2とスタンスが対立すると解釈した理由"
\},
\{
"child\_topic\_name": "子トピック名2",
"child\_topic\_short\_description": "子トピック名2の説明",
"document\_examples": "子トピック名2の文章例（最大3件）",
"opposing\_stance\_reason": "子トピック1とスタンスが対立すると解釈した理由"
\}
\]
\}
\# 出力形式（スタンスが対立する文章が含まれていない場合）
\{
"contain\_opposing\_stance": false,
"child\_topics": \[\]
\}Table 13\.Prompt used for assigning documents to polarity\-specific topics in Step 4 of Figure[1](https://arxiv.org/html/2604.18919#S4.F1)English Prompt \(Translation\)Japanese Prompt \(Original\)\# Task
Determine which of the candidate child topics best matches the input text that has been classified under the parent topic\.
\# Requirements
\- Always output the topic ID \(topic\_id\), topic name \(topic\_name\), and the reason for the decision \(reason\)\.
\- If the text does not match any of the candidate child topics, select Other\.
\- Output the result in JSON format\.
\# Supplementary guidelines for judgment
\- Make the judgment by taking into account the metadata of the input text\.
\# Parent topic
\{parent\_topic\}
\# Input text
\{input\_text\}
\# Metadata of the text
\{document\_metadata\}
\# Candidate child topics \(legend: topic\_id, topic name \(topic description\)\)
\{child\_topic\_definition\}
\# Output format
\{
"topic\_id": int,
"topic\_name": str,
"reason": str
\}\# タスク
親トピックに分類されている文章が、子トピック候補のいずれに一致するかを判断してください。
\# 要件
・トピックID（topic\_id）、トピック名（topic\_name）、および判断理由（reason）を必ず出力してください。
・いずれの子トピック候補とも一致しない場合は、その他 を選択してください。
・JSON形式で出力してください。
\# 判定の際の補足定義
・文章のメタ情報を考慮して判断してください。
\# 親トピック
\{parent\_topic\}
\# 文章
\{input\_text\}
\# 文章のメタ情報
\{document\_metadata\}
\# トピック候補（凡例：トピックID，トピック名（トピック説明））
\{child\_topic\_definition\}
\# 出力形式
\{
"topic\_id": int,
"topic\_name": str,
"reason": str
\}Table 14\.Prompt used for evaluating polarity stance similarity between all pairs of topics \(described in Section[4\.1\.5](https://arxiv.org/html/2604.18919#S4.SS1.SSS5)\)\# Criteria name topic\_stance\_similarity \# Evaluation steps 1\. Read the topic labels and descriptions of the two topics carefully\. 2\. Compare the main themes, concepts, and ideas expressed in both topics\. 3\. Determine whether the topics are clearly distinct in stances\. \# Criteria \(prompt\) Are the two topics clearly distinct in stance, describing opposing or mutually exclusive positions on a theme or idea? \# Rubric \(score interpretation\) 0\-\-2: The two topics have almost the same stance \(very low stance diversity\)\. 3\-\-5: The topics are somewhat distinct in stance \(low stance diversity\)\. 6\-\-8: The topics are mostly different in stance \(moderate stance diversity\)\. 9\-\-10: The topics are clearly distinct, expressing opposing or mutually exclusive positions on a theme or idea \(high stance diversity\)\.

Notes:The evaluator is configured by defining the evaluation steps and criterion question through aCriteriafunction, while qualitative score meanings are provided independently via aRubriclist that specifies expected outcomes for each score range\.

Table 15\.Prompt used for evaluatiing Topic Label Alignment \(described in Section[4\.2\.1](https://arxiv.org/html/2604.18919#S4.SS2.SSS1)\)\# Criteria name Topic Label Alignment \# Evaluation steps 1\. Read the topic label and topic description carefully\. 2\. Read the given document associated with the topic\. 3\. For the given document, strictly judge whether its main meaning, theme, and details are fully and semantically captured by the topic label and description, and vice versa\. 4\. If any meaning\-level mismatch, omission, or extraneous concept is found between the document and the label and description, even if minor, count the document as misaligned\. \# Criteria \(prompt\) For the document, do the topic label and description align completely and semantically with its content? \# Rubric \(score interpretation\) 0\-\-2: The document is largely misaligned with the topic label and description; its main meaning, theme, or details differ substantially, and the label fails to capture the document’s semantic core\. 3\-\-5: The document shows partial alignment, but key meanings or important details are missing or incorrectly represented\. 6\-\-8: The document is mostly aligned; minor omissions or slight semantic mismatches are present, but the overall meaning is adequately captured\. 9\-\-10: The document is fully and semantically aligned; its central meaning, theme, and key details are precisely and completely represented\.

Notes:In the G\-Eval implementation, the evaluation steps and criterion question are specified through aCriteriafunction, while score meanings are defined separately via a list ofRubricobjects that map score ranges \(0–10\) to qualitative levels of topic label alignment\.

Table 16\.Prompt used for evaluating Semantic\-based Topic Diversity \(described in Section[4\.2\.2](https://arxiv.org/html/2604.18919#S4.SS2.SSS2)\)English Prompt \(Translation\)Japanese Prompt \(Original\)\# Task
To evaluate the diversity of topic modeling results, judge the semantic similarity between two topics\.
\# Requirements
\- Based on the topic names and descriptions, evaluate similarity using the criteria below and assign a score on a 10\-point scale\.
\- Output the reason for the assigned score\.
\- Output the results in JSON format\.
\# Evaluation criteria
\- Whether the topic names describe similar content\.
\- Whether the topic descriptions describe similar content\.
\# Definition of similarity
\- The two topics are described at a comparable level of granularity\.
\- The two topics share similar evaluative or affective nuances \(e\.g\., positive vs\. negative\)\.
\# Topic 1
Topic name: \{topic\_name\_1\}
Topic description: \{topic\_short\_description\_1\}
\# Topic 2
Topic name: \{topic\_name\_2\}
Topic description: \{topic\_short\_description\_2\}
\# Examples
Score 0\-\-2: Completely different content
Topic 1: ‘‘Lack of teamwork’’ \(inability or unwillingness to cooperate with team members\)\.
Topic 2: ‘‘One\-on\-one meetings’’ \(regular one\-on\-one meetings between supervisors and subordinates\)\.
Score 3\-\-5: Partially related, but different in granularity or nuance
Topic 1: ‘‘Lack of teamwork’’ \(attitudes or behaviors reflecting inability or unwillingness to cooperate\)\.
Topic 2: ‘‘Teamwork culture’’ \(organizational culture regarding collaboration and cooperation\)\.
Score 6\-\-8: Semantically similar despite lexical differences
Topic 1: ‘‘Lack of teamwork’’ \(difficulty or reluctance to collaborate\)\.
Topic 2: ‘‘Passive teamwork’’ \(collaboration characterized by passive attitudes\)\.
Score 10: Identical wording and content
Topic names and descriptions are fully identical\.
\# Output format
\{
"score": int,
"reason": str
\}\# メトリクス名
Semantic\-based Topic Diversity
\# タスク
トピックモデリング結果の多様性を評価するために、2つのトピック内容の類似性を判断してください。
\# 要件
・トピック名およびトピック説明をもとに、以下の判定基準を参考に10段階評価でスコアリングしてください。
・そのように判断した理由を出力してください。
・JSON形式で出力してください。
\# 判定基準
・2つのトピック名が似た内容であるか。
・2つのトピック説明が似た内容であるか。
\# 「似ている」の定義
・内容の粒度が同程度であること。
・ポジティブ／ネガティブなどのニュアンスが一致していること。
\# Topic 1
トピック名: \{topic\_name\_1\}
トピック説明: \{topic\_short\_description\_1\}
\# Topic 2
トピック名: \{topic\_name\_2\}
トピック説明: \{topic\_short\_description\_2\}
\# 例
評価スコア0\-\-2：全く異なる内容
Topic 1：チームワークの欠如（協力できない、または協力しようとしない態度や行動）。
Topic 2：1on1面談の実施（上司と部下が定期的に面談を行うこと）。
評価スコア3\-\-5：一部関連性はあるが粒度やニュアンスが異なる
Topic 1：チームワークの欠如（協力できない態度や行動）。
Topic 2：チームワークの風土（協力姿勢や能力に関する文化）。
評価スコア6\-\-8：用語は異なるが内容は類似
Topic 1：チームワーク不足（協力できない状況）。
Topic 2：消極的チームワーク（協力姿勢が消極的な状態）。
評価スコア10：完全に同一
トピック名および説明が完全に一致している場合。
\# 出力形式
\{
"score": int,
"reason": str
\}
Notes:The illustrative examples are provided in Japanese and may be adapted or replaced depending on the specific task or evaluation context\.

Table 17\.Prompt used for evaluating Specificity \(described in Section[4\.2\.3](https://arxiv.org/html/2604.18919#S4.SS2.SSS3)\)\# Criteria name Specificity \# Evaluation steps 1\. Read the topic label and its description carefully\. 2\. When it becomes clear that the topic has a positive or negative impact on business performance or employee engagement, evaluate whether the leader \-\-\- the subject of the topic \-\-\- can easily form an actionable mental image of the behavioral changes they should implement\. 3\. Evaluate whether the topic refers to a narrowly defined situation rather than a broad or generalized category of issues\. 4\. If the topic relies on overly broad themes or spans multiple unrelated aspects, treat it as low in specificity\. \# Criteria \(prompt\) This criterion evaluates the topic along two axes: \(i\) imaginability \-\-\- whether a concrete and actionable mental image can be formed; and \(ii\) specificity \-\-\- whether the described situation is narrow and well\-defined rather than overly broad or semantically dispersed\. Is the topic imaginable and specific enough for the leader? \# Rubric \(score interpretation\) 0\-\-2: Extremely low specificity and imaginability\. The topic is abstract, overly broad, or mixes multiple unrelated aspects, preventing a coherent mental image\. The leader cannot visualize who is acting, what is happening, or in what situation\. Example: 組織迷走と多問題化（経営層の方向性が不明確なまま、複数の問題が同時に生じている状況）。 3\-\-5: Low specificity and imaginability\. Some concrete elements are present, but the topic remains broad or semantically dispersed, making it difficult to form a single actionable scenario\. The leader can grasp the general idea but not a coherent behavioral change\. Example: 新規事業推進の負荷増大（意思決定遅延と情報共有不足により現場負荷が増加している状態）。 6\-\-8: Moderate to high specificity and imaginability\. The topic is reasonably focused with identifiable actors and actions, allowing a mostly coherent mental image, though some details may remain generalized\. Example: 承認停滞を生む業務放置（管理職によるレビュー遅延で業務進行が滞る状況）。 9\-\-10: Very high specificity and imaginability\. The topic is narrow, concrete, and semantically unified, with clear actors, actions, and context\. The leader can immediately visualize a vivid and actionable scene\. Example: 会議発言遮断による停滞（週次会議で部長が部下の発言を遮る場面）。

Notes:In the G\-Eval implementation, the evaluation procedure and criterion question are defined via aCriteriafunction, while score meanings and example\-based interpretations are specified independently through aRubriclist mapping score ranges \(0–10\) to qualitative levels of topic specificity\. The illustrative rubric examples are provided in Japanese and may be adapted depending on the task\.

Table 18\.Prompt used for evaluating Polarity Stance Consistency \(described in Section[4\.2\.4](https://arxiv.org/html/2604.18919#S4.SS2.SSS4)\)\# Criteria name Polarity Stance Consistency \# Evaluation steps 1\. Read the topic label and description carefully\. 2\. Paraphrase the main phenomenon, condition, or state described, without considering emotional or evaluative direction\. 3\. Consider whether the topic could plausibly be interpreted as describing more than one mutually exclusive or opposite state, such as presence vs\. absence, strong vs\. weak, positive vs\. negative, or increase vs\. decrease\. For example, topics like ‘‘manager influence,’’ ‘‘job satisfaction,’’ or ‘‘work\-\-life balance’’ may refer to either high or low levels, presence or absence, or improvement or decline\. 4\. List the main plausible interpretations regarding the presence, absence, or degree of the phenomenon\. If any pair of interpretations are mutually exclusive or opposites, mark the topic as inconsistent\. If only a single meaning or state is reasonably plausible, mark it as consistent\. \# Criteria \(prompt\) Do the topic label and description allow for mutually exclusive or opposite meanings \(e\.g\., presence vs\. absence, high vs\. low, increase vs\. decrease\)? If any pair of plausible interpretations are opposites or mutually exclusive, the topic is inconsistent, regardless of evaluative direction\. If only one meaning or state is reasonably plausible, the topic is consistent\. \# Rubric \(score interpretation\) 0\-\-2: The topic is clearly contradictory or contains explicitly opposing stances, making it impossible to assign a single position\. Example: ‘‘Manager’s management of subordinates’’ \(describes various and opposing behaviors and attitudes without indicating a clear stance\)\. 3\-\-5: The topic somewhat includes opposing or conflicting stances\. Both positive and negative interpretations are possible, but one may be slightly more dominant\. 6\-\-8: The topic is generally consistent in stance, though minor ambiguity or alternative interpretations are possible\. 9\-\-10: The topic is clearly consistent, expressing a single and unambiguous stance\. Example: ‘‘Supportive management practices’’ \(clearly indicates a positive stance\)\.

Notes:In the G\-Eval implementation, the evaluation steps and criterion question are defined via aCriteriafunction, while score meanings and example\-based interpretations are specified independently through aRubriclist mapping score ranges \(0–10\) to qualitative levels of polarity stance consistency\.

## Appendix DFull Results on Topic and Outcomes Relationship

Table 19\.Relationship between leadership topics and firm outcomesTable 20\.Relationship between leadership topics and firm outcomes \(continued\)Table 21\.Relationship between leadership topics and firm outcomes \(continued\)Table 22\.Relationship between leadership topics and firm outcomes \(continued\)Table 23\.Relationship between leadership topics and firm outcomes \(continued\)Table 24\.Relationship between leadership topics and firm outcomes \(continued\)Table 25\.Relationship between leadership topics and firm outcomes \(continued\)Table 26\.Relationship between leadership topics and firm outcomes \(continued\)Table 27\.Relationship between leadership topics and firm outcomes \(continued\)Table 28\.Relationship between leadership topics and firm outcomes \(continued\)Table 29\.Relationship between leadership topics and firm outcomes \(continued\)Table 30\.Relationship between leadership topics and firm outcomes \(continued\)Table 31\.Relationship between leadership topics and firm outcomes \(continued\)Table 32\.Relationship between leadership topics and firm outcomes \(continued\)Table 33\.Relationship between leadership topics and firm outcomes \(continued\)Table 34\.Relationship between leadership topics and firm outcomes \(continued\)Table 35\.Relationship between leadership topics and firm outcomes \(continued\)Table 36\.Relationship between leadership topics and firm outcomes \(continued\)Table 37\.Relationship between leadership topics and firm outcomes \(continued\)Table 38\.Relationship between leadership topics and firm outcomes \(continued\)Table 39\.Relationship between leadership topics and firm outcomes \(continued\)“Type\.” refers to the leader type \(Top or Non\-top\), and “Char\.” refers to the leader characteristic \(Ability, Attitude, or Behavior\)\. Translated Translated Topic Name \(Original Topic Name\) and Translated Translated Topic Description \(Original Topic Description\) are automatically generated English translations produced by GPT\-4\.1\-mini, given the original Japanese topic name and description as input\.Notes:For each entry, the reported value is the estimated coefficient, with the standard error shown in parentheses\.∗,∗∗, and∗∗∗indicate statistical significance at the 10%, 5%, and 1% levels, respectively\.ttdenotes the annual reviewer\-count threshold \(minimum number of reviewers per firm\-year\)\.NNdenotes the number of posts assigned to each topic\. Topics are included only if they are statistically significant at the 5% level for both outcomes in at least 0 different thresholds amongt∈\{5,10,15\}t\\in\\\{5,10,15\\\}\. Topic names may be manually renamed based on the topic descriptions for interpretability\.
Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data

Similar Articles

Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups

Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence

Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

Submit Feedback

Similar Articles

Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups
Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence
Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models