ModTGCN: Modularity-aware Graph Neural Networks for Text Classification
Summary
ModTGCN is a modularity-aware graph neural network that jointly optimizes cross-entropy and a modularity-based auxiliary objective to improve text classification by leveraging global community structure in document graphs, achieving consistent gains on five benchmarks.
# ModTGCN: Modularity-aware Graph Neural Networks for Text Classification
Source: [https://arxiv.org/html/2606.23694](https://arxiv.org/html/2606.23694)
Aditya Sharma, Vinti Agarwal, Hari Om Aggrawal

Affiliation 1: BITS Pilani, India. Email: f20201822p@alumni.bits-pilani.ac.in, {p20200470, vinti.agarwal}@pilani.bits-pilani.ac.in

Affiliation 2: Independent Researcher. Email: hariom85@gmail.com

Corresponding author: vinti.agarwal@pilani.bits-pilani.ac.in
###### Abstract
Graph-based text classification models typically rely on local neighborhood aggregation and overlook global community structure, despite semantic document graphs exhibiting strong class-consistent clustering. Ignoring this can blur class boundaries and lead to over-smoothing. We propose ModTGCN, a modularity-aware graph neural network for text classification that jointly optimizes cross-entropy and a modularity-based auxiliary objective to promote class-coherent document communities while preserving discriminative representations. The modularity term is computed on a document–document similarity graph derived from transformer embeddings (pretrained or fine-tuned). To improve scalability, we decouple the original heterogeneous TextGCN graph into separate document–word and word–word components, achieving 2×–10× faster training. We further study graph construction strategies, label-aware edge reweighting, and supervision choices for modularity optimization. Experiments on five benchmarks show consistent gains, with larger improvements on complex, low-homophily datasets such as Ohsumed and 20NG.
## 1 Introduction
Text classification remains a fundamental task in natural language processing. Recent advances in transformer-based models such as BERT and large language models (LLMs) have achieved strong zero- and few-shot performance \[[16](https://arxiv.org/html/2606.23694#bib.bib40),[23](https://arxiv.org/html/2606.23694#bib.bib39)\]. However, these models often require costly full fine-tuning, parameter-efficient adaptation (e.g., adapters, LoRA), or prompt calibration to perform well in supervised settings. An alternative line of work formulates text classification as learning over graph structures, where words and documents are represented as nodes, and edges encode lexical or semantic relationships. Graph neural network (GNN) approaches such as TextGCN \[[24](https://arxiv.org/html/2606.23694#bib.bib16)\], TensorGCN \[[9](https://arxiv.org/html/2606.23694#bib.bib17)\], BertGCN \[[8](https://arxiv.org/html/2606.23694#bib.bib20)\], and VGCN-BERT \[[10](https://arxiv.org/html/2606.23694#bib.bib19)\] explicitly model interactions among words and documents, enabling semi-supervised learning through relational propagation.

Despite their effectiveness, most graph-based text classifiers rely primarily on local neighborhood aggregation. Yet semantic document graphs often exhibit clear mesoscopic structure: documents sharing the same label tend to form assortative clusters with dense intra-class connectivity and sparse inter-class links. When this global community structure is ignored, two issues commonly arise: (i) hub-driven shortcuts from high-frequency terms or noisy similarities blur class boundaries, and (ii) over-smoothing in deeper GNNs homogenizes representations across weak community separations, reducing discriminability. Incorporating global structural information during training may therefore improve robustness and class separation \[[2](https://arxiv.org/html/2606.23694#bib.bib31)\].
Modularity optimization, originally developed for community detection \[[14](https://arxiv.org/html/2606.23694#bib.bib33),[15](https://arxiv.org/html/2606.23694#bib.bib34)\], provides a principled mechanism for modeling such global structure. Motivated by this observation, we introduce ModTGCN, a modularity-aware GNN for text classification. Our approach augments standard cross-entropy supervision with a modularity-based auxiliary objective computed on a document–document similarity graph derived from transformer embeddings. The central hypothesis is that explicitly optimizing modular structure aligns learned representations with class-level graph communities, thereby providing global regularization beyond local message passing. Although modularity-aware GNNs have shown promise in other domains \[[12](https://arxiv.org/html/2606.23694#bib.bib3),[22](https://arxiv.org/html/2606.23694#bib.bib35)\], their use in semi-supervised document classification remains underexplored.
A key challenge in semi-supervised graph learning is propagating supervision from limited labeled nodes while respecting global structure. We address this by computing modularity on a document–document graph constructed from pretrained or fine-tuned SBERT embeddings \[[20](https://arxiv.org/html/2606.23694#bib.bib11)\], using gold labels for labeled nodes and TextGCN \[[24](https://arxiv.org/html/2606.23694#bib.bib16)\] predictions for unlabeled nodes. This hybrid supervision scheme yields label-efficient and interpretable improvements without requiring expensive LLM fine-tuning, while remaining encoder-agnostic and compatible with future embedding advances.
To further improve scalability, we decouple the original single heterogeneous TextGCN graph into separate document–word (TF–IDF) and word–word (PMI) components. This preserves the original propagation mechanism while substantially reducing computational overhead on large datasets. A detailed complexity analysis is provided in Section [5.3](https://arxiv.org/html/2606.23694#S5.SS3).
Our contributions are as follows:
- Modularity-aware GNN with hybrid supervision. We introduce a joint objective $\mathcal{L} = \lambda\,\mathrm{CE} + (1-\lambda)(-Q)$ that promotes class-consistent communities on a document–document graph built from language model embeddings, using pseudo-labels for unlabeled nodes.
- Architectural decoupling of TextGCN. We reformulate the heterogeneous graph into separate document–word and word–word components, improving scalability without altering the effective decision function.
- Document–document adjacency strategies. We compare TF–IDF, cosine, and Gaussian similarity graphs, analyzing accuracy–scalability trade-offs.
- Empirical validation. Experiments on five benchmark datasets demonstrate that jointly optimizing classification and modularity consistently improves performance, particularly on structurally complex datasets.
## 2 Related Work
Existing graph-based text classification methods can be categorized into GNN-only and hybrid GNN–language model (LM) approaches. GNN-only methods such as TextGCN \[[24](https://arxiv.org/html/2606.23694#bib.bib16)\] construct heterogeneous document–word graphs using TF–IDF and PMI edges and apply a two-layer GCN for semi-supervised text classification. Subsequent variants extend this design by incorporating multi-view or tensor-based word graphs \[[9](https://arxiv.org/html/2606.23694#bib.bib17)\], heterogeneous attention mechanisms \[[18](https://arxiv.org/html/2606.23694#bib.bib29)\], or additional lexical nodes (e.g., characters and n-grams) to enrich structural information \[[7](https://arxiv.org/html/2606.23694#bib.bib18)\]. While effective, these methods rely on large, fine-grained graphs, which incur high computational costs and limit scalability. Hybrid GNN–LM models integrate contextual embeddings from pretrained transformers into graph learning. Approaches such as BertGCN \[[8](https://arxiv.org/html/2606.23694#bib.bib20)\] and VGCN-BERT \[[10](https://arxiv.org/html/2606.23694#bib.bib19)\] jointly train or fuse language models with graph encoders to combine local contextual signals and global relational structure. Although they improve performance, co-training LMs with GNNs incurs substantial computational overhead.

Modularity in GNNs. Modularity \[[13](https://arxiv.org/html/2606.23694#bib.bib22)\] measures community quality by comparing observed within-community edges against a degree-preserving null model. A known limitation is the *resolution limit* \[[1](https://arxiv.org/html/2606.23694#bib.bib23)\], which causes small but coherent communities to be merged into larger ones. To alleviate this, resolution-adjusted variants introduce a tunable parameter \[[19](https://arxiv.org/html/2606.23694#bib.bib24)\] (with $\gamma > 1$ revealing smaller groups) or adopt *modularity density* \[[3](https://arxiv.org/html/2606.23694#bib.bib25)\] and its generalization $Q_g$ \[[4](https://arxiv.org/html/2606.23694#bib.bib26)\], which weights communities by internal link density to better retain cohesive clusters. Several works integrate modularity into neural objectives: modularity regularizers have been added to GAE/VGAE for unsupervised community detection \[[21](https://arxiv.org/html/2606.23694#bib.bib27)\] and to VGAER \[[17](https://arxiv.org/html/2606.23694#bib.bib28)\], while Murata & Afzal \[[12](https://arxiv.org/html/2606.23694#bib.bib3)\] directly optimize modularity during GNN training, producing embeddings aligned with community structure for clustering tasks. These studies show that modularity-aware learning can uncover latent structure and improve downstream quality.

*Connection to our setting:* Prior modularity-aware approaches primarily target unsupervised clustering or community detection. In contrast, we integrate modularity into semi-supervised text classification, aligning graph communities with class labels through hybrid supervision on a document–document similarity graph. This provides a global structural prior that complements local message passing while maintaining computational efficiency via our decoupled graph construction.
## 3 Proposed Method
### 3.1 Problem Formulation
Given a corpus of documents $\mathcal{D} = \{d_1, d_2, \ldots, d_n\}$ and a corresponding label set $\mathcal{Y} = \{y_1, y_2, \ldots, y_n\}$ spanning $C$ classes, the primary goal is to perform document classification. During training, the document set $\mathcal{D}$ is partitioned into a labeled set $U$ and an unlabeled set $V$, and the objective is to learn a mapping function $f: \mathcal{D} \rightarrow \mathcal{Y}$ that accurately predicts the class $y_i$ of each unlabeled document using graph-based relational structure.
### 3.2 ModTGCN: Graph Construction
ModTGCN operates over three graphs: (1) a document–word graph $\mathcal{G}_d = (\mathcal{V}_d, \mathcal{E}_d)$ constructed via TF–IDF weights, (2) a word–word graph $\mathcal{G}_w = (\mathcal{V}_w, \mathcal{E}_w)$ built from PMI-based word co-occurrence, and (3) a document–document similarity graph for modularity optimization, $\mathcal{G}_{doc} = (\mathcal{V}_{doc}, \mathcal{E}_{doc})$, whose edge weights represent node similarity computed with the Gaussian (RBF) kernel $S_{ij} = \exp\!\left(-\tfrac{\|e_i - e_j\|^2}{2\sigma^2}\right)$ \[[25](https://arxiv.org/html/2606.23694#bib.bib45)\]. Here $e_i$ is the transformer embedding of document $i \in \mathcal{V}_{doc}$, and $\sigma$ controls the neighborhood sensitivity in kernel space. The first two graphs preserve TextGCN's propagation structure, while the third introduces global structural supervision.
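As a minimal sketch of the third graph, the Gaussian kernel above can be evaluated densely with NumPy. The function name `gaussian_doc_graph`, the random stand-in embeddings, and the choice to zero the diagonal (no self-loops) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def gaussian_doc_graph(emb: np.ndarray, sigma: float) -> np.ndarray:
    """Dense RBF affinity S_ij = exp(-||e_i - e_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(emb ** 2, axis=1)
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * emb @ emb.T
    sq_dist = np.maximum(sq_dist, 0.0)  # clip tiny negatives from round-off
    S = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)  # assumption: G_doc has no self-loops
    return S

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))  # stand-in for transformer document embeddings
S = gaussian_doc_graph(emb, sigma=2.0)
```

A smaller `sigma` concentrates weight on the closest documents, while a larger one flattens the affinities toward a uniform graph. A dense $n \times n$ matrix is only practical for small corpora.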
### 3.3 Modularity as an Objective Function
Standard GCNs \[[6](https://arxiv.org/html/2606.23694#bib.bib42)\] aggregate over $L$-hop local neighborhoods, whereas modularity introduces global, degree-aware coupling between all node pairs via the null-model term. Semantic document graphs often exhibit mesoscopic community structure aligned with class labels; to explicitly encourage such structure, we introduce modularity as an auxiliary objective. Modularity $Q$ measures the deviation of observed intra-community connectivity from a degree-preserving null model. For the label (or community) matrix $P \in \mathbb{R}^{n \times C}$, the modularity is:

$$Q(P) = \frac{1}{2m}\,\mathrm{Tr}\!\big(P^{\top} B P\big), \qquad B = A - \gamma\,\frac{k k^{\top}}{2m}, \tag{1}$$

where $A$ is the adjacency matrix, $k$ the degree vector, $m$ the total edge weight, and $\gamma$ the resolution parameter. The modularity matrix $B$ quantifies how much each observed edge weight deviates from its null-model expectation. Building on this, the modularity loss is:

$$\mathcal{L}_{mod}(P) = -\,Q(P), \tag{2}$$

which decreases as the intra-community edge weight of the predicted communities (induced by $P$) exceeds the null-model expectation by a larger margin. Importantly, even non-adjacent nodes ($A_{ij} = 0$) contribute via the null-model term $\gamma\,\frac{k k^{\top}}{2m}$, giving modularity its global coupling effect.
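To make Eq. (1) concrete, the sketch below evaluates $Q(P)$ on a hypothetical six-node graph (two triangles joined by one edge). The graph and the helper name `modularity` are illustrative assumptions. It contrasts a partition that matches the two triangles with a degenerate single-cluster assignment.

```python
import numpy as np

def modularity(A: np.ndarray, P: np.ndarray, gamma: float = 1.0) -> float:
    """Q(P) = Tr(P^T B P) / (2m) with B = A - gamma * k k^T / (2m)."""
    k = A.sum(axis=1)      # degree vector
    two_m = k.sum()        # 2m: every undirected edge is counted twice
    B = A - gamma * np.outer(k, k) / two_m
    return float(np.trace(P.T @ B @ P) / two_m)

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by edge (2,3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

P_split = np.eye(2)[[0, 0, 0, 1, 1, 1]]   # one community per triangle
P_single = np.eye(2)[[0, 0, 0, 0, 0, 0]]  # everything in one community

q_split = modularity(A, P_split)    # 5/14, about 0.357
q_single = modularity(A, P_single)  # 0: the null model cancels all edges
```

The single-cluster assignment scores exactly zero because the rows of $B$ sum to zero when $\gamma = 1$, which is why maximizing $Q$ discourages collapsing all documents into one class.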
##### Modularity gradient under hybrid supervision.
We partition the label matrix $P$ as $(P_U, P_V)$ and the modularity matrix $B$ into blocks $\{B_{UU}, B_{UV}, B_{VU}, B_{VV}\}$ according to the labeled nodes $U$ and the unlabeled nodes $V$.
Modularity $Q$ is computed on $\mathcal{G}_{doc}$, using gold labels for $P_U$ and TextGCN pseudo-labels for $P_V$. We also evaluate a variant that uses soft labels for both $P_U$ and $P_V$; see the ablation study in Section [6](https://arxiv.org/html/2606.23694#S6). The gradient with respect to $P_V$ is:

$$\nabla_{P_V}\mathcal{L}_{mod} = -\,\frac{1}{m}\Big(B_{VV} P_V + B_{VU} P_U\Big). \tag{3}$$

The term $B_{VU} P_U$ acts as a degree-corrected supervision field induced by labeled nodes, while $B_{VV} P_V$ couples unlabeled nodes to one another. Expanding the supervision field for node $i \in V$ and class $c$, with $g_j$ the gold label of node $j$:

$$\big[B_{VU} P_U\big]_{i,c} = \sum_{j \in U} A_{ij}\,\mathbf{1}[g_j = c] \;-\; \gamma\,\frac{k_i}{2m} \sum_{j \in U} k_j\,\mathbf{1}[g_j = c]. \tag{4}$$

Thus, nodes are encouraged toward classes where observed connectivity exceeds the null-model expectation, mitigating hub bias and discouraging degenerate single-cluster assignments.
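The gradient in Eq. (3) can be sanity-checked numerically. The sketch below builds a small random weighted graph (an illustrative assumption, as are the index sets `U` and `V`), evaluates the closed form, and compares it with central finite differences of $\mathcal{L}_{mod} = -Q$.

```python
import numpy as np

def mod_loss(A, P, gamma=1.0):
    """Return L_mod = -Q(P), the modularity matrix B, and m."""
    k = A.sum(axis=1)
    two_m = k.sum()
    B = A - gamma * np.outer(k, k) / two_m
    return -np.trace(P.T @ B @ P) / two_m, B, two_m / 2.0

rng = np.random.default_rng(0)
n, C = 6, 2
W = rng.random((n, n))
A = np.triu(W, 1)
A = A + A.T                    # symmetric weights, zero diagonal
U = np.array([0, 2, 3])        # labeled nodes (hypothetical split)
V = np.array([1, 4, 5])        # unlabeled nodes
P = rng.random((n, C))         # soft labels everywhere ...
P[U] = np.eye(C)[[0, 1, 0]]    # ... overwritten by one-hot gold labels on U

_, B, m = mod_loss(A, P)
# Closed form of Eq. (3): -(1/m) (B_VV P_V + B_VU P_U)
grad_V = -(B[np.ix_(V, V)] @ P[V] + B[np.ix_(V, U)] @ P[U]) / m

# Central finite differences over the entries of P_V
eps = 1e-6
num = np.zeros_like(grad_V)
for a, i in enumerate(V):
    for c in range(C):
        P_plus, P_minus = P.copy(), P.copy()
        P_plus[i, c] += eps
        P_minus[i, c] -= eps
        num[a, c] = (mod_loss(A, P_plus)[0] - mod_loss(A, P_minus)[0]) / (2 * eps)
```

Because the loss is quadratic in $P$ and $B$ is symmetric, the two estimates should agree up to floating-point round-off.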
##### Global coupling in modularity $Q$ (toy).
Figure [1](https://arxiv.org/html/2606.23694#S3.F1) shows three cases highlighting the global nature of modularity. In S1 (baseline), node 2 links to moderate-degree blue nodes. Because these observed edges exceed the degree-corrected null-model expectation, the supervision field favors the blue class and penalizes red. In S2 (global change), adding $(1,5)$ and $(3,4)$ increases blue node degrees and $m$ without altering node 2's neighborhood, raising the null-model baseline and weakening the blue field, despite its immediate connections remaining unchanged. This demonstrates the dependence of modularity on the *global* degree distribution, rather than local adjacency. In S2+, adding $(2,5)$ changes $k_2$ and $m$ and activates $B_{VV} P_V$ in ([3](https://arxiv.org/html/2606.23694#S3.E3)), so node 2 is influenced not only by labeled neighbors but also by unlabeled neighbors' soft labels (e.g., a red-leaning node 5 reduces its preference for blue), illustrating unlabeled–unlabeled coupling.
[Figure 1 panels, each a six-node graph: (a) S1: baseline. (b) S2: global change. (c) S2+: add (2–5).]
Figure 1: Toy graphs with labeled nodes $U = \{1, 3, 4\}$ (blue, red, blue) and unlabeled nodes (grey) $V = \{2, 5, 6\}$. Node 2 always connects to nodes 1 and 4 (both A). Blue/red = class A/B. Thick edges are added in S2; dashed in S2+.
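The S1 versus S2 effect can be reproduced numerically with the supervision field of Eq. (4). The full edge lists of the figure are not recoverable here, so the baseline edges below are a hypothetical choice that only respects what the caption states (node 2 links to nodes 1 and 4, and S2 adds (1,5) and (3,4)). The helper name `field` is likewise illustrative.

```python
import numpy as np

def field(edges, node, labeled, n=6, gamma=1.0):
    """Supervision field [B_VU P_U]_{node, c} of Eq. (4) for every class c."""
    A = np.zeros((n, n))
    for i, j in edges:                 # nodes are numbered from 1 as in Figure 1
        A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0
    k = A.sum(axis=1)
    two_m = k.sum()
    out = {}
    for cls in sorted(set(labeled.values())):
        idx = [j - 1 for j, g in labeled.items() if g == cls]
        observed = A[node - 1, idx].sum()
        expected = gamma * k[node - 1] / two_m * k[idx].sum()
        out[cls] = observed - expected
    return out

labeled = {1: "blue", 3: "red", 4: "blue"}
s1_edges = [(1, 2), (2, 4), (3, 6), (5, 6)]  # hypothetical baseline edges
s2_edges = s1_edges + [(1, 5), (3, 4)]       # additions named in the text

f1 = field(s1_edges, node=2, labeled=labeled)  # blue 1.5, red -0.25
f2 = field(s2_edges, node=2, labeled=labeled)  # blue about 1.333, red about -0.333
```

Under these assumed edges, node 2 keeps the same two blue neighbors in both cases, yet its blue field drops from 1.5 to about 1.33 because the added edges raise the degrees of nodes 1 and 4 and therefore the null-model expectation.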
##### Observation\.
This mechanism provides (i) *global supervision propagation* via the degree-corrected field $B_{VU} P_U$, allowing few labels to influence distant nodes, and (ii) *unlabeled–unlabeled coherence* via $B_{VV} P_V$, aligning predictions among unlabeled neighbors. Overall, modularity propagates supervision through a degree-aware global field: labeled nodes influence distant regions via the null model, and unlabeled nodes align through mutual coupling. This mechanism complements local GCN aggregation by enforcing mesoscopic community consistency while mitigating hub-driven shortcuts and degenerate single-cluster solutions.
#### 3.3.1 Joint Optimization Objective
For decoupled TextGCN, we use the standard cross-entropy loss on labeled nodes:

$$\mathcal{L}_{CE} = -\,\mathrm{Tr}\!\left(Y_U^{\top} \log \hat{Y}_U\right), \tag{5}$$

where $\mathrm{Tr}(\cdot)$ is the trace, $Y_U$ is the one-hot label matrix, and $\hat{Y}_U$ denotes the predicted label distributions. Predictions are computed as

$$\hat{Y} = \mathrm{softmax}\!\left(\mathcal{A}_{dw}\left(\mathcal{A}_{dw}^{\top} W_1 + \mathcal{A}_{ww} W_2\right) W_3\right), \tag{6}$$

with $\mathcal{A}_{dw}$ and $\mathcal{A}_{ww}$ the document–word and word–word adjacencies, and $W_1$, $W_2$, $W_3$ trainable weights. This preserves the original TextGCN propagation while improving scalability (see Section [5.3](https://arxiv.org/html/2606.23694#S5.SS3)). The final objective combines classification and modularity:

$$\mathcal{L}_{total} = \lambda\,\mathcal{L}_{CE} + (1-\lambda)\,\mathcal{L}_{mod}, \tag{7}$$

where $\lambda \in [0,1]$ controls the trade-off between predictive accuracy and community-structure preservation. Since $\mathcal{L}_{mod} = -Q$, minimizing $\mathcal{L}_{total}$ lowers the cross-entropy while raising the modularity $Q$; $\lambda = 1$ recovers pure cross-entropy training.
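The pieces of this subsection can be assembled into a forward-only sketch. It follows Eqs. (5) and (6) and combines the two terms as $\lambda\,\mathcal{L}_{CE} + (1-\lambda)\,\mathcal{L}_{mod}$ with $\mathcal{L}_{mod} = -Q$, so that a lower loss corresponds to a higher $Q$. The function `joint_loss`, the random matrices, and all shapes are illustrative assumptions; a real implementation would use an autodiff framework rather than NumPy.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def joint_loss(A_dw, A_ww, W1, W2, W3, A_doc, Y, labeled, lam, gamma=1.0):
    # Eq. (6): document predictions from the decoupled propagation
    Y_hat = softmax(A_dw @ (A_dw.T @ W1 + A_ww @ W2) @ W3)
    # Eq. (5): cross-entropy on labeled documents only
    ce = -np.sum(Y[labeled] * np.log(Y_hat[labeled] + 1e-12))
    # Hybrid label matrix: gold rows for labeled nodes, predictions elsewhere
    P = Y_hat.copy()
    P[labeled] = Y[labeled]
    k = A_doc.sum(axis=1)
    two_m = k.sum()
    B = A_doc - gamma * np.outer(k, k) / two_m
    l_mod = -np.trace(P.T @ B @ P) / two_m  # Eq. (2): L_mod = -Q
    return lam * ce + (1.0 - lam) * l_mod, ce, l_mod

rng = np.random.default_rng(0)
n_docs, n_words, H, C = 6, 10, 4, 2
A_dw = rng.random((n_docs, n_words))   # stand-in for TF-IDF weights
A_ww = rng.random((n_words, n_words))
A_ww = (A_ww + A_ww.T) / 2             # stand-in for symmetric PMI weights
W1 = rng.normal(size=(n_docs, H))
W2 = rng.normal(size=(n_words, H))
W3 = rng.normal(size=(H, C))
A_doc = np.triu(rng.random((n_docs, n_docs)), 1)
A_doc = A_doc + A_doc.T                # stand-in for the doc-doc similarity graph
Y = np.eye(C)[[0, 1, 0, 1, 0, 1]]      # only the rows indexed by `labeled` are used
labeled = np.array([0, 1, 2])

total, ce, l_mod = joint_loss(A_dw, A_ww, W1, W2, W3, A_doc, Y, labeled, lam=0.5)
```

Setting `lam=1.0` reduces the objective to cross-entropy alone, which matches the role of $\lambda$ as the weight on the classification term.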
## 4 Experimental setup
### 4.1 Baselines
We evaluate our method on five benchmark datasets: MR, R8, R52, 20 Newsgroups (20NG), and Ohsumed, as used in TextGCN \[[24](https://arxiv.org/html/2606.23694#bib.bib16)\]. Detailed dataset statistics are available on our website \[[11](https://arxiv.org/html/2606.23694#bib.bib48)\]. We compare our framework against three broad categories of models. GNN-based models: TextGCN \[[24](https://arxiv.org/html/2606.23694#bib.bib16)\], TensorGCN \[[9](https://arxiv.org/html/2606.23694#bib.bib17)\], WCTextGCN \[[7](https://arxiv.org/html/2606.23694#bib.bib18)\]; Classification on BERT embeddings: Logistic Regression (LR) and Linear Probing (a single-layer MLP); LLMs in zero- and few-shot settings: OpenAI GPT-3, ChatGPT; Ours: ModTGCN with $\mathcal{G}_{doc}$ built from pre-trained or fine-tuned embeddings.
### 4.2 Implementation and Evaluation Metrics
We follow the TextGCN \[[24](https://arxiv.org/html/2606.23694#bib.bib16)\] configuration, including preprocessing, sliding window size (20), dataset splits, and hidden layer size (200). The model is trained with the Adam optimizer for up to 300 epochs with early stopping (patience 30). For modularity optimization, hyperparameters including dropout, $\sigma$, learning rate $\eta$, resolution $\gamma$, loss weight $\lambda$, and edge-weight parameters $(\alpha, \beta, k)$ were tuned with Optuna. Performance is reported using micro-F1, averaged over five random seeds for stability. The code is available on GitHub: *[https://github.com/Rajarshi-Misra/ModTGCN](https://github.com/Rajarshi-Misra/ModTGCN)*.
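For readers comparing the micro- and macro-F1 numbers reported later, the sketch below computes both for single-label multiclass predictions. The helper `f1_scores` and the toy labels are illustrative assumptions and are not taken from the released code. In this single-label setting micro-F1 equals accuracy, while macro-F1 averages per-class F1 and is therefore more sensitive to rare classes.

```python
import numpy as np

def f1_scores(y_true, y_pred, n_classes):
    """Return (micro-F1, macro-F1) for single-label multiclass predictions."""
    tp = np.zeros(n_classes)
    fp = np.zeros(n_classes)
    fn = np.zeros(n_classes)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    denom = 2 * tp + fp + fn
    # Classes that never occur in truth or prediction get F1 = 0 here
    per_class = np.divide(2 * tp, denom, out=np.zeros(n_classes), where=denom > 0)
    return micro, per_class.mean()

# Imbalanced toy example: the only class-1 document is misclassified
y_true = np.array([0, 0, 0, 0, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 2])
micro, macro = f1_scores(y_true, y_pred, n_classes=3)
accuracy = float(np.mean(y_true == y_pred))
```

One missed minority-class document leaves micro-F1 at 5/6 but pulls macro-F1 down to roughly 0.63, the same kind of gap that appears between micro and macro scores on imbalanced datasets.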
## 5 Results
### 5.1 Baseline Performance
Table [1](https://arxiv.org/html/2606.23694#S5.T1) shows that modularity-based optimization substantially improves performance over TextGCN and related GNN baselines. Using pre-trained embeddings, ModTGCN(P) achieves clear gains over TextGCN across datasets, with particularly strong improvements on MR (+4.7), Ohsumed (+3.6), and 20NG (+4.3).
Table 1: Micro-F1 (mean % ± std) of competing methods on MR, R8, R52, Ohsumed and 20NG. Abbreviations: (P)/(F): pre-trained/fine-tuned SBERT-all-mpnet-base embeddings used to build the doc–doc graph for the modularity term; LR: Logistic Regression. Best results per dataset are shown in bold. The homophily values corresponding to each dataset are reported in parentheses.

| Model | MR (0.70) | R8 (0.50) | R52 (0.38) | Ohsumed (0.16) | 20NG (0.19) |
|---|---|---|---|---|---|
| TextGCN | 76.74 ± 0.20 | 97.07 ± 0.13 | 93.56 ± 0.18 | 68.36 ± 0.56 | 86.34 ± 0.09 |
| TensorGCN | 77.91 ± 0.07 | 98.04 ± 0.08 | 95.05 ± 0.11 | 70.11 ± 0.24 | 87.74 ± 0.05 |
| WCTextGCN \[[7](https://arxiv.org/html/2606.23694#bib.bib18)\] | 77.85 ± 0.34 | 97.49 ± 0.20 | 93.88 ± 0.34 | 68.52 ± 0.20 | 87.21 ± 0.54 |
| U-TextGCN \[[5](https://arxiv.org/html/2606.23694#bib.bib9)\] | 76.41 | 96.92 | 93.45 | 68.24 | 86.07 |
| BertGCN \[[8](https://arxiv.org/html/2606.23694#bib.bib20)\] (co-trained BERT with TextGCN) | 86.0 | 98.1 | 96.6 | 72.8 | 89.5 |
| *Pre-trained embeddings (P)* | | | | | |
| LR(P) | 83.73 | 98.12 | 95.63 | 71.30 | 79.06 |
| Linear Probing(P) | 83.49 ± 0.09 | 98.25 ± 0.07 | 95.26 ± 0.05 | 69.87 ± 0.09 | 79.06 ± 0.14 |
| ModTGCN(P) | 81.45 ± 0.05 | 97.55 ± 0.06 | 94.54 ± 0.11 | 71.97 ± 0.15 | 90.6 ± 0.1 |
| *Fine-tuned embeddings (F)* | | | | | |
| LR(F) | 87.87 | 98.04 | **96.73** | 74.65 | 87.17 |
| Linear Probing(F) | 87.37 ± 0.20 | 98.10 ± 0.12 | 96.71 ± 0.11 | 74.76 ± 0.17 | 86.90 ± 0.10 |
| ModTGCN(F) | **88.07 ± 0.06** | **98.70 ± 0.09** | 96.16 ± 0.16 | **77.52 ± 0.25** | **91.14 ± 0.12** |
| ModTGCN(F) warmup | 88.04 | 98.40 | 96.14 | 76.34 | 89.91 |
To examine the complementarity of modularity with domain-adapted representations, we fine-tuned SBERT on each dataset before constructing the document graph. Although LR(F) and Linear Probing(F) benefit from fine-tuning, ModTGCN(F) achieves the highest micro-F1 on four of the five datasets (all except R52), with substantial improvements over its pre-trained counterpart on MR (+6.6) and Ohsumed (+5.5). Moreover, despite its simpler architecture, ModTGCN(P) comes close to TensorGCN on R8 (97.5 vs. 98.0) and within about 0.5 points on R52.

While overall improvements are consistent, the magnitude of gains varies across datasets. We therefore analyze the structural conditions under which modularity is most beneficial. Compared to embedding-only baselines (LR(P) and Linear Probing(P)), ModTGCN(P) achieves slightly lower performance on simpler, high-homophily datasets (relative to Linear Probing(P): MR −2.04, R8 −0.70, R52 −0.72; homophily 0.70, 0.50, 0.38), where pretrained representations are already nearly linearly separable, leaving limited room for structural refinement. In contrast, modularity yields substantial gains on structurally complex, low-homophily datasets (Ohsumed +2.10, 20NG +11.54; homophily 0.16, 0.19). This trend indicates that modularity refinement is most beneficial in low-homophily regimes with overlapping semantic boundaries, and less impactful on datasets that are already easy to separate. Overall, these results demonstrate that modularity-based optimization remains effective and complementary under both pre-trained and fine-tuned embedding settings.
### 5.2 Comparison with LLMs
Table [2](https://arxiv.org/html/2606.23694#S5.T2) compares ModTGCN against zero-shot and few-shot LLM baselines. Reported GPT-3 and ChatGPT scores are taken directly from their respective papers, and missing entries indicate datasets not evaluated in those works. Both ModTGCN(P) and ModTGCN(F) outperform the LLM variants on R8, R52, Ohsumed, and 20NG; on MR, GPT-3 (88.69 zero-shot, 89.15 few-shot) remains ahead of ModTGCN(P) (81.45) and slightly ahead of ModTGCN(F) (88.07). The advantage is most evident on Ohsumed and 20NG, suggesting that explicit graph-structured supervision can outperform prompting-based inference when labeled graph structure is available.
Table 2: Comparison of LLM baselines with ModTGCN. GPT-3 uses InstructGPT-3 (text-davinci-003).

| Model | MR | R8 | R52 | Ohsumed | 20NG |
|---|---|---|---|---|---|
| *LLM zero-shot* | | | | | |
| GPT-3 | 88.69 | 90.19 | 89.06 | – | – |
| ChatGPT | – | 60.10 | 75.23 | 39.93 | 58.70 |
| *LLM few-shot (k)* | | | | | |
| GPT-3 (k=16) | 89.15 | 91.58 | 91.56 | – | – |
| ChatGPT (k=2) | – | 72.54 | 81.68 | 47.05 | 58.44 |
| ChatGPT (k=5) | – | 82.43 | 90.13 | 45.39 | – |
| ModTGCN(P) | 81.45 | 97.55 | 94.54 | 71.97 | 90.60 |
| ModTGCN(F) | 88.07 | 98.70 | 96.14 | 77.52 | 91.14 |
Table 3: Training time (seconds) and epochs for decoupled vs. original TextGCN. Decoupling reduces per-epoch cost, yielding 2×–10× speedup.

| Dataset | Decoupled: time (s) | Decoupled: epochs | Original: time (s) | Original: epochs |
|---|---|---|---|---|
| MR | 10.50 | 111 | 11.50 | 16 |
| R8 | 36.90 | 191 | 87.60 | 97 |
| R52 | 64.55 | 227 | 257.60 | 163 |
| Ohsumed | 80.62 | 294 | 241.10 | 81 |
| 20NG | 149.60 | 243 | 1450.40 | 111 |
### 5.3 Complexity and scalability analysis
The original TextGCN applies two-hop propagation over a single heterogeneous adjacency $A = \begin{bmatrix} 0 & \mathcal{A}_{dw} \\ \mathcal{A}_{wd} & \mathcal{A}_{ww} \end{bmatrix}$, evaluating both document and word branches at each hop. This incurs a sparse cost of $O((4E_{dw} + 2E_{ww})H)$ and unnecessary word-node logit computation. In contrast, our decoupled approach follows only the *document-logit* flow (Eq. ([6](https://arxiv.org/html/2606.23694#S3.E6))), reducing sparse operations to $O((2E_{dw} + E_{ww})H)$, i.e., half of the edge traversals. The decoupled formulation is lossless for document classification, as it is directly derived from the original TextGCN equations. Since the loss is computed only on document logits, removing the unused word branch preserves the decision function while eliminating redundant computation. A detailed derivation is given in Section [0.A.1](https://arxiv.org/html/2606.23694#Pt0.A1.SS1) of the Appendix.
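The losslessness claim can be illustrated numerically. The sketch below builds the block adjacency with a zero document–document block, applies two linear hops with one-hot node features (so the first-layer weight matrix stacks a document part `W_d` and a word part `W_w`), and compares the document rows with the decoupled expression of Eq. (6). The random matrices and variable names are assumptions for illustration, and the activation is omitted to mirror Eq. (6) as written.

```python
import numpy as np

rng = np.random.default_rng(0)
n_d, n_w, H, C = 4, 7, 5, 3
A_dw = rng.random((n_d, n_w))        # stand-in document-word weights
A_ww = rng.random((n_w, n_w))
A_ww = (A_ww + A_ww.T) / 2           # stand-in symmetric word-word weights
W_d = rng.normal(size=(n_d, H))      # plays the role of W_1
W_w = rng.normal(size=(n_w, H))      # plays the role of W_2
W3 = rng.normal(size=(H, C))

# Original flow: one heterogeneous adjacency, logits for documents AND words
A = np.block([[np.zeros((n_d, n_d)), A_dw],
              [A_dw.T,               A_ww]])
W0 = np.vstack([W_d, W_w])           # first-layer weights for one-hot inputs
full_logits = A @ (A @ W0) @ W3      # shape (n_d + n_w, C)
doc_logits_full = full_logits[:n_d]  # only these rows enter the loss

# Decoupled flow: compute the document rows directly, as in Eq. (6)
doc_logits_dec = A_dw @ (A_dw.T @ W_d + A_ww @ W_w) @ W3
```

The document rows coincide because the zero document–document block means the second hop reads only the word-node hidden states. The same argument holds if an elementwise activation is applied to the hidden layer.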
Empirically (Table [3](https://arxiv.org/html/2606.23694#S5.T3)), the decoupled variant achieves 2× to 10× faster training than the original TextGCN despite using more epochs, indicating much lower per-epoch cost. The improvement is most pronounced on 20NG, where training time drops from 1,450 seconds to 150 seconds, confirming improved scalability.
## 6 Ablation studies
### 6.1 Effect of graph-style, edge reweighting, and labeling strategy
Table [4](https://arxiv.org/html/2606.23694#S6.T4) evaluates three factors affecting modularity training: (i) the $\mathcal{G}_{doc}$ construction method; (ii) label-aware edge reweighting; and (iii) the supervision source for modularity.
Graph-style (Gr): We evaluate three $\mathcal{G}_{doc}$ construction strategies: (i) TF–IDF inner product (tf): $S_{ij} = (A_{dw} A_{dw}^{\top})_{ij}$, where $A_{dw}$ is the TF–IDF matrix. This approach does not use language models and relies purely on sparse matrix multiplication over lexical features. (ii) Cosine similarity (c): $S_{ij} = \cos(e_i, e_j)$, where $e_i$ is the embedding of the $i$-th document. (iii) Gaussian similarity (g): $S_{ij} = \exp\!\left(-\frac{\|e_i - e_j\|^2}{2\sigma^2}\right)$; see Section [3.2](https://arxiv.org/html/2606.23694#S3.SS2). The latter two approaches construct the document graph using transformer-based embeddings. Observations: Gaussian (RBF) graphs achieve the best micro-F1 on four of the five datasets and clearly outperform TF–IDF-based similarities throughout; the exception is 20NG, where cosine similarity with reweighting is slightly ahead (91.1 vs. 90.7). This indicates that localized kernel affinities generally capture semantic neighborhoods better. Moreover, among the three encoders in Table [5](https://arxiv.org/html/2606.23694#S6.T5), SBERT-all-mpnet-base achieves the best overall accuracy and is the most consistent across datasets.
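The two alternatives to the Gaussian kernel can be sketched as follows. The function names, the dense NumPy representation, and the zeroed diagonal are assumptions for illustration; in practice the TF–IDF matrix would be sparse.

```python
import numpy as np

def tfidf_graph(A_dw: np.ndarray) -> np.ndarray:
    """(tf) S = A_dw A_dw^T: lexical overlap, no language model involved."""
    S = A_dw @ A_dw.T
    np.fill_diagonal(S, 0.0)   # assumption: drop self-similarity
    return S

def cosine_graph(emb: np.ndarray) -> np.ndarray:
    """(c) S_ij = cos(e_i, e_j) on document embeddings (rows must be non-zero)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    S = unit @ unit.T
    np.fill_diagonal(S, 0.0)
    return S

rng = np.random.default_rng(0)
A_dw = rng.random((5, 12))      # stand-in non-negative TF-IDF weights
emb = rng.normal(size=(5, 8))   # stand-in transformer embeddings
S_tf = tfidf_graph(A_dw)
S_cos = cosine_graph(emb)
```

One practical difference is that cosine similarities can be negative, whereas TF–IDF inner products and Gaussian affinities are non-negative. How negative weights are treated before computing degrees is not specified in the text, so any clipping or shifting would be an additional assumption.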
Edge reweighting (W): To inject label supervision into the modularity term, we amplify same-class edges and attenuate cross-class edges by factors $\alpha > 1$ and $\beta < 1$ over each training document's top-$k$ neighbors. Edges connecting validation/test nodes keep their original weights to prevent label leakage. This retains semantic similarity while promoting label-consistent communities. Observation: Enabling label-aware edge reweighting improves performance by strengthening intra-class similarities, suggesting that modest supervision injected at the similarity level enhances modular community formation.
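One possible reading of this reweighting step is sketched below. The function `reweight_edges`, the symmetric update, and the rule that an edge is rescaled only when both endpoints are training documents are assumptions made for illustration; the text does not pin down these details.

```python
import numpy as np

def reweight_edges(S, labels, train_mask, alpha=1.2, beta=0.4, k=10):
    """Scale top-k neighbour edges of training docs: x alpha if same class, x beta otherwise."""
    S_new = S.copy()
    for i in np.flatnonzero(train_mask):
        row = S[i].copy()
        row[i] = -np.inf                   # never select the node itself
        nbrs = np.argsort(row)[::-1][:k]   # indices of the k largest similarities
        for j in nbrs:
            if not train_mask[j]:
                continue                   # validation/test edges keep their weight
            factor = alpha if labels[i] == labels[j] else beta
            # Always rescale the ORIGINAL weight so an edge reached from both
            # endpoints is not scaled twice
            S_new[i, j] = S_new[j, i] = S[i, j] * factor
    return S_new

# Toy graph: three training documents and one held-out document
S = np.full((4, 4), 0.5)
np.fill_diagonal(S, 0.0)
labels = np.array([0, 0, 1, 0])
train_mask = np.array([True, True, True, False])
S_rw = reweight_edges(S, labels, train_mask, alpha=1.2, beta=0.4, k=3)
```

In this toy case the same-class training edge (0, 1) grows from 0.5 to 0.6, the cross-class edges to document 2 shrink to 0.2, and every edge touching the held-out document 3 stays at 0.5, so its label never influences the graph.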
Labeling strategy (L): We evaluate two supervision choices for the modularity loss: (i) gold labels on training nodes with predicted labels on validation/test nodes (gt), and (ii) predicted labels for *all* nodes (p). Observations: Using predictions (p) in place of ground-truth labels (gt) often improves performance, particularly under class imbalance. On Ohsumed, for example, Gaussian + edge reweighting + predicted labels achieves 72.0/65.0 (micro/macro F1) vs. 71.5/63.3 with gold training labels. This indicates that prediction-based supervision can regularize modularity optimization and reduce overfitting to a limited set of labeled nodes.
Overall, ModTGCN under the {Gaussian, reweighting enabled, predicted labels} ({g,T,p}) configuration gives the strongest results overall, and we use this configuration for all other experiments. Table 4 also shows that using predicted labels for all nodes performs slightly better than gold-only supervision, indicating robustness to pseudo-label errors. Moreover, modularity increases smoothly over epochs (Figure [2(a)](https://arxiv.org/html/2606.23694#S6.F2.sf1)), suggesting stable community refinement rather than error amplification. A warm-up strategy (ModTGCN(F) warmup in Table [1](https://arxiv.org/html/2606.23694#S5.T1)), which delays the modularity term for the first few epochs, yields only small differences (−0.03 on MR, −0.30 on R8, −0.02 on R52, −1.18 on Ohsumed, −1.23 on 20NG), suggesting that early prediction noise does not significantly affect convergence.
Table 4: Micro/Macro F1 performance (mean ± std) under different configurations. Graph style: Gr ∈ (c, tf, g), with c = Cosine, tf = TF–IDF, g = Gaussian. Edge reweighting: W ∈ (T, F), with T if the weighted adjustment is applied and F otherwise. Labeling strategy: L ∈ (gt, p), where gt and p denote ground-truth and predicted labels for the train set, respectively.

| Gr | W | L | MR | R8 | R52 | Ohs | 20NG |
|---|---|---|---|---|---|---|---|
| c | T | gt | 81.2 ± 0.1 / 81.2 ± 0.1 | 97.4 ± 0.1 / 92.2 ± 0.4 | 94.1 ± 0.2 / 67.0 ± 1.5 | 71.5 ± 0.3 / 63.3 ± 0.5 | 91.1 ± 0.1 / 90.5 ± 0.1 |
| c | T | p | 81.4 ± 0.2 / 81.4 ± 0.2 | 97.4 ± 0.2 / 92.2 ± 0.3 | 94.3 ± 0.2 / 69.1 ± 2.4 | 71.7 ± 0.2 / 65.0 ± 0.4 | 90.8 ± 0.2 / 90.1 ± 0.2 |
| c | F | gt | 81.6 ± 0.1 / 81.6 ± 0.1 | 97.1 ± 0.1 / 92.0 ± 0.5 | 93.7 ± 0.2 / 69.0 ± 2.0 | 68.9 ± 0.3 / 61.9 ± 0.7 | 88.2 ± 0.1 / 87.3 ± 0.2 |
| c | F | p | 80.8 ± 0.1 / 80.8 ± 0.1 | 96.8 ± 0.1 / 90.5 ± 2.3 | 93.9 ± 0.1 / 67.6 ± 1.2 | 68.2 ± 0.9 / 60.8 ± 2.1 | 88.1 ± 0.2 / 87.3 ± 0.2 |
| tf | T | gt | 76.4 ± 0.3 / 76.4 ± 0.3 | 95.8 ± 0.3 / 86.6 ± 1.9 | 92.9 ± 0.1 / 66.6 ± 2.4 | 68.3 ± 0.3 / 62.6 ± 0.2 | 86.9 ± 0.1 / 86.2 ± 0.1 |
| tf | T | p | 76.6 ± 0.2 / 76.6 ± 0.2 | 96.3 ± 0.2 / 88.5 ± 2.2 | 92.8 ± 0.1 / 66.9 ± 2.2 | 68.2 ± 0.3 / 62.5 ± 0.4 | 86.7 ± 0.2 / 86.2 ± 0.2 |
| tf | F | gt | 77.0 ± 0.1 / 77.0 ± 0.1 | 97.0 ± 0.2 / 91.7 ± 1.7 | 93.5 ± 0.3 / 66.8 ± 3.9 | 68.1 ± 0.1 / 62.4 ± 0.3 | 86.5 ± 0.2 / 86.0 ± 0.3 |
| tf | F | p | 76.8 ± 0.1 / 76.8 ± 0.1 | 97.0 ± 0.2 / 90.8 ± 1.5 | 93.9 ± 0.1 / 69.4 ± 1.8 | 67.9 ± 0.6 / 61.3 ± 1.0 | 86.3 ± 0.3 / 86.0 ± 0.2 |
| g | T | gt | 81.6 ± 0.0 / 81.6 ± 0.0 | 97.6 ± 0.3 / 92.7 ± 1.2 | 94.5 ± 0.2 / 67.6 ± 1.9 | 71.5 ± 0.3 / 63.3 ± 0.7 | 90.7 ± 1.0 / 90.0 ± 0.9 |
| g | T | p | 81.4 ± 0.1 / 81.4 ± 0.1 | 97.5 ± 0.1 / 92.3 ± 0.4 | 94.5 ± 0.1 / 69.9 ± 3.0 | 72.0 ± 0.1 / 65.0 ± 0.4 | 90.6 ± 0.1 / 89.9 ± 0.1 |
| g | F | gt | 82.4 ± 0.1 / 82.4 ± 0.1 | 97.0 ± 0.2 / 90.2 ± 1.8 | 93.9 ± 0.2 / 67.7 ± 2.0 | 70.7 ± 0.2 / 63.9 ± 0.5 | 90.0 ± 0.1 / 89.1 ± 0.1 |
| g | F | p | 81.1 ± 0.1 / 81.1 ± 0.1 | 96.7 ± 0.3 / 90.0 ± 1.6 | 93.8 ± 0.2 / 67.2 ± 1.6 | 69.2 ± 0.3 / 62.9 ± 0.3 | 90.0 ± 0.1 / 89.2 ± 0.1 |
Table 5: Performance of ModTGCN under the {g,T,p} configuration across three pre-trained transformer embeddings.

| Encoder | MR | R8 | R52 | Ohs | 20NG |
|---|---|---|---|---|---|
| SBERT-all-mpnet-base | 81.45 ± 0.05 | 97.55 ± 0.06 | 94.54 ± 0.11 | 71.97 ± 0.15 | 90.60 ± 0.10 |
| BERT-base-uncased | 76.89 ± 0.41 | 96.76 ± 0.46 | 93.72 ± 0.18 | 67.72 ± 0.12 | 86.85 ± 0.12 |
| RoBERTa-large | 77.12 ± 0.52 | 96.99 ± 0.21 | 93.90 ± 0.15 | 68.10 ± 0.21 | 86.93 ± 0.09 |
### 6.2 Hyperparameter sensitivity analysis
Table [6](https://arxiv.org/html/2606.23694#S6.T6) analyzes the sensitivity of ModTGCN to its key hyperparameters: the modularity resolution $\gamma$, the loss weight $\lambda$, the reweighting factors $\alpha$ and $\beta$, and the neighborhood size $k$. The loss weight $\lambda$ performs best in the intermediate range 0.25 to 0.5, indicating that the classification and modularity objectives should be balanced. Similarly, a moderate resolution ($\gamma \sim 3.0$) yields the strongest results, indicating that neither overly coarse nor overly fine communities are optimal. The edge reweighting factors $\alpha, \beta$ and the neighborhood size $k$ also show clear optima: smaller values ($\alpha \sim 1.2$, $\beta \sim 0.4$) with a moderate $k$ ($\sim 10$) produce the strongest results. These results indicate that the benefits of modularity optimization are consistent and robust across reasonable parameter choices.
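Table 6: Ablation studies for the key hyperparameters $\gamma$, $\lambda$, $\alpha$, $\beta$, and $k$. Micro-F1 is reported in %.

| $\gamma$ | MR | R8 | Ohsumed | R52 | 20NG |
|---|---|---|---|---|---|
| 0.1 | 75.7 | 97.1 | 72.6 | 94.4 | 87.8 |
| 0.5 | 80.0 | 97.4 | 69.5 | 94.4 | 87.8 |
| 1.0 | 80.6 | 97.5 | 70.7 | 94.5 | 88.6 |
| 3.0 | 81.6 | 97.3 | 71.9 | 94.4 | 89.9 |
| 5.0 | 81.4 | 97.3 | 71.9 | 94.0 | 90.7 |

| $\lambda$ | MR | R8 | Ohsumed | R52 | 20NG |
|---|---|---|---|---|---|
| 0.0 | 68.4 | 5.5 | 16.5 | 0.7 | 4.6 |
| 0.25 | 81.3 | 93.7 | 67.6 | 94.5 | 90.6 |
| 0.5 | 81.3 | 96.8 | 71.3 | 94.1 | 90.4 |
| 0.75 | 79.4 | 97.4 | 72.8 | 93.8 | 89.9 |
| 1.0 | 76.5 | 96.8 | 66.5 | 93.4 | 86.0 |

| $k$ | MR | R8 | Ohsumed | R52 | 20NG |
|---|---|---|---|---|---|
| 4 | 79.52 | 97.26 | 72.57 | 94.63 | 90.32 |
| 6 | 79.38 | 97.30 | 72.40 | 94.47 | 90.28 |
| 8 | 81.04 | 97.30 | 72.52 | 94.55 | 90.45 |
| 10 | 81.04 | 97.44 | 72.42 | 94.61 | 90.04 |
| 12 | 80.53 | 97.44 | 72.37 | 94.59 | 89.95 |

| $\beta$ | MR | R8 | Ohsumed | R52 | 20NG |
|---|---|---|---|---|---|
| 0.3 | 80.5 | 97.21 | 72.59 | 94.7 | 90.72 |
| 0.4 | 81.29 | 97.21 | 72.57 | 94.7 | 90.72 |
| 0.5 | 81.23 | 97.21 | 72.57 | 94.7 | 90.63 |
| 0.6 | 81.20 | 97.21 | 72.57 | 94.7 | 89.87 |
| 0.7 | 81.26 | 97.17 | 72.57 | 94.7 | 89.87 |

| $\alpha$ | MR | R8 | Ohsumed | R52 | 20NG |
|---|---|---|---|---|---|
| 1.0 | 80.84 | 97.30 | 72.57 | 94.31 | 90.59 |
| 1.2 | 81.06 | 97.39 | 72.54 | 94.51 | 90.17 |
| 1.4 | 81.06 | 97.39 | 72.57 | 94.70 | 90.04 |
| 1.6 | 81.26 | 97.21 | 72.59 | 94.74 | 89.86 |
| 1.8 | 80.84 | 97.17 | 72.57 | 94.70 | 89.88 |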
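To make the role of the resolution parameter concrete, the sketch below evaluates the standard resolution-weighted modularity of Reichardt and Bornholdt, $Q(\gamma) = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \gamma \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)$, for a hard partition. This is an illustrative assumption about the general form only: the differentiable objective actually optimized by ModTGCN is defined elsewhere in the paper and is not reproduced here, and the function name and list-of-lists input are hypothetical.

```python
def modularity(adj, communities, gamma=1.0):
    """Resolution-weighted modularity of a hard partition.

    adj: symmetric weighted adjacency matrix as a list of lists.
    communities: community label for each node.
    gamma: resolution parameter (gamma = 1 gives classical modularity).
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # equals 2m for an undirected graph
    if two_m == 0:
        raise ValueError("graph has no edges")
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                # Observed weight minus the degree-preserving null-model expectation.
                q += adj[i][j] - gamma * degree[i] * degree[j] / two_m
    return q / two_m
```

Larger values of `gamma` penalize big communities more heavily and so favor finer partitions, while smaller values favor coarser ones, which is the trade-off the resolution ablation probes.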
### 6.3 Optimization dynamics of CE and Modularity
Figure [2(a)](https://arxiv.org/html/2606.23694#S6.F2.sf1) illustrates that cross-entropy decreases rapidly in early epochs, establishing initial decision boundaries. As predictions stabilize, modularity increases steadily, reinforcing community-level alignment. The smooth convergence of the total loss suggests complementary optimization rather than objective conflict. Thus, cross-entropy shapes local classification boundaries, while modularity refines global embedding geometry. This behavior is further reflected in the silhouette scores (Figure [2(b)](https://arxiv.org/html/2606.23694#S6.F2.sf2)), which rise and stabilize over epochs, indicating improved intra-class cohesion and inter-class separation.

*Additional results, plots, dataset details, and a detailed derivation of the decoupling approach are available on the website [[11](https://arxiv.org/html/2606.23694#bib.bib48)].*
Figure 2: (a) Cross-entropy, modularity, and total loss across epochs on 20NG. (b) Silhouette score across epochs, showing improved cluster separation during training.
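For readers who want to reproduce a silhouette curve of this kind, the following is a small pure-Python sketch of the mean silhouette coefficient, computed from embeddings and class labels. Euclidean distance and the helper name `mean_silhouette` are assumptions; the section does not state which distance or implementation was used for Figure 2(b).

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        if len(clusters[lab]) == 1:
            continue  # singleton clusters contribute a score of 0
        a = mean_dist(i, clusters[lab])  # cohesion: mean intra-cluster distance
        b = min(  # separation: nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)
```

Values close to 1 indicate compact, well-separated classes; evaluating this on the document embeddings after each epoch would produce a curve comparable in spirit to the one described above.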
## 7 Conclusion
We introduced a modularity-aware GNN framework that complements local neighborhood aggregation with a global, community-aware objective for text classification. By integrating modularity into training, the model encourages class-coherent document communities and mitigates degree-related biases that can lead to over-smoothing or trivial partitions. Experiments across five benchmark datasets demonstrate consistent improvements over baselines. The best results are obtained using Gaussian similarity with prediction-based supervision, underscoring the value of global similarity modeling and hybrid label supervision. Overall, modularity optimization provides a lightweight yet effective complement to GCN-based text classification. Beyond classification, this framework suggests broader opportunities for incorporating community-aware objectives into graph learning tasks where mesoscopic structure plays a central role, including topic modeling, document clustering, and knowledge graph analysis.
## References
- [1] M. Barthélemy and S. Fortunato (2007). Resolution limit in community detection. Bulletin of the American Physical Society.
- [2] M. Bugueño and G. de Melo (2023). Connecting the dots: what graph-based text representations work best for text classification using GNNs? In EMNLP.
- [3] M. Chen, T. Nguyen, and B. K. Szymanski (2015). A new metric for quality of network community structure. CoRR. arXiv:1507.04308.
- [4] J. Guo, P. Singh, and K. Bassler (2020). Resolution limit revisited: community detection using generalized modularity density. Journal of Physics: Complexity.
- [5] S. C. Han, Z. Yuan, K. Wang, S. Long, and J. Poon (2022). Understanding graph convolutional networks for text classification. arXiv:2203.16060.
- [6] T. N. Kipf and M. Welling (2016). Semi-supervised classification with graph convolutional networks. CoRR. arXiv:1609.02907.
- [7] W. Li and N. Aletras (2022). Improving graph-based text representations with character and word level n-grams. In Association for Computational Linguistics.
- [8] Y. Lin, Y. Meng, X. Sun, et al. (2021). BertGCN: transductive text classification by combining GCN and BERT. CoRR. arXiv:2105.05727.
- [9] X. Liu, X. You, X. Zhang, J. Wu, and P. Lv (2020). Tensor graph convolutional networks for text classification. arXiv:2001.05313.
- [10] Z. Lu, P. Du, and J. Nie (2020). VGCN-BERT: augmenting BERT with graph embedding for text classification. CoRR. arXiv:2004.05707.
- [11] ModTGCN website. [Link](https://sites.google.com/pilani.bits-pilani.ac.in/modtgcn).
- [12] T. Murata and N. Afzal (2018). Modularity optimization as a training criterion for graph neural networks.
- [13] M. E. J. Newman and M. Girvan (2004). Finding and evaluating community structure in networks. Phys. Rev. E.
- [14] M. E. J. Newman (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences. https://www.pnas.org/doi/pdf/10.1073/pnas.0601602103
- [15] M. E. J. Newman (2016). Community detection in networks: modularity optimization and maximum likelihood are equivalent. CoRR. arXiv:1606.02319.
- [16] D. Pisarevskaya and A. Zubiaga (2025). Zero-shot and few-shot learning with instruction-following LLMs for claim matching in automated fact-checking. In COLING.
- [17] C. Qiu, Z. Huang, W. Xu, and H. Li (2022). VGAER: graph neural network reconstruction based community detection. CoRR. arXiv:2201.04066.
- [18] R. Ragesh, S. Sellamanickam, A. Iyer, et al. (2021). HeteGCN: heterogeneous graph convolutional networks for text classification. WSDM.
- [19] J. Reichardt and S. Bornholdt (2006). Statistical mechanics of community detection. Phys. Rev. E.
- [20] N. Reimers and I. Gurevych (2019). Sentence-BERT: sentence embeddings using siamese BERT-networks. CoRR. arXiv:1908.10084.
- [21] G. Salha-Galvan, J. F. Lutzeyer, et al. (2022). Modularity-aware graph autoencoders for joint community detection and link prediction. CoRR. arXiv:2202.00961.
- [22] A. Tsitsulin, J. Palowitch, B. Perozzi, and E. Müller (2020). Graph clustering with graph neural networks. CoRR. arXiv:2006.16904.
- [23] S. Vajjala and S. Shimangaud (2025). Text classification in the LLM era – where do we stand? arXiv:2502.11830.
- [24] L. Yao, C. Mao, and Y. Luo (2018). Graph convolutional networks for text classification. CoRR. arXiv:1809.05679.
- [25] D. Zhou, O. Bousquet, T. N. Lal, et al. (2003). Learning with local and global consistency. NIPS'03.
## Appendix 0.A
### 0.A.1 Decoupled Formulation of TextGCN
The original TextGCN operates on a heterogeneous graph with adjacency:
$$A=\begin{bmatrix}0 & A_{dw}\\ A_{wd} & A_{ww}\end{bmatrix}, \tag{8}$$

where $A_{dw}\in\mathbb{R}^{n_d\times n_w}$ denotes document–word edges, $A_{wd}=A_{dw}^{\top}$, and $A_{ww}\in\mathbb{R}^{n_w\times n_w}$ encodes word–word co-occurrence.
Let the node representations at layer $l$ be:

$$H^{(l)}=\begin{bmatrix}H_d^{(l)}\\ H_w^{(l)}\end{bmatrix}, \tag{9}$$

where $H_d^{(l)}$ and $H_w^{(l)}$ denote document and word embeddings, respectively.
A single GCN layer performs:
$$H^{(l+1)}=AH^{(l)}W. \tag{10}$$
Expanding block\-wise:
$$H_d^{(l+1)} = A_{dw}H_w^{(l)}W_d, \tag{11}$$

$$H_w^{(l+1)} = A_{wd}H_d^{(l)}W_{wd} + A_{ww}H_w^{(l)}W_{ww}. \tag{12}$$
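As a quick check of the block-wise expansion, multiplying out Eq. (10) with the block forms of Eqs. (8) and (9) and a single shared weight matrix $W$ gives:

$$\begin{bmatrix}H_d^{(l+1)}\\ H_w^{(l+1)}\end{bmatrix}
=\begin{bmatrix}0 & A_{dw}\\ A_{wd} & A_{ww}\end{bmatrix}
\begin{bmatrix}H_d^{(l)}\\ H_w^{(l)}\end{bmatrix}W
=\begin{bmatrix}A_{dw}H_w^{(l)}W\\ \left(A_{wd}H_d^{(l)}+A_{ww}H_w^{(l)}\right)W\end{bmatrix}.$$

The document row has no $H_d^{(l)}$ term because the document–document block of $A$ is zero. Eqs. (11) and (12) have the same structure; the separate matrices $W_d$, $W_{wd}$, and $W_{ww}$ appear to generalize the shared $W$ by giving each block its own weights, which is an interpretation rather than something stated explicitly in the text.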
##### Eliminating word\-node states\.
Since supervision is applied only on document nodes, the word embeddings $H_w^{(l)}$ act as intermediate variables and can be eliminated via substitution. Ignoring higher-order recursion, we approximate:

$$H_w^{(l)} \approx A_{wd}H_d^{(l)}W_1 + A_{ww}W_2, \tag{13}$$

where the first term captures document-induced signals and the second term encodes intrinsic word co-occurrence structure.
Substituting into the document update:
$$H_d^{(l+1)} = A_{dw}\left(A_{wd}H_d^{(l)}W_1 + A_{ww}W_2\right)W_3. \tag{14}$$

Using $A_{wd}=A_{dw}^{\top}$, we obtain:

$$H_d^{(l+1)} = A_{dw}\left(A_{dw}^{\top}H_d^{(l)}W_1 + A_{ww}W_2\right)W_3. \tag{15}$$
Finally, the prediction is computed as:
$$\hat{Y}=\mathrm{softmax}\left(A_{dw}\left(A_{dw}^{\top}W_1 + A_{ww}W_2\right)W_3\right). \tag{16}$$
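The prediction rule in Eq. (16) can be sketched as a short forward pass that never materializes the full heterogeneous adjacency. This is a minimal illustration, not the authors' implementation: it assumes identity (one-hot) document input features, which would explain why no $H_d$ term appears in Eq. (16), and the weight shapes, function name, and use of dense NumPy arrays are all assumptions.

```python
import numpy as np


def decoupled_forward(A_dw, A_ww, W1, W2, W3):
    """Document class probabilities following the form of Eq. (16).

    A_dw: (n_d, n_w) document-word adjacency.
    A_ww: (n_w, n_w) word-word adjacency.
    W1: (n_d, h), W2: (n_w, h), W3: (h, n_classes) weights (assumed shapes).
    """
    # Implicit word states: document-induced term plus co-occurrence term.
    word_states = A_dw.T @ W1 + A_ww @ W2      # (n_w, h)
    # Propagate back to documents and project to class logits.
    logits = A_dw @ word_states @ W3           # (n_d, n_classes)
    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# Toy usage with random inputs (4 documents, 6 words, hidden size 3, 2 classes).
rng = np.random.default_rng(0)
probs = decoupled_forward(
    rng.random((4, 6)),
    rng.random((6, 6)),
    rng.normal(size=(4, 3)),
    rng.normal(size=(6, 3)),
    rng.normal(size=(3, 2)),
)
```

Only the document–word and word–word blocks are touched, so the largest intermediate has shape `(n_w, h)`; on realistic corpora one would likely store both adjacency blocks as sparse matrices.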
##### Discussion\.
This formulation removes explicit word-node embeddings while keeping the structure of the original propagation mechanism. The word nodes are represented implicitly through the composed operators $A_{dw}^{\top}$ and $A_{ww}$, yielding a document-level transformation that, under the approximation in Eq. (13), stands in for the original update without maintaining intermediate word states. As a result, redundant computations are eliminated while the decision function for document classification is retained up to that approximation.
Figure 3: Training dynamics of ModTGCN on four benchmark datasets: (a) MR, (b) Ohsumed, (c) R52, and (d) R8. Each plot shows the evolution of modularity loss, cross-entropy loss, and total loss across epochs. The consistent decrease in cross-entropy alongside modularity optimization demonstrates stable convergence and highlights the complementary role of global community structure in improving classification performance.
### 0.A.2 Robustness of ModTGCN to early pseudo-label noise during training
Because the modularity term incorporates predicted labels for unlabeled nodes, we examine its sensitivity to early-stage prediction noise using a warm-up strategy that delays the application of the modularity objective for a few initial epochs. As shown in Table [7](https://arxiv.org/html/2606.23694#Pt0.A1.T7), the impact of warm-up is small across datasets. On MR, R8, and R52, the performance differences remain within 0.3 points, indicating that early prediction noise does not adversely affect optimization in relatively high-homophily settings. For more structurally complex datasets such as Ohsumed and 20NG, we observe larger gaps (1.18 and 1.23 points, respectively), and in both cases training without warm-up performs better.
Overall, these results suggest that the modularity objective is robust to noisy early predictions and can be applied from the beginning of training without requiring a dedicated warm\-up phase\.
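A warm-up of the kind described above can be expressed as a simple schedule on the weight of the modularity term. The sketch below is hypothetical: the function name, the hard on/off switch, and the example values of 3 warm-up epochs and a weight of 0.5 are assumptions, since the section does not specify the warm-up length or how the term is phased in.

```python
def modularity_weight(epoch, warmup_epochs, weight):
    """Return the modularity-term weight for a 0-indexed epoch.

    During the first `warmup_epochs` epochs the term is switched off, so
    training is driven by cross-entropy alone; afterwards it is applied fully.
    """
    if warmup_epochs < 0:
        raise ValueError("warmup_epochs must be non-negative")
    return 0.0 if epoch < warmup_epochs else weight


# Example schedule: three warm-up epochs, then a constant weight.
schedule = [modularity_weight(e, warmup_epochs=3, weight=0.5) for e in range(6)]
```

Setting `warmup_epochs=0` corresponds to the "without warm-up" setting, in which the modularity term is active from the first epoch.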
Table 7: Effect of modularity warm-up to assess sensitivity to early-stage noise from predicted labels. $\Delta$ denotes the difference between models trained with and without warm-up (with minus without).

| Dataset | With Warm-up | Without Warm-up | $\Delta$ |
|---|---|---|---|
| MR | 88.04 | 88.07 | -0.03 |
| R8 | 98.40 | 98.70 | -0.30 |
| R52 | 96.14 | 96.16 | -0.02 |
| Ohsumed | 76.34 | 77.52 | -1.18 |
| 20NG | 89.91 | 91.14 | -1.23 |

Table 8: Comparison of Linear Probing and ModTGCN under pre-trained embeddings (P) across datasets with varying homophily. $\Delta$ Micro-F1 denotes the performance difference (ModTGCN - Linear Probing).

| Dataset | Homophily | Linear Probing (P) | ModTGCN (P) | $\Delta$ |
|---|---|---|---|---|
| MR | 0.70 | 83.49 | 81.45 | -2.04 |
| R8 | 0.50 | 98.25 | 97.55 | -0.70 |
| R52 | 0.38 | 95.26 | 94.54 | -0.72 |
| Ohsumed | 0.16 | 69.87 | 71.97 | +2.10 |
| 20NG | 0.19 | 79.06 | 90.60 | +11.54 |
### 0.A.3 ModTGCN: Relationship with homophily
Compared to embedding-only baselines (Linear Probing (P)), ModTGCN achieves slightly lower performance on simpler, high-homophily datasets (MR, R8, R52), where pretrained representations are already nearly linearly separable, leaving limited room for structural refinement. In contrast, modularity yields substantial gains on structurally complex, low-homophily datasets (Ohsumed +2.10, 20NG +11.54). This trend indicates that modularity refinement is most beneficial in low-homophily regimes with overlapping semantic boundaries, and less impactful on trivially separable datasets.
ModTGCN shows larger improvements on low-homophily datasets because it incorporates global, degree-corrected structural information through the modularity objective, whereas standard GNNs rely primarily on local neighborhood aggregation. In low-homophily settings, neighboring nodes often belong to different classes, causing local message passing to propagate noisy or conflicting signals and leading to over-smoothing. In contrast, modularity introduces a global coupling mechanism that encourages nodes to align with communities exhibiting higher-than-expected connectivity under a degree-preserving null model. This allows the model to capture long-range, statistically significant structures beyond immediate neighborhoods, while mitigating the influence of hub nodes and noisy edges. As a result, ModTGCN can recover class-consistent communities even when the local graph structure is weak, leading to substantially improved performance in low-homophily regimes.
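For reference, one common way to quantify homophily is the edge homophily ratio, the fraction of edges whose endpoints share a class label. The sketch below uses that definition as an assumption; the section does not state which homophily measure underlies the values in Table 8, and the function name and edge-list input are hypothetical.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a class label.

    edges: iterable of (u, v) node-index pairs for an undirected graph.
    labels: class label for each node, indexed by node id.
    """
    edges = list(edges)
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)
```

Under this definition, a value near 1 means neighbors almost always agree on the class, while a value near 0 means most edges cross class boundaries, which is the regime where local message passing is expected to struggle.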