GiLT: Augmenting Transformer Language Models with Dependency Graphs


Summary

The paper proposes GiLT (Graph-Infused Layers Transformer Language Model), which improves syntactic generalization by modulating attention weights using features from dependency graphs constructed incrementally during token prediction, outperforming baselines while maintaining competitive perplexity.

arXiv:2605.15562v1 Announce Type: new Abstract: Augmenting Transformers with linguistic structures effectively enhances the syntactic generalization performance of language models. Previous work in this direction focuses on syntactic tree structures of languages, in particular constituency tree structures. We propose Graph-Infused Layers Transformer Language Model (GiLT) which leverages dependency graphs for augmenting Transformer language models. Unlike most previous work, GiLT does not insert extra structural tokens in language modeling; instead, it injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. In our experiments, GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines. In addition, GiLT can be finetuned from a pretrained language model to achieve improved downstream task performance. Our code is released at https://github.com/cookie-pie-oops/GiLT-LM.

Cached at: 05/18/26, 06:32 AM

# GiLT: Augmenting Transformer Language Models with Dependency Graphs
Source: [https://arxiv.org/html/2605.15562](https://arxiv.org/html/2605.15562)
Tianyu Huang, Yida Zhao, Chuyan Zhou, Kewei Tu
School of Information Science and Technology, ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging
{huangty2024,zhaoyd2023,zhouchy2022,tukw}@shanghaitech.edu.cn

###### Abstract

Augmenting Transformers with linguistic structures effectively enhances the syntactic generalization performance of language models. Previous work in this direction focuses on syntactic tree structures of languages, in particular constituency tree structures. We propose the *Graph-Infused Layers Transformer Language Model* (GiLT), which leverages dependency graphs for augmenting Transformer language models. Unlike most previous work, GiLT does not insert extra structural tokens in language modeling; instead, it injects structural information into language modeling by modulating attention weights in the Transformer with features extracted from the dependency graph that is incrementally constructed along with token prediction. In our experiments, GiLT with semantic dependency graphs achieves better syntactic generalization while maintaining competitive perplexity in comparison with Transformer language model baselines. In addition, GiLT can be finetuned from a pretrained language model to achieve improved downstream task performance. Our code is released at [https://github.com/cookie-pie-oops/GiLT-LM](https://github.com/cookie-pie-oops/GiLT-LM).

Corresponding author: Kewei Tu.

## 1 Introduction

Transformer language models (LMs) have shown excellent performance in language modeling and downstream tasks (Vaswani et al., [2017](https://arxiv.org/html/2605.15562#bib.bib3)). Notably, linguistic structures such as syntactic and semantic parses, which were deemed essential in traditional natural language processing, are absent from the model design and training process of Transformer LMs.

Over the past decade, a number of researchers have tried to integrate linguistic structures into neural language models. Among them are syntactic LMs, which jointly model syntactic structures and surface words (Choe and Charniak, [2016](https://arxiv.org/html/2605.15562#bib.bib7)). These include earlier work such as RNNG, which combines constituency parsing with recurrent neural networks (Dyer et al., [2016](https://arxiv.org/html/2605.15562#bib.bib5); Kim et al., [2019](https://arxiv.org/html/2605.15562#bib.bib6); Noji and Oseki, [2021](https://arxiv.org/html/2605.15562#bib.bib8)), and recent studies that incorporate constituency and dependency syntax into Transformers (Yoshida and Oseki, [2022](https://arxiv.org/html/2605.15562#bib.bib16); Qian et al., [2021](https://arxiv.org/html/2605.15562#bib.bib13); Sartran et al., [2022](https://arxiv.org/html/2605.15562#bib.bib15); Murty et al., [2023](https://arxiv.org/html/2605.15562#bib.bib14); Zhao et al., [2024](https://arxiv.org/html/2605.15562#bib.bib10)). Empirically, they achieve stronger syntactic generalization than standard Transformer LMs while retaining competitive language modeling performance.

However, existing research in this direction has two major limitations. First, most of it is based on constituency tree structures. Dependency tree structures, another important form of syntax, receive much less attention (Zhao et al., [2024](https://arxiv.org/html/2605.15562#bib.bib10)). In addition, little work has been done to jointly model linguistic structures other than syntactic trees in LMs. Second, most existing methods require inserting additional tree-building operations into the input and output sequences, leading to longer sequences and higher computational cost, and making it harder to finetune a pretrained LM into a syntactic LM. An exception is Pushdown Layers (Murty et al., [2023](https://arxiv.org/html/2605.15562#bib.bib14)), which leverages syntactic trees to guide attention computation without changing the LM's symbol space.

In this paper, we propose the Graph-Infused Layers Transformer LM (GiLT), which addresses the above-mentioned limitations in integrating linguistic structures into Transformer LMs. GiLT is based on *dependency graphs*, which subsume both syntactic dependency trees and *semantic* dependency graphs, thus extending syntactic LM research beyond syntax. Inspired by Pushdown Layers (Murty et al., [2023](https://arxiv.org/html/2605.15562#bib.bib14)), GiLT incrementally constructs dependency graphs without changing the symbol space of the underlying LM, and modulates attention scores with features computed from graph attributes such as node degrees, depths and distances.

Experimental results show that GiLT achieves gains in syntactic generalization over baselines with almost no degradation in perplexity on language modeling\. Furthermore, GiLT finetuned from a pretrained GPT2 achieves better performance on downstream tasks compared with the original pretrained GPT2, suggesting that the Graph\-Infused layer is a competitive alternative to standard self\-attention\.

In summary, our contributions are as follows:

- We propose *Graph-Infused Layers*, which leverage dependency graphs to enhance LMs through our novel *graph-based feature tapes* without modifying the input or output space.
- Comprehensive experiments on language modeling, syntactic evaluation and finetuning on text classification demonstrate competitive perplexity, improved syntactic generalization and improved language understanding. An ablation study on feature tapes shows the importance of each component, and a generation speed test illustrates the advantage of not requiring extra tokens.

## 2 Background

### 2.1 Pushdown Layers

The Transformer LM with Pushdown Layers (Pushdown-LM; Murty et al., [2023](https://arxiv.org/html/2605.15562#bib.bib14)) is a syntactic LM that incrementally builds constituency trees and modulates attention scores based on them. Unlike other syntactic LMs, it does not change the symbol space of the underlying LM.

At each decoding step $i$, Pushdown-LM predicts shift/reduce operations to simulate the state of a pushdown automaton that corresponds to the partially built constituency tree, and records on a *stack tape* $\mathbf{t}_i$ the depths of all already generated tokens in the partially built constituency tree.

Pushdown-LM then augments self-attention with the stack tape $\mathbf{t}_i$:

$$\tilde{\alpha}^{l}_{ij} = [\mathbf{h}^{l}_{j} + \mathbf{d}^{l}_{ij}]^{\top} \mathbf{W}_{k}^{\top} \mathbf{W}_{q} \mathbf{h}^{l}_{i} \tag{1}$$

where $\tilde{\alpha}^{l}_{ij}$ is the pre-softmax attention score assigned to the $j$-th token from the $i$-th token at layer $l$, $\mathbf{h}^{l}_{j}$ is the hidden state of the $j$-th token at the $l$-th attention block, $\mathbf{d}^{l}_{ij}$ is the embedding of the depth of the $j$-th token recorded in $\mathbf{t}_i$, and $\mathbf{W}_k$ and $\mathbf{W}_q$ are the learnable key and query matrices in self-attention. In this way, structural information from the constituency tree is implicitly introduced into self-attention computation and thus influences the decoding of the underlying LM.

### 2.2 Semantic Dependency Graphs

A semantic dependency graph takes the form of a directed acyclic graph instead of a tree. Its nodes correspond to words, and its dependencies express semantic relations (*e.g.*, agent and patient in Palmer et al. [2005](https://arxiv.org/html/2605.15562#bib.bib38)). The graph often includes a virtual root node.

In this paper, we consider three types of semantic dependency graphs from Oepen et al. ([2015](https://arxiv.org/html/2605.15562#bib.bib33)), as discussed below. DELPH-IN MRS-Derived Bi-Lexical Dependencies (DM; Flickinger et al., [2012](https://arxiv.org/html/2605.15562#bib.bib35)) are derived from Deep Bank (Flickinger, [2000](https://arxiv.org/html/2605.15562#bib.bib34)), in which roots designate the highest-scoping predicate in the graph. Enju Predicate-Argument Structures (PAS) originate from the Enju Treebank (Miyao, [2006](https://arxiv.org/html/2605.15562#bib.bib36)), which is obtained by automatically annotating the PTB. The root of PAS denotes the semantic head of the sentence. Prague Semantic Dependencies (PSD) are based on the Prague Czech-English Dependency Treebank (Hajic et al., [2012](https://arxiv.org/html/2605.15562#bib.bib37)), where the roots mostly correspond to main verbs.

## 3 Graph-Infused Layers

We introduce a dependency-graph-based language model, the *Graph-Infused Layers Transformer LM* (GiLT), which simultaneously generates the tokens of a sentence and the dependencies that incrementally construct a dependency graph over it. We first score possible dependencies that link the current word to previous words (Section [3.1](https://arxiv.org/html/2605.15562#S3.SS1)), then update the dependency graph based on the scoring (Section [3.2](https://arxiv.org/html/2605.15562#S3.SS2)), and finally use the *graph-based feature tapes* (Section [3.3](https://arxiv.org/html/2605.15562#S3.SS3)), which characterize generated tokens in the graph, to modulate attention computation (Section [3.4](https://arxiv.org/html/2605.15562#S3.SS4)).

### 3.1 Dependency Scoring

Whenever a word $w_i$ is generated by the Transformer LM, we score all possible dependencies to and from $w_i$ with a biaffine mechanism. Since a word may correspond to multiple tokens, we first define the word-level representations that serve as input to the biaffine module.

Suppose a word $w_i$ is tokenized into $m$ tokens with input embeddings $\{\mathbf{x}_k, \cdots, \mathbf{x}_{k+m-1}\}$ and corresponding hidden states $\{\mathbf{h}_k^l, \cdots, \mathbf{h}_{k+m-1}^l\} \subseteq \mathbb{R}^d$ from all layers $l = 1, \dots, L$. We define its word-level representation $\mathbf{o}_i \in \mathbb{R}^{3d}$ by concatenating three components: (i) the hidden state from the middle layer, $\mathbf{h}_{k-1}^{L/2}$; (ii) the hidden state from the penultimate layer, $\mathbf{h}_{k-1}^{L-1}$; (iii) the input embedding of the first token, $\mathbf{x}_k$, which provides direct lexical information about the word. Following the assumption in Murty et al. ([2023](https://arxiv.org/html/2605.15562#bib.bib14)), since $\mathbf{h}_{k-1}^{L/2}$ and $\mathbf{h}_{k-1}^{L-1}$ are hidden states from sufficiently deep layers used to predict the $k$-th token, they capture useful information about the $k$-th token rather than the $(k-1)$-th token. We do not use the final-layer hidden states, so that they remain dedicated exclusively to next-token prediction. Note that we do not use input embeddings and hidden states computed after $\mathbf{x}_k$, so that we can predict all dependencies of $w_i$ before processing $\mathbf{x}_k$ and can thus infuse structural information from the dependency graph into the hidden states of the tokens in $w_i$.

![Refer to caption](https://arxiv.org/html/2605.15562v1/Figures/update_of_gk_3.png) Figure 1: Illustration of how the feature tape is recomputed when generating a sentence and constructing its dependency graph. Rows in $G_2$ and $G_3$ from top to bottom correspond to Degree, Distance and Depth, respectively. We set $m_{in} = 1$ and $m_{out} = 10$ for this example. As *dogs* is predicted, one dependency is added to the graph.

We follow the biaffine parsing approach (Dozat and Manning, [2018](https://arxiv.org/html/2605.15562#bib.bib23)) to compute the probability $p_{ij}$ of the dependency from the word $w_i$ to $w_j$. Note that for the root node, we use a learnable vector as its word representation:

$$\begin{aligned}
\tilde{\mathbf{o}}^{par}_{i} &= \text{MLP}_{par}^{2}\left(\text{MLP}_{par}^{1}(\mathbf{o}_{i}) + \mathbf{pe}_{ii}\right) \\
\tilde{\mathbf{o}}^{chd}_{j} &= \text{MLP}_{chd}^{2}\left(\text{MLP}_{chd}^{1}(\mathbf{o}_{j}) + \mathbf{pe}_{ij}\right) \\
p_{ij} &= \sigma\left({\tilde{\mathbf{o}}^{par}_{i}}^{\top} \mathbf{W}_{p} \tilde{\mathbf{o}}^{chd}_{j}\right)
\end{aligned} \tag{2}$$

where $\mathbf{o}_i$ is the word representation of $w_i$ as defined above, $\mathbf{W}_{p} \in \mathbb{R}^{d \times d}$ is a learnable matrix, $\text{MLP}^{1/2}_{par/chd}$ denotes the first/second MLP for computing the parent/child representations $\tilde{\mathbf{o}}_{i}^{par/chd} \in \mathbb{R}^{d}$, $\sigma$ denotes the sigmoid function, and $\mathbf{pe}_{ij}$ denotes the positional embedding, which is the sum of the sinusoidal encoding of $|i-j|$ and the embedding of the graph-based feature tape $G_i$ (see Section [3.3](https://arxiv.org/html/2605.15562#S3.SS3)).
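The last line of Eq. 2 can be illustrated with a minimal pure-Python sketch. It assumes the parent and child vectors have already been produced by the MLPs and positional embeddings, which are omitted here; the function names `sigmoid` and `biaffine_prob` and the toy values are hypothetical and not taken from the released code.

```python
import math

def sigmoid(z):
    """Logistic function, written to stay stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def biaffine_prob(o_par, W_p, o_chd):
    """Return sigma(o_par^T W_p o_chd) for length-d vectors and a d x d matrix."""
    d = len(o_par)
    # Matrix-vector product W_p @ o_chd.
    Wo = [sum(W_p[r][c] * o_chd[c] for c in range(d)) for r in range(d)]
    score = sum(o_par[r] * Wo[r] for r in range(d))
    return sigmoid(score)

# Toy example with d = 2 and an identity W_p, so the score is a plain dot product.
W_identity = [[1.0, 0.0], [0.0, 1.0]]
p_aligned = biaffine_prob([1.0, 2.0], W_identity, [1.0, 2.0])     # score = 5
p_orthogonal = biaffine_prob([1.0, 0.0], W_identity, [0.0, 1.0])  # score = 0
```

Because the output is a sigmoid rather than a softmax over heads, each candidate dependency is scored independently, which is what allows a word to have several parents in a graph.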

### 3.2 Graph Update

Given dependency probabilities $\{p_{ij}, p_{ji}, p_{ii}\}$ with $j \in \{0, \cdots, i-1\}$ for all possible dependencies involving the $i$-th word $w_i$, a straightforward way to greedily update the dependency graph is to add every dependency whose probability exceeds 0.5. However, this becomes computationally intractable when we employ beam search over dependency graphs (Section [3.5](https://arxiv.org/html/2605.15562#S3.SS5.SSS0.Px2)) because of the exponentially large search space. To address this issue, we consider a restricted subspace of dependency graphs by using the following two-step method. For $w_i$:

(i) We predict the number of dependencies $c_i \in \{0, 1, \cdots, C\}$, where $C$ is a constant upper bound:

$$\begin{aligned}
\mathbf{s} &= \sum_{j=0}^{i} \tilde{\mathbf{o}}^{par}_{i} \odot \mathbf{W}_{s} \tilde{\mathbf{o}}^{chd}_{j} + \sum_{j=0}^{i-1} \tilde{\mathbf{o}}^{par}_{j} \odot \mathbf{W}_{s} \tilde{\mathbf{o}}^{chd}_{i} \\
\boldsymbol{\pi}_{i} &:= [\pi_{i}^{0}, \pi_{i}^{1}, \cdots, \pi_{i}^{C}]^{T} = \text{softmax}\left(\mathbf{W}_{a}^{\top}\left(\frac{\mathbf{s}}{\sqrt{2i-1}}\right) + \mathbf{b}_{a}\right)
\end{aligned} \tag{3}$$

where $\odot$ denotes the Hadamard product, $\mathbf{W}_{a} \in \mathbb{R}^{(C+1) \times d}$, $\mathbf{W}_{s} \in \mathbb{R}^{d \times d}$ and $\mathbf{b}_{a} \in \mathbb{R}^{C+1}$ are learnable parameters, and $\boldsymbol{\pi}_{i} \in \mathbb{R}^{C+1}$ is the probability distribution over $\{0, 1, \cdots, C\}$. To normalize the variance, we scale $\mathbf{s}$ by $\sqrt{2i-1}$.

(ii) The value of $c_i$ can be obtained through either greedy decoding (choosing the most probable value) or sampling from $\boldsymbol{\pi}_{i}$. We then pick the $c_i$ highest-scoring dependencies and add them to the dependency graph. This two-step method reduces the search space from exponential to linear at each step of beam search.
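The greedy variant of the two-step update can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_dependencies`, the dictionary layout for candidate dependencies, and the toy probabilities are hypothetical, and the sampling variant and beam search are left out.

```python
def select_dependencies(count_probs, dep_probs):
    """Greedy two-step update for one word.

    count_probs: list of length C + 1, the distribution pi_i over dependency counts.
    dep_probs: dict mapping a candidate dependency (head, dependent) to its probability.
    Returns the c_i highest-scoring candidate dependencies.
    """
    # Step (i): greedy choice of the dependency count c_i.
    c_i = max(range(len(count_probs)), key=lambda c: count_probs[c])
    # Step (ii): keep the c_i most probable candidates.
    ranked = sorted(dep_probs, key=dep_probs.get, reverse=True)
    return ranked[:c_i]

# Word 3 with candidates involving the root (0) and the earlier words 1 and 2.
pi_3 = [0.1, 0.2, 0.6, 0.1]  # the most likely count is 2
candidates = {(0, 3): 0.9, (1, 3): 0.3, (3, 2): 0.7, (2, 3): 0.1}
chosen = select_dependencies(pi_3, candidates)
```

In a beam search, step (i) would instead keep several likely values of the count, so each hypothesis branches into at most $C + 1$ successors rather than one per subset of candidate dependencies.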

### 3.3 Feature Extraction

Given the input sequence $x_{<k}$, where $x_k$ is the first token of the $i$-th word $w_i$ as defined in Section [3.1](https://arxiv.org/html/2605.15562#S3.SS1), we extract features from the partially constructed dependency graph and form a graph-based feature tape $G_k = [g_{1k}, g_{2k}, \cdots, g_{kk}] \in \mathbb{N}^{3 \times k}$ for the token $x_k$. Note that the graph is word-level, whereas $G_k$ is indexed by tokens. Therefore, for any tokens $i, j$ that belong to the same word, $g_{ik} = g_{jk}$ in $G_k$.

The feature tape involves three graph\-based features: \(i\) degree, an attribute for each word; \(ii\) distance, measuring connectivity between words; \(iii\) depth, reflecting the global structure of the graph\.

#### Degree\.

The degree of a word refers to the number of its incoming and outgoing dependencies, denoted as $c_{in}$ and $c_{out}$ respectively. By definition, $c_{in} + c_{out}$ is the degree of each word, but empirically we find that a weighted sum achieves better performance (see Section [4.4](https://arxiv.org/html/2605.15562#S4.SS4)): we assign weight $m_{in} \in \mathbb{Z}^{+}$ to the in-degree and $m_{out} \in \mathbb{Z}^{+}$ to the out-degree, where $0 < m_{in} < m_{out}$, and set the degree to $m_{out} c_{out} + m_{in} c_{in}$.
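A minimal sketch of the weighted degree feature, assuming dependencies are stored as `(head, dependent)` pairs so that an edge counts as outgoing for the head and incoming for the dependent. The function name `weighted_degrees` and this edge representation are hypothetical; the defaults $m_{in} = 1$ and $m_{out} = 10$ are the values used in the Figure 1 example.

```python
def weighted_degrees(num_words, edges, m_in=1, m_out=10):
    """Return m_out * c_out + m_in * c_in for every word.

    edges: list of (head, dependent) index pairs (an assumed representation).
    """
    c_in = [0] * num_words
    c_out = [0] * num_words
    for head, dep in edges:
        c_out[head] += 1  # outgoing dependency of the head
        c_in[dep] += 1    # incoming dependency of the dependent
    return [m_out * c_out[w] + m_in * c_in[w] for w in range(num_words)]

# Root 0 governs words 1 and 2, and word 1 also governs word 2.
degrees = weighted_degrees(3, [(0, 1), (1, 2), (0, 2)])
```

With these weights, the two digits of the result separate the out-degree from the in-degree as long as a word has fewer than ten incoming dependencies, which may be one reason a weighted sum is more informative than the plain degree.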

#### Distance\.

The distance from word $w_i$ to word $w_j$ is the length of the weighted shortest path on the current dependency graph. When traversing a dependency along its direction, we weight it by $m_{out}$; against the direction, we use $m_{in}$, thereby encoding dependency direction information into the distance measure. Intuitively, the distance measures the relevance of the two words. Specifically, the distance recorded in $g_{1k}$ is measured from the word of token $x_k$ to the word of token $x_1$.
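The direction-weighted distance can be sketched with Dijkstra's algorithm. This is an illustrative sketch, not the authors' implementation: the name `weighted_distances` and the `(head, dependent)` edge representation are hypothetical, and marking unreachable words with `None` is an assumption, since the encoding of that case is not specified in the text above.

```python
import heapq

def weighted_distances(num_words, edges, source, m_in=1, m_out=10):
    """Dijkstra from `source` over (head, dependent) edges.

    Moving head -> dependent costs m_out, moving dependent -> head costs m_in.
    Unreachable words keep the value None (an assumption of this sketch).
    """
    adj = [[] for _ in range(num_words)]
    for head, dep in edges:
        adj[head].append((dep, m_out))  # along the dependency direction
        adj[dep].append((head, m_in))   # against the dependency direction
    dist = [None] * num_words
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if dist[v] is None or nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Chain 0 -> 1 -> 2, with word 3 not yet attached to the graph.
dists_from_2 = weighted_distances(4, [(0, 1), (1, 2)], source=2)
dists_from_0 = weighted_distances(4, [(0, 1), (1, 2)], source=0)
```

The asymmetry is visible in the example: walking from word 2 up to the root costs 2, while walking from the root down to word 2 costs 20.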

#### Depth\.

The dependency graphs used in our work are all rooted\. We define the depth of a word to be the length of the shortest path to the root plus one in the undirected backbone of the dependency graph\. Since the dependency graph is partial, a word may be disconnected from the root and we set its depth to be 0\. We can compute the depths of all the words with breadth\-first search starting from the root\. A visited flag ensures that each word is processed exactly once\.
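The depth computation described above can be sketched as a breadth-first search over the undirected backbone. The function name `depths` and the `(head, dependent)` edge list are hypothetical; assigning depth 1 to the root itself follows from reading "path length plus one" literally and is an assumption of this sketch.

```python
from collections import deque

def depths(num_words, edges, root=0):
    """Depth = shortest undirected path length to the root plus one; 0 if disconnected."""
    adj = [[] for _ in range(num_words)]
    for head, dep in edges:
        adj[head].append(dep)  # undirected backbone: add both directions
        adj[dep].append(head)
    depth = [0] * num_words
    visited = [False] * num_words
    visited[root] = True
    depth[root] = 1  # path of length 0 to itself, plus one
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if not visited[v]:  # the visited flag processes each word once
                visited[v] = True
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth

# Word 3 reaches the root only through word 2; word 4 is still disconnected.
word_depths = depths(5, [(0, 1), (1, 2), (3, 2)])
```

Reserving 0 for disconnected words keeps the feature a natural number, so it can index an embedding table like the other two features.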

In Figure[1](https://arxiv.org/html/2605.15562#S3.F1), we illustrate the feature tapes when generating an example sentence\.

### 3.4 Computing Attention Scores

We incorporate information in the graph-based feature tape $G_k$ into the Transformer LM by modifying the self-attention module. Specifically, we first map $G_k \in \mathbb{N}^{3 \times k}$ via a learned embedding layer onto a global embedding $\tilde{\mathbf{e}}_{k} \in \mathbb{R}^{3 \times k \times \tilde{d}}$. For each Transformer layer $l$, we apply a linear projection $f_l$ for feature fusion, that is, $\mathbf{e}^{l}_{k} = f_{l}(\tilde{\mathbf{e}}_{k}) \in \mathbb{R}^{k \times d}$. For each token position $j \in \{0, 1, \cdots, k\}$, the corresponding fused graph feature $\mathbf{e}^{l}_{kj} \in \mathbb{R}^{d}$ is directly added to its key in attention computation:

$$\tilde{\alpha}^{l}_{kj} = [\mathbf{h}^{l}_{j} + \mathbf{e}^{l}_{kj}]^{\top} \mathbf{W}_{k}^{\top} \mathbf{W}_{q} \mathbf{h}_{k}^{l} \tag{4}$$

In this work, we follow the practice of Transformer-XL (TXL; Dai et al., [2019](https://arxiv.org/html/2605.15562#bib.bib21)) for attention computation, so we additionally transform Equation [4](https://arxiv.org/html/2605.15562#S3.E4) as follows.

$$\begin{aligned}
\tilde{\alpha}^{l}_{kj} &= {\mathbf{h}^{l}_{j}}^{\top} \mathbf{W}_{k,c}^{\top} \mathbf{W}_{q} \mathbf{h}^{l}_{k} \\
&+ (\mathbf{r}_{kj} + \mathbf{e}^{l}_{kj})^{\top} \mathbf{W}_{k,r}^{\top} \mathbf{W}_{q} \mathbf{h}^{l}_{k} \\
&+ u^{\top} \mathbf{W}_{k,c} \mathbf{h}^{l}_{j} + v^{\top} \mathbf{W}_{k,r} (\mathbf{r}_{kj} + \mathbf{e}^{l}_{kj}),
\end{aligned} \tag{5}$$

where $\mathbf{r}_{kj}$ is the sinusoidal encoding of $|k-j|$, $\mathbf{W}_{k,c}$ and $\mathbf{W}_{k,r}$ are the key matrices for content and relative-position representations respectively, and $u$ and $v$ are learnable bias vectors. As in Equation [4](https://arxiv.org/html/2605.15562#S3.E4), the query is the hidden state $\mathbf{h}^{l}_{k}$ of the current token and the key content is the hidden state $\mathbf{h}^{l}_{j}$ of the attended token.
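To make the key-side modulation of Eq. 4 concrete, the following pure-Python sketch computes a single pre-softmax score. It is a toy illustration under stated assumptions: the names `matvec` and `graph_infused_score` are hypothetical, identity matrices stand in for the learned projections, and the Transformer-XL decomposition of Eq. 5 is not implemented.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def graph_infused_score(h_j, e_kj, h_k, W_k, W_q):
    """Pre-softmax score of Eq. 4, computed as (W_k (h_j + e_kj)) . (W_q h_k)."""
    key = matvec(W_k, [a + b for a, b in zip(h_j, e_kj)])
    query = matvec(W_q, h_k)
    return sum(a * b for a, b in zip(key, query))

I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity stand-ins for W_k and W_q
# Without a graph feature the score is the plain dot product h_j . h_k.
plain = graph_infused_score([1.0, 0.0], [0.0, 0.0], [1.0, 1.0], I2, I2)
# A graph feature on the key shifts the score for this (k, j) pair only.
boosted = graph_infused_score([1.0, 0.0], [0.0, 2.0], [1.0, 1.0], I2, I2)
```

Since the feature is added to the key rather than to the token embedding, the same token can receive a different score for each query position $k$ as the graph grows.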

### 3.5 Training and Inference

#### Training\.

Given a corpus of strings annotated with dependency graphs, we precompute the graph-based feature tape $G_k$ for each prefix $x_{\leq k}$ based on the ground-truth dependencies over $x_{\leq k}$. After this preprocessing step, GiLT can be trained in parallel like a standard Transformer. During training, teacher forcing is applied to both token prediction and dependency prediction. Given a string $x$ with $N$ tokens and $M$ words and its ground-truth dependency graph, we derive a sequence of $M$ one-hot ground-truth vectors $\{\hat{\boldsymbol{\pi}}_{1}, \cdots, \hat{\boldsymbol{\pi}}_{M}\}$ indicating the number of dependencies of each word, and a matrix $\hat{p}$ where $\hat{p}_{ij} = 1$ iff there is a dependency from word $i$ to word $j$. The training loss function is defined as follows:

$$\begin{aligned}
\mathcal{L} &= \alpha \frac{1}{N} \sum_{i=1}^{N} \text{CE}(x_{i}, \hat{x}_{i}) \\
&+ \beta \frac{1}{M^{2}} \sum_{j=1}^{M} \sum_{k=1}^{M} \text{BCE}(p_{jk}, \hat{p}_{jk}) \\
&+ \gamma \frac{1}{M} \sum_{k=1}^{M} \text{CE}(\boldsymbol{\pi}_{k}, \hat{\boldsymbol{\pi}}_{k})
\end{aligned} \tag{6}$$

where $\alpha$, $\beta$, $\gamma$ are constant coefficients, and $p_{jk}$ and $\boldsymbol{\pi}_{k}$ are from Eq. [2](https://arxiv.org/html/2605.15562#S3.E2) and Eq. [3](https://arxiv.org/html/2605.15562#S3.E3) respectively.
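The three terms of Eq. 6 can be sketched in plain Python as below. This is a toy illustration, not the training code: the function names and the list-based inputs are hypothetical, gold labels are given as indices rather than one-hot vectors, and all probabilities are assumed to lie strictly between 0 and 1. The default coefficients are the values reported in the experimental setup.

```python
import math

def cross_entropy(probs, gold_index):
    """Cross-entropy of a predicted distribution against a one-hot gold label."""
    return -math.log(probs[gold_index])

def binary_cross_entropy(p, gold):
    """Binary cross-entropy for one candidate dependency (gold is 0 or 1)."""
    return -(gold * math.log(p) + (1 - gold) * math.log(1 - p))

def gilt_loss(token_probs, gold_tokens, dep_probs, gold_deps,
              count_probs, gold_counts, alpha=1.0, beta=0.2, gamma=0.2):
    """Eq. 6: token CE + dependency BCE + dependency-count CE, each averaged."""
    n, m = len(gold_tokens), len(gold_counts)
    lm = sum(cross_entropy(p, g) for p, g in zip(token_probs, gold_tokens)) / n
    dep = sum(binary_cross_entropy(dep_probs[j][k], gold_deps[j][k])
              for j in range(m) for k in range(m)) / (m * m)
    cnt = sum(cross_entropy(p, g) for p, g in zip(count_probs, gold_counts)) / m
    return alpha * lm + beta * dep + gamma * cnt

# Two tokens forming two words, a vocabulary of size 2, and C = 1.
loss = gilt_loss(
    token_probs=[[0.5, 0.5], [0.25, 0.75]], gold_tokens=[0, 1],
    dep_probs=[[0.5, 0.5], [0.5, 0.5]], gold_deps=[[0, 1], [0, 0]],
    count_probs=[[0.5, 0.5], [0.5, 0.5]], gold_counts=[1, 0],
)
```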

#### Inference\.

GiLT jointly generates a string $x$ and its dependency graph $y$. As discussed in Section [3.2](https://arxiv.org/html/2605.15562#S3.SS2), GiLT considers only a subspace $Y$ of dependency graphs and models uncertainty over the subspace through uncertainty over the numbers of dependencies of the words in $x$. Therefore, the joint probability $p(x, y)$ is computed as follows.

$$\begin{aligned}
p(x, y) &= \prod_{k=1}^{N} p(x_{k} \mid x_{<k}; G_{k-1}(y)) \\
&\times \prod_{j=1}^{M} p(c_{j}(y) \mid x_{\leq z_{j}}; G_{z_{j}}(y))
\end{aligned} \tag{7}$$

where $N$ and $M$ are the numbers of tokens and words in $x$, $c_{j}(y)$ is the number of dependencies of the $j$-th word in graph $y$, $z_{j}$ is the index of the last token within the $j$-th word, and $G_{k}(y)$ is the feature tape for token $x_k$ based on the subgraph of $y$ containing only the words corresponding to the first $k$ tokens.

Computing the marginal $p(x) = \sum_{y \in Y} p(x, y)$ is computationally intractable due to the huge space of all possible graphs. Following Murty et al. ([2023](https://arxiv.org/html/2605.15562#bib.bib14)), we approximate it by marginalizing (summing) over a relatively small set of dependency graphs produced via beam search (i.e., retaining multiple most likely values of $c_j$ and their corresponding dependencies for every beam). Because the sum runs over a subset of $Y$ and every term is non-negative, the marginal probability approximated in this way is a lower bound on its true value. Note that since GiLT does not generate extra tokens representing parsing actions in the output sequence, we do not need the complicated word-synchronous beam search decoding (Stern et al., [2017](https://arxiv.org/html/2605.15562#bib.bib24)) that has been widely used in previous syntactic LMs.
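The beam-based approximation of the marginal can be sketched as a log-sum-exp over the joint log-probabilities of the graphs kept in the beam. The function names and toy numbers below are hypothetical, and normalizing perplexity by the token count is an assumption of this sketch rather than a detail confirmed by the text.

```python
import math

def log_marginal_lower_bound(beam_log_joints):
    """log of the sum of p(x, y) over the graphs y kept in the beam (log-sum-exp)."""
    m = max(beam_log_joints)
    return m + math.log(sum(math.exp(v - m) for v in beam_log_joints))

def perplexity_upper_bound(sentence_log_bounds, num_tokens):
    """Perplexity from lower bounds on log p(x); a lower log-prob gives a higher PPL."""
    return math.exp(-sum(sentence_log_bounds) / num_tokens)

# Three graphs survive in the beam for one five-token sentence.
beam = [math.log(0.02), math.log(0.01), math.log(0.005)]
log_px = log_marginal_lower_bound(beam)
ppl = perplexity_upper_bound([log_px], num_tokens=5)
```

Graphs dropped from the beam can only add probability mass, so the estimate of $p(x)$ never exceeds the true marginal and the resulting perplexity never falls below the true perplexity.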

## 4 Experiments

### 4.1 Sentence-Level Language Modeling

#### Dataset and preprocessing\.

We use the BLLIP-LG dataset of Charniak et al. ([2000](https://arxiv.org/html/2605.15562#bib.bib20)), with training splits from Hu et al. ([2020](https://arxiv.org/html/2605.15562#bib.bib19)). We obtain annotated PSD, PAS and DM dependency graphs with unlabeled dependencies by parsing the dataset with ACE (Wang et al., [2021](https://arxiv.org/html/2605.15562#bib.bib28)). Since a dependency tree can be seen as a special case of a dependency graph, we also obtain unlabeled projective dependency trees with the Biaffine-RoBERTa parser (Dozat and Manning, [2017](https://arxiv.org/html/2605.15562#bib.bib29)), following Zhao et al. ([2024](https://arxiv.org/html/2605.15562#bib.bib10)). Tokenization is performed with the same scheme as in Sartran et al. ([2022](https://arxiv.org/html/2605.15562#bib.bib15)), using SentencePiece (Kudo and Richardson, [2018](https://arxiv.org/html/2605.15562#bib.bib30)). We follow Murty et al. ([2023](https://arxiv.org/html/2605.15562#bib.bib14)) and model each sentence independently.

#### Setup\.

We evaluate the perplexity (PPL) of the models on the BLLIP-LG dataset. We train a 16-layer TXL (Dai et al., [2019](https://arxiv.org/html/2605.15562#bib.bib21)) language model of 252M parameters as the baseline. We also reimplement Pushdown-LM (Murty et al., [2023](https://arxiv.org/html/2605.15562#bib.bib14)) on the TXL code base for fair comparison. We train GiLT on PSD, DM and PAS dependency graphs respectively, resulting in three models: GiLT-PSD, GiLT-DM and GiLT-PAS. We also train GiLT on dependency parse trees, resulting in GiLT-DP. In addition, we compare our models with constituency-based and dependency-based syntactic Transformer language models, including: (i) Parsing as Language Model (PLM and PLM-Mask) of Qian et al. ([2021](https://arxiv.org/html/2605.15562#bib.bib13)), (ii) Transformer Grammars (TG) of Sartran et al. ([2022](https://arxiv.org/html/2605.15562#bib.bib15)), and (iii) Dependency Transformer Grammars (DTG) of Zhao et al. ([2024](https://arxiv.org/html/2605.15562#bib.bib10)).

Table 1: Results on language modeling and syntactic generalization. The best values over models that do not add extra tokens are bold. Overall best values are underlined. All PPL results except for that of TXL are approximated upper bounds. SG scores are computed without the "other" suite. ♠ denotes that the PPL is taken from the original paper. ♣ denotes that the result is evaluated with the full BLiMP dataset reported in the original paper.

We set the hyperparameters for GiLT to $m_{in} = 1$, $m_{out} = 10$, $\alpha = 1$, $\beta = 0.2$, $\gamma = 0.2$, $\tilde{d} = 256$, and $d = 1024$. This results in 268M parameters for the Transformer and 54M parameters for the GiLT modules described in Sections [3.1](https://arxiv.org/html/2605.15562#S3.SS1)-[3.4](https://arxiv.org/html/2605.15562#S3.SS4). For TXL, the same hyperparameter configuration (model size, dropout, learning rate schedulers) is used as in Zhao et al. ([2024](https://arxiv.org/html/2605.15562#bib.bib10)). For perplexity computation, we apply beam search over dependency graphs with a beam size $b$ of 300, the same as Pushdown-LM (Murty et al., [2023](https://arxiv.org/html/2605.15562#bib.bib14)), to estimate the perplexity upper bound of GiLT. To account for the additional parameters in GiLT in comparison with the baseline, we also train a TXL-Large with 22 layers (6 more than the base TXL, resulting in 334M parameters) to investigate the impact of scaling up model size alone.

#### Result\.

We report the PPL of all models in Table [1](https://arxiv.org/html/2605.15562#S4.T1). Some syntactic language models, such as PLM and TG, introduce syntactic inductive bias at the cost of language modeling performance. On the other hand, Pushdown-LM achieves the best PPL and even outperforms the baseline, confirming the previous observation that Pushdown-LM excels in language modeling among syntactic LMs (Murty et al., [2023](https://arxiv.org/html/2605.15562#bib.bib14)). The three GiLT models based on dependency graphs maintain PPL values comparable to both Pushdown-LM and the baseline. In contrast, GiLT-DP exhibits a higher PPL, highlighting the limitations of tree structures compared to more flexible graph-based structures.

### 4.2 Syntactic Generalization

We evaluate syntactic generalization on BLiMP (Warstadt et al., [2020](https://arxiv.org/html/2605.15562#bib.bib31)) and the SG test suites (Hu et al., [2020](https://arxiv.org/html/2605.15562#bib.bib19)).

#### Setup\.

We use the same baseline models as in Section[4\.1](https://arxiv.org/html/2605.15562#S4.SS1)\. For BLiMP, models are evaluated on their ability to assign higher probability to grammatical sentences than to their minimally altered ungrammatical counterparts\. The reported BLiMP score is the percentage of such pairs where the model succeeds, i\.e\., where the grammatical sentence receives a higher probability\. Due to limited computational resources, we evaluate on a 10% subset of the BLiMP dataset, selecting every tenth example \(i\.e\., the 1st, 11th, 21st, etc\.\)\. We find that doing this only leads to very small perturbations to the scores \(e\.g\., 0\.2 for TXL\)\.
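The subsampling and scoring described above can be sketched as follows. The function names `every_tenth` and `blimp_score` and the toy log-probabilities are hypothetical; in practice each log-probability would come from the model's (approximate) marginal $p(x)$.

```python
def every_tenth(examples):
    """Keep the 1st, 11th, 21st, ... examples of a list."""
    return examples[::10]

def blimp_score(pairs):
    """Percentage of pairs where the grammatical sentence is more probable.

    pairs: list of (log p(grammatical), log p(ungrammatical)) tuples.
    """
    wins = sum(1 for good, bad in pairs if good > bad)
    return 100.0 * wins / len(pairs)

# Example ids 1..100 reduce to ten ids; three of the four toy pairs are wins.
subset = every_tenth(list(range(1, 101)))
score = blimp_score([(-10.0, -12.0), (-8.0, -7.5), (-20.0, -25.0), (-5.0, -9.0)])
```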

The SG test suites include seven syntactic phenomenon classes. For each suite, the model is evaluated on its ability to satisfy a predefined inequality concerning the probability of generating a target span. We report the per-suite satisfaction rate as a percentage (i.e., the fraction of inequalities that hold) and then compute the SG score as the macro-average of six of these rates. We exclude the "other" suite because it contains only a single sentence, does not correspond to any specific syntactic phenomenon, and disproportionately influences the macro-average.

For 10%BLiMP and SG, we use beam search over dependency graphs to compute both the marginal probability $p(x)$ and the conditional probability $p(x_{t} \mid x_{<t})$.

![Refer to caption](https://arxiv.org/html/2605.15562v1/Figures/SG2.png)

Figure 2: Scores on the 6 circuits of the SG test suites among models without extra tokens.

#### Result.

The results are presented in Table [1](https://arxiv.org/html/2605.15562#S4.T1). Although TXL-Large has more parameters than our models, its improvements are marginal, indicating that simply scaling up TXL without syntactic inductive bias fails to improve syntactic generalization. GiLT-PSD outperforms most models on both tests, surpassing the baseline by 0.6 points on 10%BLiMP and 7.6 points on SG. GiLT-DP achieves the best 10%BLiMP performance, matching that of DTG. In contrast, Pushdown-LM exhibits better SG performance but worse 10%BLiMP performance than the baseline.

### 4.3 Finetuning on Pretrained LM

Since GiLT does not change the symbol space of the Transformer LM, it can be finetuned from any pretrained language model on any dataset annotated with dependency graphs to introduce syntactic inductive bias. We therefore evaluate GiLT by starting from a pretrained GPT2 model and finetuning it on downstream tasks. We also want to know whether the finetuned models still exhibit better syntactic generalization, so we additionally evaluate them on BLiMP and the SG test suites.

Table 2: Results when Post-GPT2 and GiLT-GPT2 are separately finetuned on each downstream task.

#### Setup.

We use the pretrained GPT2-medium (355M) (Radford et al., [2019](https://arxiv.org/html/2605.15562#bib.bib40)) as the base model. We create GiLT-GPT2 by replacing its last 12 vanilla transformer layers with our Graph-Infused layers and finetune GiLT-GPT2 on BLLIP-LG. We evaluate the language understanding ability of GiLT-GPT2 on four downstream text classification tasks from GLUE (Wang et al., [2018](https://arxiv.org/html/2605.15562#bib.bib32)): RTE, SST2, MRPC and STS-B. Each task is transformed into a language modeling task via prompting (details are provided in Appendix [A](https://arxiv.org/html/2605.15562#A1)). We use GiLT-GPT2 to parse each prompt and then finetune the model with the parsed dependency graph on text classification. For a fair comparison, we apply the same workflow to vanilla GPT2-medium, i.e., finetuning on BLLIP-LG to obtain Post-GPT2, and then further finetuning Post-GPT2 on downstream task data.

Table 3: Results of pretrained GPT2, Post-GPT2 and GiLT-GPT2 on BLiMP and the SG test suites.

#### Result.

In Table [2](https://arxiv.org/html/2605.15562#S4.T2), we report F1 scores for SST2 and RTE, accuracy/F1 for MRPC, and Pearson/Spearman correlation for STS-B. GiLT-GPT2 outperforms Post-GPT2 on all tasks, suggesting generally enhanced language understanding capabilities. It is also worth noting that our parsing process uses a modest beam size of 20 (compared to 300 in language modeling), yet achieves good task performance. Furthermore, GiLT-GPT2 still surpasses Post-GPT2 on both BLiMP and SG, as shown in Table [3](https://arxiv.org/html/2605.15562#S4.T3), which reflects the strong syntactic generalization ability of GiLT.

### 4.4 Ablation Study

We design five controlled ablations: (1) –degree: removal of degree from the feature tape, (2) –depth: removal of depth, (3) –distance: removal of distance, (4) –weights of degree: removal of the degree weighting coefficients $m_{in}$ and $m_{out}$, and (5) –weights of distance: removal of the distance weighting coefficients $m_{in}$ and $m_{out}$. Each ablated model is re-trained from scratch separately. We use GiLT-PSD as the base model and evaluate on PPL, SG and 10%BLiMP as in previous sections.

As shown in Table [4](https://arxiv.org/html/2605.15562#S4.T4), the PPLs of the six settings are at the same level, but their syntactic generalization performance can differ considerably. Notably, the –degree, –depth and –weights of distance models achieve better 10%BLiMP scores than the GiLT-PSD base model, but they yield significantly lower SG scores. The other two ablation settings degrade on both 10%BLiMP and SG. The ablation study indicates that each feature captures a distinct linguistic aspect and that their combination leads to more robust overall performance.

Table 4: The results of the ablation study.

Table 5: The results of the generation speed test. CUDA memory consumption is measured in GB.

### 4.5 Efficiency Comparison

We have observed the strong performance of DTG in Table [1](https://arxiv.org/html/2605.15562#S4.T1). However, DTG introduces additional structural tokens, which slows down inference. We measure the speed of DTG and our model on a small set of sentences when using beam search over dependency graphs with the same beam size $b$. We also measure the greedy decoding speed of TXL as a baseline, which does not require beam search over sentence structures and can be seen as using a beam size of 1. GiLT is only slightly slower than TXL when $b=1$, showing the low overhead of our extra module. Compared with DTG, GiLT remains significantly faster and more memory efficient. As $b$ increases, speed degrades and GPU memory usage grows, yet both worsen markedly more slowly for GiLT than for DTG. Note that when $b=300$, we were unable to complete DTG’s inference on our NVIDIA A6000 GPU due to excessive memory demands.

![Refer to caption](https://arxiv.org/html/2605.15562v1/Figures/showcase3.png)

![Refer to caption](https://arxiv.org/html/2605.15562v1/Figures/showcase_dep_graph.png)

Figure 3: Left: visualization of attention scores of the first head in the last layer of GiLT (left) and TXL (right) given the input “Writing long reports every week is boring.” Right: the PSD dependency graph predicted by GiLT-PSD, which also serves as the silver dependency graph of the given input.

### 4.6 Case Study

We obtain the attention scores of both GiLT-PSD and TXL and the PSD graph predicted by GiLT-PSD for the input “Writing long reports every week is boring”, and visualize them in Figure [3](https://arxiv.org/html/2605.15562#S4.F3). First, GiLT-PSD correctly predicts every dependency of the PSD graph. As for the attention scores, TXL consistently assigns a large proportion of attention to the most recent noun and fails to identify the subject of the sentence. In contrast, GiLT correctly focuses on “Writing” when the input is “is”, since “Writing” is the governing word of the subject. When the input is “week”, GiLT distributes attention more evenly over “long reports” and “every week” (besides the attention sink), since both phrases modify “Writing”.

## 5 Related Work

There have been studies on leveraging recursive linguistic structural (symbolic) information for sequential language modeling. Among syntactic LMs with neural architectures, RNNGs (Dyer et al., [2016](https://arxiv.org/html/2605.15562#bib.bib5)) jointly model the syntactic structure and words by integrating top-down transition-based constituency parsing into a recursive neural network, while recent studies (Qian et al., [2021](https://arxiv.org/html/2605.15562#bib.bib13); Yoshida and Oseki, [2022](https://arxiv.org/html/2605.15562#bib.bib16); Sartran et al., [2022](https://arxiv.org/html/2605.15562#bib.bib15)) have applied this approach to Transformers, explicitly modeling a syntactic tree along with words by imposing hard constraints over attention masks to simulate the shift/compose operations in transition-based parsing. Hu et al. ([2024](https://arxiv.org/html/2605.15562#bib.bib39)) further explore an unsupervised training framework for constituency-based syntactic LMs, showing the potential of training syntactic LMs at scale.

In addition to the constituency-based models mentioned above, studies on models based on dependency tree structures (Buys and Blunsom, [2015](https://arxiv.org/html/2605.15562#bib.bib11); Mirowski and Vlachos, [2015](https://arxiv.org/html/2605.15562#bib.bib12)) also achieve improved syntactic generalization performance. A recent example is Dependency Transformer Grammars (Zhao et al., [2024](https://arxiv.org/html/2605.15562#bib.bib10)), which employs a constrained attention pattern similar to Sartran et al. ([2022](https://arxiv.org/html/2605.15562#bib.bib15)) to encourage head-dependent representation learning.

Both constituency-based and dependency-based studies incorporate the inductive bias of symbolic structures into the self-attention mechanism by regulating the attention masks dynamically. Some other studies also focus on adapting the self-attention modules, or both (Wang et al., [2019](https://arxiv.org/html/2605.15562#bib.bib25); Peng et al., [2019](https://arxiv.org/html/2605.15562#bib.bib26); Deshpande and Narasimhan, [2020](https://arxiv.org/html/2605.15562#bib.bib27); Murty et al., [2023](https://arxiv.org/html/2605.15562#bib.bib14)), whereas our work follows the conventions of adaptation, modifying the self-attention module by incorporating dependency graph feature representations without changing the input or output space of Transformer LMs.

These models show considerable performance in generalizing syntactic information via recursion as tree structures. However, most of these studies focus solely on trees rather than a more general and flexible form: graphs. One notable work (Prange et al., [2022](https://arxiv.org/html/2605.15562#bib.bib22)) proposes a model that exploits information from both syntactic and semantic graphs. However, it only introduces graph-informed language modeling without actually modeling the explicit symbolic structure: gold syntax and semantics are needed for both training and test-time inference of the model. Semantic graphs are also employed to guide the model in other fields such as machine translation and visual tasks, but these studies directly apply the gold signals for model augmentation from semantic graphs instead of encoding the graphs into the model (Aue et al., [2004](https://arxiv.org/html/2605.15562#bib.bib17); Ke et al., [2024](https://arxiv.org/html/2605.15562#bib.bib18)). GiLT differs from these models as we model graphs in the Transformer LM, and we can incrementally build a graph along with the next token prediction without graph supervision during inference.

## 6 Conclusion

We propose GiLT, a novel type of syntactic language model that incorporates dependency graphs, a more general and flexible form of linguistic structural information than traditional syntactic tree structures, into Transformers. GiLT jointly predicts tokens and dependencies, incrementally constructing a dependency graph and using features extracted from it to modulate attention scores. Experiments show that GiLT achieves enhanced syntactic generalization without introducing extra tokens and with minimal impact on perplexity. Additionally, finetuning GiLT from pretrained language models improves language understanding performance on several downstream tasks. These results demonstrate that GiLT can effectively construct dependency graphs of generated sentences and extract their structural information to serve as inductive bias for language modeling. Our future work is discussed in Appendix [C](https://arxiv.org/html/2605.15562#A3).

## Limitations

During inference, we rely on beam search over dependency graphs to estimate the marginalized probability, which yields only a lower bound. Although our dependency population space is constant and independent of the sequence length, beam search over dependency graphs remains computationally expensive.
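The lower-bound property follows because the beam retains only a subset of the possible graphs and every joint probability is non-negative. Writing $\mathcal{G}(x)$ for the set of all dependency graphs over $x$ and $\mathcal{B}(x) \subseteq \mathcal{G}(x)$ for the graphs kept by the beam (both symbols are notation introduced here for illustration, not taken from the paper):

$$
p(x) \;=\; \sum_{G \in \mathcal{G}(x)} p(x, G) \;\ge\; \sum_{G \in \mathcal{B}(x)} p(x, G).
$$

The gap between the two sides is the total probability mass of the pruned graphs, so a larger beam can only tighten the estimate.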

Additionally, the discussion in Appendix [B](https://arxiv.org/html/2605.15562#A2) suggests that the performance limitations observed for GiLT-DP are primarily due to the under-utilization of tree properties in our graph-based modeling approach. This insight highlights the potential for further research to focus on better integrating the inherent properties of graphs, such as the presence of multiple heads, to improve the model’s overall performance and effectiveness.

## 7 Acknowledgments

This work was supported by the robotic AI-Scientist platform of the Chinese Academy of Sciences, the HPC platform of ShanghaiTech University, and the Core Facility Platform of Computer Science and Communication, SIST, ShanghaiTech University.

## References

- A. Aue, A. Menezes, B. Moore, C. Quirk, and E. Ringger (2004) Statistical machine translation using labeled semantic dependency graphs. In Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Baltimore, Maryland. External Links: [Link](https://aclanthology.org/2004.tmi-1.14/).
- J. Buys and P. Blunsom (2015) Generative incremental dependency parsing with neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), C. Zong and M. Strube (Eds.), Beijing, China, pp. 863–869. External Links: [Link](https://aclanthology.org/P15-2142/), [Document](https://dx.doi.org/10.3115/v1/P15-2142).
- E. Charniak, D. Blaheta, N. Ge, K. Hall, J. Hale, and M. Johnson (2000) BLLIP 1987-89 WSJ Corpus Release 1. Linguistic Data Consortium. External Links: [Link](https://catalog.ldc.upenn.edu/LDC2000T43), [Document](https://dx.doi.org/10.35111/FWEW-DA58).
- D. K. Choe and E. Charniak (2016) Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, J. Su, K. Duh, and X. Carreras (Eds.), Austin, Texas, pp. 2331–2336. External Links: [Link](https://aclanthology.org/D16-1257/), [Document](https://dx.doi.org/10.18653/v1/D16-1257).
- Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrquez (Eds.), Florence, Italy, pp. 2978–2988. External Links: [Link](https://aclanthology.org/P19-1285/), [Document](https://dx.doi.org/10.18653/v1/P19-1285).
- A. Deshpande and K. Narasimhan (2020) Guiding attention for self-supervised learning with transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 4676–4686. External Links: [Link](https://aclanthology.org/2020.findings-emnlp.419/), [Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.419).
- T. Dozat and C. D. Manning (2017) Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=Hk95PK9le).
- T. Dozat and C. D. Manning (2018) Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), I. Gurevych and Y. Miyao (Eds.), Melbourne, Australia, pp. 484–490. External Links: [Link](https://aclanthology.org/P18-2077/), [Document](https://dx.doi.org/10.18653/v1/P18-2077).
- C. Dyer, A. Kuncoro, M. Ballesteros, and N. A. Smith (2016) Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K. Knight, A. Nenkova, and O. Rambow (Eds.), San Diego, California, pp. 199–209. External Links: [Link](https://aclanthology.org/N16-1024/), [Document](https://dx.doi.org/10.18653/v1/N16-1024).
- D. Flickinger (2000) On building a more effcient grammar by exploiting types. Nat. Lang. Eng. 6(1), pp. 15–28. External Links: [Link](http://journals.cambridge.org/action/displayAbstract?aid=58601).
- D. Flickinger, Y. Zhang, and V. Kordoni (2012) DeepBank: a dynamically annotated treebank of the wall street journal. In Proceedings of the Eleventh International Workshop on Treebanks and Linguistic Theories. International Workshop on Treebanks and Linguistic Theories (TLT-11), 11th, November 30-December 1, Lisbon, Portugal, pp. 85–96.
- J. Hajic, E. Hajicová, J. Panevová, P. Sgall, O. Bojar, S. Cinková, E. Fučíková, M. Mikulová, P. Pajas, J. Popelka, J. Semecký, J. Šindlerová, J. Štépánek, J. Toman, Z. Urešová, and Z. Žabokrtský (2012) Announcing prague czech-english dependency treebank 2.0. In International Conference on Language Resources and Evaluation. External Links: [Link](https://api.semanticscholar.org/CorpusID:14944936).
- J. Hu, J. Gauthier, P. Qian, E. Wilcox, and R. Levy (2020) A systematic assessment of syntactic generalization in neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 1725–1744. External Links: [Link](https://aclanthology.org/2020.acl-main.158/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.158).
- X. Hu, P. Ji, Q. Zhu, W. Wu, and K. Tu (2024) Generative pretrained structured transformers: unsupervised syntactic language models at scale. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 2640–2657. External Links: [Link](https://aclanthology.org/2024.acl-long.145/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.145).
- J. Ke, Z. Wen, Y. Yang, C. Cui, Y. Ren, X. Pu, and L. He (2024) Integrating vision-language semantic graphs in multi-view clustering. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, K. Larson (Ed.), pp. 4273–4281. Note: Main Track. External Links: [Document](https://dx.doi.org/10.24963/ijcai.2024/472), [Link](https://doi.org/10.24963/ijcai.2024/472).
- Y. Kim, A. Rush, L. Yu, A. Kuncoro, C. Dyer, and G. Melis (2019) Unsupervised recurrent neural network grammars. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 1105–1117. External Links: [Link](https://aclanthology.org/N19-1114/), [Document](https://dx.doi.org/10.18653/v1/N19-1114).
- T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, E. Blanco and W. Lu (Eds.), Brussels, Belgium, pp. 66–71. External Links: [Link](https://aclanthology.org/D18-2012/), [Document](https://dx.doi.org/10.18653/v1/D18-2012).
- P. Mirowski and A. Vlachos (2015) Dependency recurrent neural language models for sentence completion. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), C. Zong and M. Strube (Eds.), Beijing, China, pp. 511–517. External Links: [Link](https://aclanthology.org/P15-2084/), [Document](https://dx.doi.org/10.3115/v1/P15-2084).
- Y. Miyao (2006) From linguistic theory to syntactic analysis : corpus-oriented grammar development and feature forest model. Ph.D. Thesis, University of Tokyo. Note: Unpublished doctoral dissertation. External Links: [Link](https://api.semanticscholar.org/CorpusID:124072316).
- S. Murty, P. Sharma, J. Andreas, and C. Manning (2023) Pushdown layers: encoding recursive structure in transformer language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali (Eds.), Singapore, pp. 3233–3247. External Links: [Link](https://aclanthology.org/2023.emnlp-main.195/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.195).
- H. Noji and Y. Oseki (2021) Effective batching for recurrent neural network grammars. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Online, pp. 4340–4352. External Links: [Link](https://aclanthology.org/2021.findings-acl.380/), [Document](https://dx.doi.org/10.18653/v1/2021.findings-acl.380).
- S. Oepen, M. Kuhlmann, Y. Miyao, D. Zeman, S. Cinková, D. Flickinger, J. Hajič, and Z. Urešová (2015) SemEval 2015 task 18: broad-coverage semantic dependency parsing. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), P. Nakov, T. Zesch, D. Cer, and D. Jurgens (Eds.), Denver, Colorado, pp. 915–926. External Links: [Link](https://aclanthology.org/S15-2153/), [Document](https://dx.doi.org/10.18653/v1/S15-2153).
- M. Palmer, D. Gildea, and P. Kingsbury (2005) The proposition bank: an annotated corpus of semantic roles. Computational Linguistics 31(1), pp. 71–106. External Links: ISSN 0891-2017, [Document](https://dx.doi.org/10.1162/0891201053630264), [Link](https://doi.org/10.1162/0891201053630264), https://direct.mit.edu/coli/article-pdf/31/1/71/1798172/0891201053630264.pdf.
- H. Peng, R. Schwartz, and N. A. Smith (2019) PaLM: a hybrid parser and language model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 3644–3651. External Links: [Link](https://aclanthology.org/D19-1376/), [Document](https://dx.doi.org/10.18653/v1/D19-1376).
- J. Prange, N. Schneider, and L. Kong (2022) Linguistic frameworks go toe-to-toe at neuro-symbolic language modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, M. Carpuat, M. de Marneffe, and I. V. Meza Ruiz (Eds.), Seattle, United States, pp. 4375–4391. External Links: [Link](https://aclanthology.org/2022.naacl-main.325/), [Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.325).
- P. Qian, T. Naseem, R. Levy, and R. Fernandez Astudillo (2021) Structural guidance for transformer language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Online, pp. 3735–3745. External Links: [Link](https://aclanthology.org/2021.acl-long.289/), [Document](https://dx.doi.org/10.18653/v1/2021.acl-long.289).
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Technical Report, OpenAI.
- L. Sartran, S. Barrett, A. Kuncoro, M. Stanojević, P. Blunsom, and C. Dyer (2022) Transformer grammars: augmenting transformer language models with syntactic inductive biases at scale. Transactions of the Association for Computational Linguistics 10, pp. 1423–1439. External Links: ISSN 2307-387X, [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00526), [Link](https://doi.org/10.1162/tacl_a_00526), https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00526/2064617/tacl_a_00526.pdf.
- M. Stern, D. Fried, and D. Klein (2017) Effective inference for generative neural parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, M. Palmer, R. Hwa, and S. Riedel (Eds.), Copenhagen, Denmark, pp. 1695–1700. External Links: [Link](https://aclanthology.org/D17-1178/), [Document](https://dx.doi.org/10.18653/v1/D17-1178).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is All you Need. In Advances in Neural Information Processing Systems, Vol. 30.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, T. Linzen, G. Chrupała, and A. Alishahi (Eds.), Brussels, Belgium, pp. 353–355. External Links: [Link](https://aclanthology.org/W18-5446/), [Document](https://dx.doi.org/10.18653/v1/W18-5446).
- X. Wang, Y. Jiang, N. Bach, T. Wang, Z. Huang, F. Huang, and K. Tu (2021) Automated concatenation of embeddings for structured prediction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Online, pp. 2643–2660. External Links: [Link](https://aclanthology.org/2021.acl-long.206/), [Document](https://dx.doi.org/10.18653/v1/2021.acl-long.206).
- Y. Wang, H. Lee, and Y. Chen (2019) Tree transformer: integrating tree structures into self-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 1061–1070. External Links: [Link](https://aclanthology.org/D19-1098/), [Document](https://dx.doi.org/10.18653/v1/D19-1098).
- A. Warstadt, A. Parrish, H. Liu, A. Mohananey, W. Peng, S. Wang, and S. R. Bowman (2020) BLiMP: the benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics 8, pp. 377–392. External Links: [Link](https://aclanthology.org/2020.tacl-1.25/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00321).
- R. Yoshida and Y. Oseki (2022) Composition, attention, or both?. In Findings of the Association for Computational Linguistics: EMNLP 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang (Eds.), Abu Dhabi, United Arab Emirates, pp. 5822–5834. External Links: [Link](https://aclanthology.org/2022.findings-emnlp.428/), [Document](https://dx.doi.org/10.18653/v1/2022.findings-emnlp.428).
- Y. Zhao, C. Lou, and K. Tu (2024) Dependency transformer grammars: integrating dependency structures into transformer language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 1543–1556. External Links: [Link](https://aclanthology.org/2024.acl-long.84/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.84).

## Appendix A Other Experimental Details

#### Hyperparameters for finetuning

To obtain Post-GPT2, we finetune pretrained GPT2 on BLLIP-LG with a batch size of 64, 5000 warmup steps, a cosine decay schedule, and a maximum learning rate of 3e-5. To obtain GiLT-GPT2, we assign a larger maximum learning rate of 1.5e-4 to the new parameters. Since the newly initialized parameters disrupt the semantics of hidden states, we first train the language model alone for 5000 steps before training it jointly with the biaffine module using the same configuration as above.
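A warmup-plus-cosine schedule of this kind can be sketched as below. The text does not specify the warmup shape, the final learning rate, or the total number of steps, so the linear warmup, the decay to zero, and `TOTAL_STEPS` are all assumptions made for illustration.

```python
import math


def lr_at(step: int, max_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Values taken from the text, except TOTAL_STEPS, which is a placeholder.
WARMUP_STEPS = 5000
TOTAL_STEPS = 50_000
PRETRAINED_MAX_LR = 3e-5   # pretrained GPT2 parameters
NEW_PARAM_MAX_LR = 1.5e-4  # newly initialized GiLT parameters
```

With two parameter groups, the same schedule shape would be evaluated once per group, each with its own maximum learning rate.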

For downstream tasks, we use a batch size of 64 and a fixed learning rate of 7.5e-6. We choose the best model based on performance on the validation set. We use the following prompts to convert each text classification task into a language modeling task:

- RTE: Given the input sentence pair $(s_1, s_2)$, we use the prompt *Sentence1:{$s_1$}; Sentence2:{$s_2$}; Label:{$l$}*, where $l \in \{0, 1\}$.
- MRPC: Given the input sentence pair $(s_1, s_2)$, we construct the prompt *Sentence1:{$s_1$}; Sentence2:{$s_2$}; Label:{$l$}*, where $l \in \{\text{inequivalent}, \text{equivalent}\}$.
- SST2: Given the string $s_1$ and label $l$, the prompt is *Sentence1:{$s_1$}; Sentiment:{$l$}*, where $l \in \{0, 1\}$.
- STS-B: Given the sentence pair $(s_1, s_2)$, we create the prompt *Sentence1:{$s_1$}; Sentence2:{$s_2$}; Score:*. We use the final hidden states to train a linear regression model, which is trained jointly with the LM.
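The four templates above can be filled programmatically, as in the following sketch. It assumes that the braces in the templates mark placeholders rather than literal characters; `TEMPLATES` and `build_prompt` are hypothetical names, not part of the released code.

```python
from typing import Optional

# Templates copied from the list above; braces are treated as placeholders.
TEMPLATES = {
    "RTE": "Sentence1:{s1}; Sentence2:{s2}; Label:{label}",
    "MRPC": "Sentence1:{s1}; Sentence2:{s2}; Label:{label}",
    "SST2": "Sentence1:{s1}; Sentiment:{label}",
    "STS-B": "Sentence1:{s1}; Sentence2:{s2}; Score:",
}


def build_prompt(task: str, s1: str, s2: Optional[str] = None, label: str = "") -> str:
    """Fill the task template; `label` is left empty at prediction time."""
    if task not in TEMPLATES:
        raise KeyError(f"unknown task: {task}")
    if task != "SST2" and s2 is None:
        raise ValueError(f"{task} needs a sentence pair")
    return TEMPLATES[task].format(s1=s1, s2=s2, label=label)
```

For example, `build_prompt("MRPC", a, b, "equivalent")` yields a training string, while omitting `label` yields the prefix the model is asked to complete.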

#### Computational costs

We use PyTorch version 2.7.0 for all experiments. For the language modeling experiments, each training run used one NVIDIA A6000 GPU and took about 50 hours. For the finetuning experiments, each training run used one NVIDIA H800 GPU and took less than 1 hour per task.

## Appendix B Discussion on different parsing

Comparing the metrics in Table [1](https://arxiv.org/html/2605.15562#S4.T1), we find that the models trained on the different datasets rank by perplexity, from best to worst, as PSD, DM, PAS, and DP. The other metrics roughly follow the same order.

Table 6: Average number of dependencies per sentence in the different SDP datasets based on BLLIP-LG, and the reported perplexity of each model from Section [4.1](https://arxiv.org/html/2605.15562#S4.SS1).

We calculate the average number of dependencies in the graphs and report the results in Table [6](https://arxiv.org/html/2605.15562#A2.T6). Surprisingly, the fewer dependencies the model needs to establish, the better it performs. This is likely because fewer dependencies introduce less noise from the silver dependency graphs, and simpler graphs are probably easier to model.

The exception is GiLT-PAS, which outperforms GiLT-DP even though PAS has more dependencies on average than DP. The performance degradation on the DP dataset is not unexpected, as the dependency graphs for DP are essentially trees: because we relax the constraints in our model, it is less able to leverage the unique recursive properties of trees. This suggests that while GiLT can handle more dependencies in graphs with relatively minor performance degradation, it has limitations in effectively utilizing tree structures, a specific type of graph.
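The per-sentence dependency statistic discussed in this appendix is a simple mean over graphs. The sketch below shows one way to compute it, assuming each graph is stored as a list of (head, dependent, label) edges; this representation, the function name `average_dependencies`, and the toy graphs are illustrative only and are not the paper's data.

```python
def average_dependencies(graphs):
    """Mean number of dependency edges per sentence.

    `graphs` holds one edge list per sentence; each edge is a
    (head_index, dependent_index, label) tuple (assumed format).
    """
    if not graphs:
        raise ValueError("need at least one graph")
    return sum(len(edges) for edges in graphs) / len(graphs)


# Toy graphs with placeholder labels, not taken from any dataset.
toy_graphs = [
    [(2, 1, "ARG1"), (2, 3, "ARG2")],  # sentence with two edges
    [(1, 2, "ARG1")],                  # sentence with one edge
    [],                                # sentence with no edges
]

if __name__ == "__main__":
    print(average_dependencies(toy_graphs))
```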

## Appendix C Future work

For future work, we plan to explore the potential of the feature tape for jointly modeling multiple types of dependency graphs. This presents a significant challenge for both effective training and efficient inference. Furthermore, we consider unsupervised training for GiLT as another promising direction.
