G^2C-MT: Graph-Guided Context Selection for Document-Level Machine Translation


Summary

Proposes G²C-MT, a graph-guided context selection framework for document-level machine translation that models structured discourse dependencies via a lightweight discourse graph and depth-biased random walk, outperforming baselines on multiple LLMs.

arXiv:2606.03078v1

Cached at: 06/03/26, 09:36 AM

# G²C-MT: Graph-Guided Context Selection for Document-Level Machine Translation
Source: [https://arxiv.org/html/2606.03078](https://arxiv.org/html/2606.03078)

Zixuan Zhou, Xiangyu Duan, Yu Liu, Longbo Sun, Rupu Wei, Bohong Zhao

School of Computer Science and Technology, Soochow University; Trip.com Group

cocaer.cl@gmail.com, zxzhou1213@stu.suda.edu.cn, xyduan@suda.edu.cn, {liu.yub, lbsun, rpwei, bohongzhao}@trip.com

###### Abstract

Effective document\-level machine translation \(DocMT\) requires capturing long\-range discourse dependencies\. Recent work has explored retrieval\-based and discourse\-aware context selection\. However, these approaches often lack an explicit mechanism for modeling structured discourse dependencies between distant paragraphs in a document\. In this paper, we propose G²C\-MT \(Graph\-Guided Context for Machine Translation\), which views DocMT context selection as a structured path discovery problem on a lightweight discourse graph, rather than retrieving unstructured context sets or relying on expensive LLM\-based discourse modeling\. In detail, we represent each paragraph as a node and model the relationship between each pair of nodes, considering their semantic similarity, adjacency, and keyword overlap\. Furthermore, we propose a depth\-biased random walk over the graph to sample a backward context path for each target paragraph\. The context path will be used to prompt a large language model \(LLM\) for translation\. This framework naturally supports multi\-path context sampling, which can improve robustness by aggregating diverse translation candidates for discourse\-ambiguous inputs\. Experiments conducted across various domains show that G²C\-MT outperforms strong baselines on multiple LLMs, including DeepSeek\-V3, Gemini\-2\.5\-Flash\-lite, and the Qwen\-2\.5/3 series\.

[Figure 1 diagram. Stage 1, Directed Discourse Graph Construction: nodes $x_1,\dots,x_5$ linked by sequential ($\beta$), semantic ($\alpha$), and keyword ($\gamma$) edges, with $w_{ij}=\alpha S_{sem}+\beta S_{seq}+\gamma S_{key}$. Stage 2, Depth-Biased Path Context Sampling: $\phi(x_3)=2$, $P(v_{next}\mid v)\propto w\cdot(\phi(v_{next})+1)^{\lambda}$, sampled path $x_5\rightarrow x_3\rightarrow x_1$. Stage 3, Path-Based Contextual Generation: the prompt $I_5$ combines the instruction $\mathcal{I}$, the context pairs $(x_1,y_1)$ and $(x_3,y_3)$, and the input $x_5$; the LLM $\mathcal{M}$ produces $y_5$.]

Figure 1: Overview of the G²C-MT framework. The process involves three stages: (1) constructing a discourse graph that considers semantic ($\alpha$), sequential ($\beta$), and keyword ($\gamma$) cohesion; (2) traversing a context path via a depth-biased random walk (backtracking from the target to its history), where nodes with a higher depth potential $\phi$ (e.g., $x_3$) and a higher edge score attract the walker; (3) formatting the path into a structured prompt for translation.

## 1 Introduction

High-quality document-level machine translation requires more than accurate sentence-level translation. It must also preserve discourse phenomena such as lexical consistency and coreference resolution, so capturing long-range dependencies is essential. Although recent advances in LLMs have shown success in handling long contexts Liu et al. ([2023](https://arxiv.org/html/2606.03078#bib.bib6)); Chen et al. ([2023](https://arxiv.org/html/2606.03078#bib.bib9)); Gao et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib10)), translating an entire document in a single pass often leads to issues such as sentence omissions or context dilution Wang et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib7)). Moreover, exposing the model to the whole document context is inefficient, since the decoding cost of LLMs increases quadratically with the length of the input text. To address this issue, recent studies have employed retrieval-based and graph-based strategies to select previously translated paragraphs as context. Retrieval-based methods select historical translations according to semantic similarity in order to capture long-range dependencies Wang et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib7)). However, these methods often fail to preserve explicit discourse structure, as they treat sentences as an unstructured collection. Existing graph-based methods, in turn, depend on expensive edge definitions obtained via LLM-based relation classification, and they are usually limited to selecting first-order neighbors as historical context Dutta et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib1)); Pham et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib8)). This restriction prevents them from capturing the deep, multi-hop discourse paths that define the global document structure.

To overcome these limitations, we propose G²C-MT, a novel Graph-Guided Context for Machine Translation framework. Rather than retrieving sentences in isolation or relying on expensive graph construction, we model the document's discourse structure as a weighted directed acyclic graph (DAG) using a lightweight construction procedure. Specifically, each paragraph serves as a node, while edges denote the relationships between paragraphs. These relationships are quantified through a fusion score derived from semantic similarity, sequential adjacency, and lexical overlap. This combined metric enables our framework to model the document as a graph with complex semantic relations rather than a simple linear chain.

Based on the constructed discourse graph, we apply a graph-driven method to dynamically select historical context for document-level translation. Specifically, when translating a target paragraph, we perform a backward, biased random walk starting from the corresponding node. The walk identifies a related context path composed of previous paragraphs together with their translations. Unlike previous approaches that restrict context to immediate neighbors Dutta et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib1)), we guide the traversal with two complementary signals: local edge weights that encode semantic and lexical relevance, and a global signal that encourages traversing nodes that can generate longer and richer discourse chains. This design enables the model to select a single, structured context path that captures long-range, non-linear dependencies while remaining computationally efficient. The selected path is subsequently formatted into a discourse-aware prompt, enabling the model to exploit structured contextual information without processing the entire text. Moreover, since the traversal is probabilistic rather than deterministic, G²C-MT naturally supports multi-path sampling. By exploring multiple plausible discourse paths and aggregating the resulting translations, the framework can improve robustness in the presence of discourse-level ambiguity.

Our main contributions are as follows:

- We propose G²C-MT, a novel framework that models document context as a DAG, which can be built in a lightweight way. The graph can capture non-linear discourse dependencies effectively by modeling semantic and sequential correlations.
- We design a biased random walk mechanism that explicitly favors deeper context paths containing richer discourse information. This probabilistic traversal also inherently supports multi-path sampling, which can enhance robustness when the target paragraph involves discourse ambiguities.
- We conduct extensive experiments on various document-level translation benchmarks using LLMs of different scales. Our method outperforms strong baselines in both translation quality and coherence. Further analysis confirms the effectiveness of our graph-guided approach in capturing long-range dependencies.

## 2 Methodology

As illustrated in Figure [1](https://arxiv.org/html/2606.03078#S0.F1) and Algorithm [1](https://arxiv.org/html/2606.03078#alg1), the overall pipeline of our method can be divided into the following three stages:

1. Directed Discourse Graph Construction, which treats each paragraph as a node and builds edges between paragraphs considering their multi-dimensional relevance.
2. Depth-Biased Context Sampling, a stochastic process to backtrack previously translated paragraphs and favor deeper context paths.
3. Path-Based Contextual Generation, which formats these context paths as the discourse information to prompt LLMs.

### 2.1 Directed Discourse Graph Construction

We first model the source document $D=\{x_1,x_2,\dots,x_N\}$ as a weighted directed acyclic graph $G=(V,E)$. The directed edge $e_{ij}$ connects a target node $v_i$ to a previous node $v_j$, where $j<i$. Edges point only backward because future translations $y_i$ (where $i>j$) are unavailable at step $j$, which is consistent with a human translating the document sentence by sentence. Defining edges this way also rules out cycles, which reduces the complexity of traversal.

The weight $w_{ij}$ of edge $e_{ij}$ quantifies the relevance of paragraph $x_j$ to $x_i$, calculated as a fusion of three discourse-related factors:

$$w_{ij}=\alpha\cdot S_{sem}(i,j)+\beta\cdot S_{seq}(i,j)+\gamma\cdot S_{key}(i,j), \tag{1}$$

where $\alpha$, $\beta$, and $\gamma$ are coefficients that sum to 1 and balance the three factors. The specific definitions are as follows:

##### Semantic Relevance ($S_{sem}$).

Global coherence relies on thematic consistency. We map each paragraph $x_i$ into a dense vector space via a pre-trained embedding model, denoted as $\mathbf{h}_i$. Then we compute the cosine similarity between these vectors. To prevent the graph from becoming too dense and introducing extra noise, we introduce a threshold $\tau_{\text{sem}}$ to discard edges with low correlation:

$$S_{sem}(i,j)=\max\left(0,\frac{\mathbf{h}_i^{\top}\mathbf{h}_j}{|\mathbf{h}_i||\mathbf{h}_j|}-\tau_{\text{sem}}\right) \tag{2}$$

##### Sequential Adjacency ($S_{seq}$).

The paragraph being translated is most closely related to its adjacent paragraphs. For example, in dialogue questionnaires or background introductions, adjacent contextual information is essential for reference resolution and for preserving logical coherence. Therefore, an adjacency edge is naturally introduced and assigned a fixed weight:

$$S_{seq}(i,j)=\mathbb{I}(j=i-1) \tag{3}$$

where $\mathbb{I}(\cdot)$ is the indicator function. This guarantees that the local context is always considered a candidate.

##### Keyword Overlap ($S_{key}$).

Semantic-based retrieval may miss paragraphs that contain overlapping keywords but have low semantic similarity. These omissions may lead to inconsistent translation of terms, such as proper nouns. We alleviate this problem by introducing a keyword overlap score $S_{key}(i,j)$. Specifically, $\mathcal{K}_i$ denotes the set of top-$K$ keywords in $x_i$ extracted via TF-IDF. We then calculate the lexical score from the degree of keyword overlap:

$$S_{key}(i,j)=\sum_{t\in\mathcal{K}_{ij}}\frac{\psi(t,x_i)+\psi(t,x_j)}{2} \tag{4}$$

where $\mathcal{K}_{ij}=\mathcal{K}_i\cap\mathcal{K}_j$ denotes the intersection of keywords, and $\psi(t,x)$ represents the TF-IDF score of term $t$ in paragraph $x$.
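As a rough illustration of Eqs. 1 to 4, the following Python sketch computes fused edge weights from precomputed paragraph embeddings and a simple TF-IDF. The function names, the whitespace tokenization, the plain $\log(N/\mathrm{df})$ IDF variant, and the default `top_k` are assumptions made for this example rather than details stated in the paper; the default coefficients and threshold mirror the values reported in the experimental settings.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tfidf_scores(paragraphs):
    """TF-IDF per paragraph, treating each paragraph as one document.

    Assumption: whitespace tokenization and idf = log(N / df).
    """
    tokenized = [p.lower().split() for p in paragraphs]
    n_docs = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    return [
        {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def top_keywords(scores, top_k):
    """Set of the top_k highest-scoring terms of one paragraph."""
    return set(sorted(scores, key=scores.get, reverse=True)[:top_k])


def edge_weight(i, j, emb, scores, keywords,
                alpha=0.2, beta=0.3, gamma=0.5, tau_sem=0.6):
    """Fusion weight w_ij (Eq. 1) for a backward edge from i to j, with j < i."""
    s_sem = max(0.0, cosine(emb[i], emb[j]) - tau_sem)              # Eq. 2
    s_seq = 1.0 if j == i - 1 else 0.0                              # Eq. 3
    shared = keywords[i] & keywords[j]
    s_key = sum((scores[i][t] + scores[j][t]) / 2 for t in shared)  # Eq. 4
    return alpha * s_sem + beta * s_seq + gamma * s_key


def build_graph(paragraphs, emb, top_k=5, **weights):
    """Return {i: {j: w_ij}} over all j < i, keeping only positive weights."""
    scores = tfidf_scores(paragraphs)
    keywords = [top_keywords(s, top_k) for s in scores]
    graph = {i: {} for i in range(len(paragraphs))}
    for i in range(len(paragraphs)):
        for j in range(i):
            w = edge_weight(i, j, emb, scores, keywords, **weights)
            if w > 0:
                graph[i][j] = w
    return graph


# Toy data: hand-made 2-d vectors stand in for real sentence embeddings.
paragraphs = ["posting date ledger", "open the menu", "posting key ledger"]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
graph = build_graph(paragraphs, embeddings)
print(graph)
```

In this toy document, paragraph 2 is linked to paragraph 0 purely through the shared keywords, even though their stand-in embeddings are orthogonal, which is the situation the keyword term is meant to cover.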

### 2.2 Depth-Biased Context Path Sampling

Once the discourse graph is built, we can backtrack the context path, starting from the given target paragraph $x_i$. The choice of backtracking strategy matters. The most straightforward method is greedy search, which always selects the neighbor node with the highest weight. However, we find that this method sometimes terminates prematurely, or tends to keep following a single type of edge, such as repeatedly traversing edges with highly similar semantics, which results in redundancy. To address this, we propose a sampling strategy that balances edge relevance against the depth of the context path in order to exploit global structural information.

#### 2.2.1 Depth Heuristic

We introduce the concept of a Depth Heuristic to estimate the informational richness of a context node. The Depth Heuristic $\phi(v_j)$ is the length of the longest path that can be backtracked from node $v_j$. We argue that this backtrace depth represents how much historical translation context can be provided from a given node $v_j$. We can efficiently calculate $\phi(v_j)$ via dynamic programming:

$$\phi(v_j)=1+\max_{v_k\in\mathcal{B}(v_j)}\phi(v_k) \tag{5}$$

where $\phi(v_{start})=1$ and $\mathcal{B}(v_j)$ denotes the set of backward neighbors of $v_j$.
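The recurrence in Eq. 5 can be evaluated in a single pass, because every edge points from a higher index to a lower one and ascending index order is therefore a topological order. The sketch below assumes the graph is stored as a dictionary `{i: {j: w_ij}}` in which every node appears as a key; this representation and the function name are illustrative choices, not the authors' implementation.

```python
def depth_heuristic(graph):
    """Compute phi for every node of a backward DAG.

    graph maps each node index i to {j: w_ij} with j < i; every node must
    appear as a key, even when it has no backward neighbors.
    """
    phi = {}
    for i in sorted(graph):  # ascending index is a valid topological order
        # A node without backward neighbors gets phi = 1 (the base case).
        phi[i] = 1 + max((phi[j] for j in graph[i]), default=0)
    return phi


toy_graph = {0: {}, 1: {0: 0.3}, 2: {0: 0.1}, 3: {1: 0.2, 2: 0.5}}
print(depth_heuristic(toy_graph))  # {0: 1, 1: 2, 2: 2, 3: 3}
```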

#### 2.2.2 Probabilistic Sampling

We apply the random walk mechanism for context selection to construct the path $\mathcal{P}_i=(v_{p_1},v_{p_2},\dots,v_{p_L})$, starting from the target $v_{p_1}=v_i$. Given the current node $v_{curr}$, the walker transitions to a previous node $v_{next}$, which is sampled from its neighborhood $\mathcal{N}(v_{curr})$ defined above. The transition probability is as follows:

$$P(v_{next}\mid v_{curr})=\frac{w_{curr,next}\cdot(\phi(v_{next})+1)^{\lambda}}{Z} \tag{6}$$

where $Z$ is the partition function for normalization, and the term $(\phi(v_{next})+1)^{\lambda}$ introduces a bias towards deeper structures. The hyperparameter $\lambda\geq 0$ controls the strength of the depth bias. When $\lambda=0$, the traversal reduces to a standard random walk based on edge weights. Increasing $\lambda$ favors nodes with higher depth, which encourages the retrieval of long-range context.
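A minimal sketch of the sampling step in Eq. 6 is shown below, reusing the dictionary graph representation `{i: {j: w_ij}}` and a precomputed `phi` table. The helper names, the `max_len` stopping rule, and the seeded generator are assumptions made for illustration; the paper's additional cap on walking distance is omitted here.

```python
import random


def transition_probs(curr, graph, phi, lam=2.0):
    """Normalized transition probabilities out of node curr (Eq. 6)."""
    scores = {j: w * (phi[j] + 1) ** lam for j, w in graph[curr].items()}
    z = sum(scores.values())  # partition function Z
    return {j: s / z for j, s in scores.items()}


def sample_path(target, graph, phi, lam=2.0, max_len=4, rng=None):
    """Sample a backward path [target, p_2, ...] with at most max_len context nodes."""
    rng = rng or random.Random()
    path = [target]
    curr = target
    # Stop when the length cap is reached or the node has no backward neighbors.
    while len(path) - 1 < max_len and graph[curr]:
        probs = transition_probs(curr, graph, phi, lam)
        nodes = list(probs)
        curr = rng.choices(nodes, weights=[probs[n] for n in nodes], k=1)[0]
        path.append(curr)
    return path


toy_graph = {0: {}, 1: {0: 0.3}, 2: {0: 0.1}, 3: {1: 0.2, 2: 0.5}}
toy_phi = {0: 1, 1: 2, 2: 2, 3: 3}
print(transition_probs(3, toy_graph, toy_phi))
print(sample_path(3, toy_graph, toy_phi, rng=random.Random(0)))
```

Because nodes 1 and 2 have the same depth in this toy graph, the depth term cancels in the normalization and the probabilities out of node 3 depend only on the edge weights; the bias only changes the distribution when candidate nodes differ in $\phi$.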

### 2.3 Path-Based Contextual Generation

After completing the traversal, we employ the sampled discourse path $\mathcal{P}_i$ for in-context learning by prompting LLMs. We reverse the path to restore the natural document order, and the final prompt $I_i$ is constructed as follows:

$$I_i=\mathcal{I}\oplus[(x_{p_L},y_{p_L})\oplus\dots\oplus(x_{p_2},y_{p_2})]\oplus x_i \tag{7}$$

where $\mathcal{I}$ denotes the translation instruction and $\oplus$ represents string concatenation. The pair $(x_k,y_k)$ denotes the previous source paragraph and its translation corresponding to the node $v_k$.
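The prompt assembly of Eq. 7 amounts to dropping the target from the sampled path, reversing the remaining nodes, and concatenating the pairs. The sketch below shows one way to do this; the `Source:`/`Translation:` template and the function name are hypothetical, since the exact prompt wording is not given in this section.

```python
def build_prompt(path, sources, translations, instruction):
    """Assemble a prompt from a backward path [target, p_2, ..., p_L].

    sources and translations map node index -> text. The layout used here
    is an illustrative template, not the prompt used in the paper.
    """
    target = path[0]
    context_nodes = list(reversed(path[1:]))  # restore document order
    parts = [instruction]
    for k in context_nodes:
        parts.append(f"Source: {sources[k]}\nTranslation: {translations[k]}")
    parts.append(f"Source: {sources[target]}\nTranslation:")
    return "\n\n".join(parts)


sources = {0: "para zero", 2: "para two", 4: "para four"}
translations = {0: "trans zero", 2: "trans two"}
print(build_prompt([4, 2, 0], sources, translations, "Translate into Vietnamese."))
```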

##### Multi-Path Sampling.

Since our method is based on a random walk, each traversal can yield a different context path. We can sample $K$ independent context paths $\{\mathcal{P}_i^{(1)},\dots,\mathcal{P}_i^{(K)}\}$ for the target paragraph $x_i$ and then generate $K$ candidate translations $\{y_i^{(1)},\dots,y_i^{(K)}\}$ accordingly. The final translation can be determined via a majority voting mechanism or by selecting the candidate with the lowest perplexity. In this paper, we cluster the $K$ candidates via k-means and select the representative candidate closest to the cluster centroid, which also proves to be a simple yet effective strategy.
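The selection step can be pictured with a simplified stand-in: instead of full k-means, the sketch below treats all candidates as a single cluster and returns the one whose embedding lies closest to the mean vector. This single-centroid simplification, the Euclidean distance, and the toy embeddings are assumptions for illustration and do not reproduce the exact clustering configuration used in the paper.

```python
import math


def closest_to_centroid(candidates, embeddings):
    """Return the candidate whose embedding is nearest to the mean embedding.

    Simplification: one cluster only (equivalent to k-means with k = 1).
    """
    dim = len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]

    def dist(e):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(e, centroid)))

    best = min(range(len(candidates)), key=lambda i: dist(embeddings[i]))
    return candidates[best]


candidates = ["candidate A", "candidate B", "candidate C"]
candidate_embeddings = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
print(closest_to_centroid(candidates, candidate_embeddings))  # candidate B
```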

##### Complexity Analysis.

The time cost of graph construction lies primarily in embedding computation and TF-IDF keyword matching, leading to an overall time complexity of $O(N^2)$. In practice, this is a one-time cost of under 10 s per typical document (e.g., $N\approx 200$ sentences on a single CPU), and each random walk completes in milliseconds. At inference time, G²C-MT requires exactly $N$ LLM calls for $N$ paragraphs, the same as the Window-Based baseline, since graph construction and the walk are pre-processing steps that do not invoke the LLM. Moreover, paragraph pairs whose similarity falls below the cutoff $\tau_{\text{sem}}$ receive no semantic edge, which keeps the graph sparse.

Algorithm 1: G²C-MT: Graph-Guided Contextual Translation

1. Input: Source document $D=\{x_1,\dots,x_N\}$, LLM $\mathcal{M}$
2. Output: Translated document $Y$
3. Stage 1: Directed Discourse Graph Construction
4. Initialize $G=(V,E)$ with nodes $V=\{1,\dots,N\}$
5. for $i=1$ to $N$ do
6. for $j=1$ to $i-1$ do
7. Calculate weight $w_{ij}$ via semantic/sequential/keyword scores
8. if $w_{ij}>0$ then
9. Add edge $(i,j)$ to $E$ with weight $w_{ij}$
10. end if
11. end for
12. end for
13. Stage 2: Depth Heuristic Calculation
14. Compute $\phi(v)$ for all $v\in V$ via dynamic programming
15. Stage 3: Path-Based Generation
16. $Y\leftarrow\emptyset$
17. for $i=1$ to $N$ do
18. Sample path $\mathcal{P}_{\text{back}}$ from $x_i$ on $G$
19. $\mathcal{P}_{\text{ctx}}\leftarrow\text{Reverse}(\mathcal{P}_{\text{back}}\setminus\{x_i\})$
20. Construct prompt $I_i$ using $\mathcal{P}_{\text{ctx}}$ and $x_i$
21. $y_i\leftarrow\mathcal{M}(I_i)$
22. Append $y_i$ to $Y$
23. end for
24. Return $Y$

## 3 Experiments

### 3.1 Experimental Setup

##### Datasets

We evaluate our method on two standard benchmark test sets for DocMT. The first is the SAP test set. Derived from the WAT 2020 and 2021 shared tasks, it contains documents in the IT domain. For this test set, experiments are conducted on six translation directions: English↔Vietnamese, English↔Chinese, and English↔Indonesian. The sentences are organized into separate documents, comprising approximately 2,000 sentences per direction. We also employ the tst2017 test set from the IWSLT 2017 translation task, comprising parallel TED talk documents. For this test set, we evaluate on the following eight translation directions: English↔Chinese, English↔French, English↔German, and English↔Japanese. Each translation direction consists of 10 to 12 sentence-aligned parallel documents, totaling approximately 1,500 sentences. In both benchmarks, each document paragraph consists of a single sentence; thus each graph node corresponds to one sentence.

##### Baselines

We compare our model against the following baselines:

- Sentence: Each segment is translated independently, without any contextual information.
- Window-Based: A fixed-size sliding window of preceding sentences is used as context for translating the current sentence.
- Semantic-Based: Context paragraphs are selected based on embedding-based semantic similarity to the source paragraph.
- GRAFT Dutta et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib1)): A multi-agent framework for DocMT. The framework first determines the most relevant context paragraph and then extracts key alignment information, such as pronouns, entities, and phrases, to aid document-level translation.
- DelTA Wang et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib7)): An agentic framework aimed at translation consistency, featuring a multi-level memory architecture responsible for proper nouns, summaries, and variable-term contexts.

##### Settings

For the backbone translation models, we use the APIs for Gemini-2.5-Flash-lite, DeepSeek-v3-0324, Qwen-2.5-72B-Instruct, and Qwen3-235B-A22B-Instruct. To ensure reproducibility and deterministic outputs, we set the decoding temperature to 0 for all LLMs. For graph construction, we employ text-embedding-3-small provided by OpenAI for the semantic edge calculation ($S_{sem}$). The semantic similarity threshold $\tau_{\text{sem}}$ is empirically set to 0.6. The hyperparameters governing edge weight contribution are set to $\alpha=0.2$ (semantic), $\beta=0.3$ (sequential), and $\gamma=0.5$ (keyword), which places the largest weight on lexical overlap; these values were chosen based on performance on the validation set. We measure term importance using TF-IDF, where each paragraph is treated as a document to compute the IDF within the scope of the source text. For the biased random walk, we set the path depth bias $\lambda=2.0$ to encourage the selection of deeper context nodes. The maximum context path length (number of previous paragraphs included) is capped at $L=4$, and the walk stops when the distance exceeds 100. For the Window-Based and Semantic-Based baselines, we also set the context size to 4 paragraphs for a fair comparison.
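For reference, the reported hyperparameters can be collected in one place. The dataclass below is only a convenience sketch: the class and field names are hypothetical, the values are the ones stated in the settings above, and the number of top-$K$ keywords is left out because it is not reported in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class G2CMTConfig:
    """Hyperparameters as reported in the settings (field names are illustrative)."""
    embedding_model: str = "text-embedding-3-small"
    tau_sem: float = 0.6        # semantic similarity threshold
    alpha: float = 0.2          # semantic edge coefficient
    beta: float = 0.3           # sequential edge coefficient
    gamma: float = 0.5          # keyword edge coefficient
    lam: float = 2.0            # depth bias of the random walk
    max_path_len: int = 4       # cap on context paragraphs per path
    max_distance: int = 100     # the walk stops beyond this distance
    temperature: float = 0.0    # decoding temperature for all LLMs

    def __post_init__(self):
        # The edge coefficients are defined to sum to 1 (Eq. 1).
        if abs(self.alpha + self.beta + self.gamma - 1.0) > 1e-9:
            raise ValueError("alpha + beta + gamma must sum to 1")


config = G2CMTConfig()
print(config)
```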

Table 1: d-BLEU scores on the SAP benchmark across three different LLM backbones. Bold indicates the best performance. Results are averaged over 3 random walk seeds ($\sigma\leq 0.3$). Improvements of G²C-MT over Window-Based are significant ($p<0.05$, bootstrap).

Table 2: d-BLEU scores on the IWSLT 2017 test set based on Qwen-2.5-72B-Instruct for fair comparison with prior work.

### 3.2 Main Results

#### 3.2.1 Performance on Technical Documentation

The evaluation results for all translation directions and backbones on the SAP dataset are presented in Table [1](https://arxiv.org/html/2606.03078#S3.T1). First, we observe that discourse context plays an important role in enhancing translation quality for DocMT. The Sentence baseline consistently underperforms all context-aware methods, trailing the simple Window-Based context by an average of 2–3 d-BLEU. Second, G²C-MT consistently achieves the highest d-BLEU scores across all translation directions and backbones, validating the robustness and generality of our graph-based approach regardless of the underlying model architecture. Third, we observe that the Window-Based approach usually outperforms the Semantic-Based approach in this domain. This is intuitive for technical documentation, where logical progression (e.g., step 1, step 2) often implies that the adjacent sentences are the most relevant context. However, G²C-MT surpasses the Window-Based baseline by substantial margins, for instance achieving a gain of +0.9 d-BLEU on EN→VI and +1.2 d-BLEU on VI→EN using Gemini-2.5-Flash-lite. This indicates that even in highly sequential documents, our Depth-Biased Sampling successfully retrieves necessary long-range constraints (such as terminology defined earlier in the document) that a fixed window misses, without introducing the noise associated with pure semantic retrieval. We also note the unstable performance of the Semantic-Based method, which sometimes even underperforms the Sentence baseline (EN→ID). This suggests that treating context as a set of independent fragments can disrupt the logical flow, leading to incoherent translations.

#### 3.2.2 Performance on Narrative Discourse

Table [2](https://arxiv.org/html/2606.03078#S3.T2) shows the results on the IWSLT 2017 benchmark using Qwen-2.5-72B-Instruct. This dataset is challenging because of its loose conversational structure and long-range thematic dependencies. First, G²C-MT significantly outperforms the supervised and commercial baselines (NLLB-3.3B and Google Translate) across all directions, confirming the efficacy of LLMs for document-level translation when prompted with appropriate context. Second, our method achieves comparable or higher d-BLEU scores than recent state-of-the-art agentic frameworks (e.g., +1.4 over GRAFT in EN→ZH and +2.1 over DelTA in JP→EN). This result is notable given the computational efficiency of our method, as GRAFT and DelTA rely on computationally expensive, multi-step LLM calls to curate context or maintain memory modules.

##### Revisiting Context: Structure vs\. Similarity\.

A notable pattern across both datasets is that the Semantic-Based baseline frequently fails to outperform the simple Window-Based approach, particularly on SAP (Table [1](https://arxiv.org/html/2606.03078#S3.T1)) and several IWSLT directions (e.g., EN→DE, JP→EN). This highlights a critical fact in document translation: local cohesion often outweighs global topical relevance. Pure semantic retrieval may break the linear narrative required for immediate syntactic dependency handling (e.g., pronoun resolution). G²C-MT avoids this trade-off by rooting retrieval in the discourse structure. By explicitly modeling sequential edges ($S_{seq}$) alongside semantic ones, our graph traversal prioritizes the immediate history while the depth-biased sampling allows the model to trace logical threads back to earlier context.

## 4 Analysis

In this section, we conduct a deeper analysis to investigate: \(1\) whether G2C\-MT effectively handles discourse phenomena; \(2\) how the model exploits long\-range dependencies; \(3\) the potential of multi\-path sampling for robustness; and \(4\) the contribution of each component through ablation studies\.

Table 3: Evaluation of discourse metrics on the SAP dataset (EN→XX). Higher is better for all metrics.

### 4.1 Evaluation of Discourse Metrics

We adopt three specialized metrics to further investigate the discourse translation performance: BlonDe Jiang et al. ([2022](https://arxiv.org/html/2606.03078#bib.bib2)), which explicitly tracks discourse phenomena such as entities, tenses, and pronouns; d-Prism Thompson and Post ([2020](https://arxiv.org/html/2606.03078#bib.bib4)); Vernikos et al. ([2022](https://arxiv.org/html/2606.03078#bib.bib3)), which measures semantic consistency using probability scores; and d-Comet Rei et al. ([2022](https://arxiv.org/html/2606.03078#bib.bib5)); Vernikos et al. ([2022](https://arxiv.org/html/2606.03078#bib.bib3)), which evaluates translation quality by considering the preceding context.

As shown in Table [3](https://arxiv.org/html/2606.03078#S4.T3), G²C-MT achieves the best performance across all metrics. Window-Based achieves high BlonDe scores but obtains lower d-Prism scores than Semantic-Based. This indicates that the Window-Based method captures local connections well, while the Semantic-Based method is good at finding relevant topics in long-range context. G²C-MT mitigates this trade-off by obtaining the highest BlonDe score (51.35) while maintaining strong semantic consistency through the combination of sequential and semantic edges.

![Refer to caption](https://arxiv.org/html/2606.03078v1/Figure_1.png)

Figure 2: Distribution of Context Distances. The red dashed line represents the hard cutoff of the standard Window-Based approach (Window=4). The blue histogram shows the context selected by G²C-MT. Our method retains strong attention to local context (peak at $\Delta<10$) while maintaining a long tail of retrieval capabilities extending up to 100 sentences back, capturing long-range dependencies that linear models miss.

### 4.2 Properties of Graph-Guided Context Selection

To better understand the efficacy of the proposed pipeline, we characterize the context selection behavior of G²C-MT compared to the Semantic-Based and Window-Based methods. We define the Context Distance $\Delta=i-k$ between the target paragraph $x_i$ and a selected context paragraph $x_k$, and visualize the distribution of the selected nodes in Figure [2](https://arxiv.org/html/2606.03078#S4.F2). We can observe two key properties of G²C-MT from the figure:

- Beyond Fixed Windows: The Window-Based method uses a fixed window size for context selection, where any dependency beyond the window size $L$ is inaccessible. G²C-MT can walk through semantic edges ($S_{sem}$) to skip over intermediate nodes. As Figure [2](https://arxiv.org/html/2606.03078#S4.F2) shows, G²C-MT effectively captures context at distances $\Delta>20$.
- Beyond Isolated Retrieval (Coherence over Similarity): The Semantic-Based method retrieves context based on similarity scores, resulting in a bag of sentences $\{x_k\}$ without structural connections. In contrast, the contexts selected by G²C-MT are the components of a connected path $\mathcal{P}_i$. The nodes are strongly connected, since the edges of the path are weighted by discourse factors (semantic, sequential, and lexical). The explicitly modeled sequential edges ($S_{seq}$) act as glue, so the distribution of context distances in G²C-MT still peaks at small $\Delta$ values.
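The distance statistic behind this analysis is straightforward to compute from sampled paths. The sketch below, with hypothetical function names and made-up example paths, collects $\Delta=i-k$ for every context node and bins the values into a coarse histogram.

```python
from collections import Counter


def context_distances(paths):
    """Distances i - k between each path's target (first node) and its context nodes."""
    return [path[0] - k for path in paths for k in path[1:]]


def distance_histogram(distances, bin_width=10):
    """Count distances per bin; each key is the lower edge of its bin."""
    return Counter((d // bin_width) * bin_width for d in distances)


# Made-up backward paths: target index first, then the selected context nodes.
example_paths = [[50, 49, 30], [10, 9]]
distances = context_distances(example_paths)
print(distances)                      # [1, 20, 1]
print(distance_histogram(distances))  # Counter({0: 2, 20: 1})
```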

Table 4: Effect of Multi-Path Exploration on translation quality (En→Vi).

### 4.3 Multi-Path Sampling for Robustness

In this section, we analyze the effect of multi-path sampling in G²C-MT. As introduced in Section [2.3](https://arxiv.org/html/2606.03078#S2.SS3.SSS0.Px1), we first sample multiple context paths for the same target paragraph by random walks, and then select the final translation from the resulting translation candidates. We use the embedding model to vectorize each candidate translation, and then select the candidate closest to the cluster centroid as the final output. We experimented with sampling $K\in\{1,3,5\}$ paths on the difficult En→Vi subset, leaving larger $K$ values for future work due to inference cost. As shown in Table [4](https://arxiv.org/html/2606.03078#S4.T4), d-BLEU improves consistently as we increase the number of sampled paths $K$. When $K=5$, we achieve the best performance of 67.9 d-BLEU, which is +0.6 higher than the single-path baseline. This suggests that when translating ambiguous sentences, different context paths may provide complementary information, leading to more robust translations. Although this comes at the cost of increased inference latency, it offers a flexible trade-off for scenarios where quality takes priority.

Table 5: Ablation study on edge components and search strategy on Gemini-2.5-Flash-lite.
### 4.4 Ablation Study

We conduct an ablation study on the SAP dataset to analyze the effect of different edge types and our proposed search strategies. The results are summarized in Table [5](https://arxiv.org/html/2606.03078#S4.T5). Removing the Keyword Edge or the Biased Random Walk results in the largest performance drops (-0.5). This result is intuitive, as terminological consistency is crucial in technical documents, especially for the SAP dataset, which is in the IT domain. The Biased Random Walk also plays an important role in preventing the model from settling for shallow, uninformative context. Another notable observation is that removing the Sequential Edge ($S_{seq}$) causes a larger performance drop (-0.4) than removing the Semantic Edge ($S_{sem}$) (-0.2). This indicates that local sequential order is crucial for immediate discourse coherence (e.g., pronoun resolution), while the smaller impact of semantic edges may imply that the Keyword Edge has already captured much of the necessary topical relevance.

### 4.5 Case Study

Table [6](https://arxiv.org/html/2606.03078#S4.T6) demonstrates how G2C-MT handles long-range ambiguity. The source term "posting" is polysemous: it generically means publishing, but in this specific ERP context it refers to an accounting entry. The defining context appears ten paragraphs earlier, causing the window-based baseline to default to the generic, incorrect translation. In contrast, G2C-MT, guided by keyword overlap edges, successfully retrieves the prior paragraph and generates the correct domain terminology.

Table 6: A case study on long-range lexical disambiguation in the SAP dataset.

## 5 Related Work

### 5.1 LLM-based Document-Level MT

LLMs have demonstrated strong in-context learning and long-context modeling capabilities, which makes them suitable for DocMT Wang et al. ([2023](https://arxiv.org/html/2606.03078#bib.bib14)); Wu et al. ([2024](https://arxiv.org/html/2606.03078#bib.bib15)); Cui et al. ([2024](https://arxiv.org/html/2606.03078#bib.bib16)). One recent line of work combines graph structures with LLMs to generate discourse-aware translations, since a graph can naturally model discourse dependencies. [Dutta et al.](https://arxiv.org/html/2606.03078#bib.bib1) proposed GRAFT, which uses an LLM agent to segment documents into discourse units, identify context dependencies, and form a directed acyclic discourse graph. TransGraph Pham et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib8)) similarly models inter-chunk discourse relations via LLM-based relation classification. Our work differs from both along three axes: (i) Graph construction cost: GRAFT and TransGraph require $O(N^2)$ LLM calls for pairwise edge classification, whereas G2C-MT constructs edges using lightweight embedding similarity and TF-IDF in a single pass; (ii) Context depth: prior methods restrict context selection to first-order neighbors, while our depth-biased random walk discovers multi-hop paths spanning up to 100 nodes; (iii) Multi-path robustness: stochastic traversal enables multi-path sampling (Table [4](https://arxiv.org/html/2606.03078#S4.T4)), a capability absent in prior graph-based DocMT. Another line of work focuses on building memory mechanisms Wang et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib7)) to maintain document-level consistency during translation, but these approaches tend to underemphasize explicit discourse structure modeling.
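To give a sense of scale for the graph construction cost in (i), consider a rough worked count. It assumes, purely for illustration, that pairwise classification issues one LLM call per unordered paragraph pair and that the embedding-based alternative encodes each paragraph exactly once; neither detail is taken from the cited systems.

$$
\binom{N}{2} = \frac{N(N-1)}{2}, \qquad N = 100 \;\Rightarrow\; \frac{100 \cdot 99}{2} = 4950 \text{ LLM calls, versus } N = 100 \text{ encoder calls.}
$$

Under these assumptions the embedding-based graph still compares all pairs, but each comparison is an inexpensive vector operation rather than a model call.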

### 5.2 Structured Reasoning and Self-Consistency in LLMs

LLMs' reasoning capabilities can be effectively invoked through sophisticated prompting techniques instead of simple instruction-response paradigms. Chain-of-Thought (CoT) prompting Wei et al. ([2022](https://arxiv.org/html/2606.03078#bib.bib17)) shows the potential of handling complex reasoning tasks by generating intermediate reasoning steps. Building on this work, non-linear reasoning frameworks have been proposed, such as Tree of Thoughts (ToT) Yao et al. ([2023](https://arxiv.org/html/2606.03078#bib.bib18)) and Graph of Thoughts (GoT) Besta et al. ([2024](https://arxiv.org/html/2606.03078#bib.bib19)), further improving reasoning performance by exploring multiple reasoning paths. Beyond structural exploration, single-path generation can be extended to multi-path generation to enhance robustness via self-consistency Wang et al. ([2022](https://arxiv.org/html/2606.03078#bib.bib20)). This paradigm is particularly suitable for DocMT, since the translation of a given paragraph often relies on ambiguous discourse clues that may permit multiple valid interpretations. In this work, we bridge these mechanisms by integrating graph-based structural modeling with multi-path sampling for DocMT.

### 5.3 Retrieval-based Machine Translation

Early work in NMT explored retrieval-based methods to improve domain adaptation and translation quality. One representative line of work Khandelwal et al. ([2021](https://arxiv.org/html/2606.03078#bib.bib21)); Meng et al. ([2022](https://arxiv.org/html/2606.03078#bib.bib22)) retrieves token-level examples from a vector datastore to calibrate the model's output distribution. With the powerful in-context learning ability of LLMs, retrieving sentence-level few-shot examples for LLMs has become a dominant paradigm Agrawal et al. ([2023](https://arxiv.org/html/2606.03078#bib.bib24)); Ji et al. ([2024](https://arxiv.org/html/2606.03078#bib.bib23)); Zebaze et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib25)). Wang et al. ([2025](https://arxiv.org/html/2606.03078#bib.bib7)); Cui et al. ([2024](https://arxiv.org/html/2606.03078#bib.bib16)) address document-level translation by retrieving relevant examples, but they still lack explicit modeling of discourse structure and primarily consider semantic similarity, treating the document as a bag of unrelated sentences.

## 6 Conclusion

In this paper, we presented G2C\-MT, a novel framework for document\-level machine translation\. Our method models the document as a directed graph to capture structured discourse dependencies\. By using depth\-biased random walks, we select high\-quality context paths as discourse context for prompting LLMs\. Experiments across technical and narrative domains show that G2C\-MT significantly outperforms strong baselines\. Furthermore, our approach is computationally efficient compared to complex agent\-based methods\. We believe this graph\-guided strategy provides a robust solution for long\-text translation tasks\.

## Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. This work was supported by the National Natural Science Foundation of China (Grant Nos. 62276179, 62537001) and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions. Xiangyu Duan is the corresponding author.

## References

- S. Agrawal, C. Zhou, M. Lewis, L. Zettlemoyer, and M. Ghazvininejad (2023) In-context examples selection for machine translation. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 8857–8873.
- M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, et al. (2024) Graph of thoughts: solving elaborate problems with large language models. In Proceedings of the AAAI conference on artificial intelligence, Vol. 38, pp. 17682–17690.
- Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia (2023) Longlora: efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307.
- M. Cui, J. Du, S. Zhu, and D. Xiong (2024) Efficiently exploring large language models for document-level machine translation with in-context learning. arXiv preprint arXiv:2406.07081.
- H. Dutta, S. Manchanda, P. Bapat, M. R. Gurjar, and P. Bhattacharyya (2025) GRAFT: a graph-based flow-aware agentic framework for document-level machine translation. arXiv preprint arXiv:2507.03311.
- T. Gao, A. Wettig, H. Yen, and D. Chen (2025) How to train long-context language models (effectively). In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7376–7399.
- B. Ji, X. Duan, Z. Qiu, T. Zhang, J. Li, H. Yang, and M. Zhang (2024) Submodular-based in-context example selection for llms-based machine translation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 15398–15409.
- Y. E. Jiang, T. Liu, S. Ma, D. Zhang, J. Yang, H. Huang, R. Sennrich, R. Cotterell, M. Sachan, and M. Zhou (2022) BlonDe: an automatic evaluation metric for document-level machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, United States, pp. 1550–1565. External Links: [Link](https://aclanthology.org/2022.naacl-main.111), [Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.111)
- U. Khandelwal, A. Fan, D. Jurafsky, L. Zettlemoyer, and M. Lewis (2021) Nearest neighbor machine translation. In International Conference on Learning Representations (ICLR).
- N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2023) Lost in the middle: how language models use long contexts. External Links: 2307.03172, [Link](https://arxiv.org/abs/2307.03172)
- Y. Meng, X. Li, X. Zheng, F. Wu, X. Sun, T. Zhang, and J. Li (2022) Fast nearest neighbor machine translation. In Findings of the Association for Computational Linguistics: ACL 2022, S. Muresan, P. Nakov, and A. Villavicencio (Eds.), Dublin, Ireland, pp. 555–565. External Links: [Link](https://aclanthology.org/2022.findings-acl.47/), [Document](https://dx.doi.org/10.18653/v1/2022.findings-acl.47)
- V. Pham, M. Wang, H. Liao, and T. Vu (2025) Discourse graph guided document translation with large language models. arXiv preprint arXiv:2511.07230.
- R. Rei, J. G. C. de Souza, D. Alves, C. Zerva, A. C. Farinha, T. Glushkova, A. Lavie, L. Coheur, and A. F. T. Martins (2022) COMET-22: unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. Jimeno Yepes, T. Kocmi, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, M. Negri, A. Névéol, M. Neves, M. Popel, M. Turchi, and M. Zampieri (Eds.), Abu Dhabi, United Arab Emirates (Hybrid), pp. 578–585. External Links: [Link](https://aclanthology.org/2022.wmt-1.52/)
- B. Thompson and M. Post (2020) Paraphrase generation as zero-shot multilingual translation: disentangling semantic similarity from lexical and syntactic diversity. In Proceedings of the Fifth Conference on Machine Translation (Volume 1: Research Papers), Online.
- G. Vernikos, B. Thompson, P. Mathur, and M. Federico (2022) Embarrassingly easy document-level mt metrics: how to convert any pretrained metric into a document-level metric. In Proceedings of the Seventh Conference on Machine Translation, Abu Dhabi, United Arab Emirates. External Links: [Link](https://statmt.org/wmt22/pdf/2022.wmt-1.6.pdf)
- L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu, S. Shi, and Z. Tu (2023) Document-level machine translation with large language models. ArXiv abs/2304.02210. External Links: [Link](https://api.semanticscholar.org/CorpusID:257952312)
- X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2022) Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- Y. Wang, J. Zeng, X. Liu, D. F. Wong, F. Meng, J. Zhou, and M. Zhang (2025) DelTA: an online document-level translation agent based on multi-level memory. In The Thirteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=hoYFLRNbhc)
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 36, pp. 24824–24837.
- M. Wu, T. Vu, L. Qu, G. Foster, and G. Haffari (2024) Adapting large language models for document-level machine translation. arXiv preprint arXiv:2401.06468.
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan (2023) Tree of thoughts: deliberate problem solving with large language models. Advances in neural information processing systems 36, pp. 11809–11822.
- A. R. Zebaze, B. Sagot, and R. Bawden (2025) In-context example selection via similarity search improves low-resource machine translation. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 1222–1252.
