Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

arXiv cs.CL 05/11/26, 04:00 AM Papers
Summary
This paper introduces a topology-enhanced alignment framework for LLMs, utilizing trajectory topology loss and topological preference optimization based on persistent homology to regularize semantic trajectories in hidden space.
arXiv:2605.07172v1 Announce Type: new Abstract: Alignment of large language models (LLMs) via SFT and RLHF/DPO typically ignores the global geometry of the representation space, relying instead on local token likelihoods or scalar scores. We view generation as tracing a semantic trajectory in hidden space and propose a topology-enhanced alignment framework that regularizes these trajectories using 0-dimensional persistent homology. First, for SFT, we introduce Trajectory Topology Loss (TTL). Treating prompt and gold-answer embeddings as a mixed point cloud, we use a 0D persistent homology algorithm to extract "prompt-answer bridges." TTL aligns the model's actual update direction with these topological bridges rather than arbitrary directions. Second, for DPO, we propose Topological Preference Optimization (TPO). TPO constructs topic-specific semantic preference vectors and aligns the improvement direction between rejected and chosen responses with these vectors in an intermediate hidden layer. We also introduce a dynamic weighting scheme to balance DPO and TPO losses. Evaluating on Qwen2.5-7B-Instruct using UltraChat and Anthropic HH-RLHF, our topology-enhanced objectives consistently outperform strong non-topological baselines (e.g., per-example, nearest-neighbor, random regularizers) on automatic preference metrics and LLM-judge evaluations, while maintaining or improving toxicity. Results show persistent homology and trajectory geometry offer a promising direction for controllable alignment.
# Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization
Source: [https://arxiv.org/html/2605.07172](https://arxiv.org/html/2605.07172)
Yurui Pan¹, Ke Xu², Bo Peng³

¹School of Computing and Intelligent Innovation, Fudan University; ²School of Economics and Management, Tongji University; ³College of Information Technology, Shanghai Ocean University

yrpan24@m.fudan.edu.cn, kexu567@tongji.edu.cn, bpeng@shou.edu.cn

###### Abstract

Alignment of large language models (LLMs) typically relies on supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), or more recently direct preference optimization (DPO). However, existing objectives largely ignore the global geometry and topology of the representation space: they operate on local token-level likelihoods or scalar preference scores, and do not explicitly constrain how hidden states move from a user prompt to an answer.

We view generation as tracing a *semantic trajectory* in hidden space, and propose a topology-enhanced alignment framework that regularizes these trajectories using 0-dimensional persistent homology. First, at the SFT stage, we introduce a Trajectory Topology Loss (TTL). For each batch, we treat mean-pooled embeddings of prompts and gold answers as a mixed point cloud, run a Union-Find-based 0D persistent homology algorithm, and extract “prompt–answer bridge” edges that connect previously disconnected components. TTL encourages the model’s actual update direction from prompt to answer to align with these topologically derived bridges, rather than with arbitrary or per-example directions.

Second, at the RLHF/DPO stage, we propose Topological Preference Optimization (TPO). TPO constructs topic-specific semantic preference vectors from an offline pipeline and aligns the semantic improvement direction between rejected and chosen responses with these vectors in an intermediate hidden layer. We further introduce an exponential-moving-average-based dynamic weighting scheme to balance DPO and TPO losses, and also explore a fully topological variant that applies persistent homology on the chosen/rejected embedding cloud.

We instantiate our methods on Qwen2.5-7B-Instruct and evaluate on UltraChat and Anthropic HH-RLHF. Across both SFT and DPO training, topology-enhanced objectives consistently outperform strong non-topological baselines (including per-example, nearest-neighbor, and random direction regularizers) on automatic preference metrics and LLM-judge evaluations, while maintaining or slightly reducing toxicity. These results suggest that incorporating persistent homology and trajectory geometry is a promising and practical direction for more controllable LLM alignment.

Yurui Pan and Ke Xu contributed equally. Bo Peng is the corresponding author.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.07172v1/figs/fig1.jpg)

Figure 1: Conceptual comparison between traditional alignment and our topology-enhanced alignment in hidden space. Left: Traditional alignment optimizes local, pairwise losses on prompt and answer embeddings without explicitly modeling global structure. Right: Our topology-enhanced view treats prompts and answers as a joint point cloud, extracts cross-manifold bridges via 0D persistent homology, and regularizes model trajectories to follow these bridges.

Large language models (LLMs) have achieved impressive performance on a wide range of tasks, including open-domain dialogue, code generation, and complex reasoning Brown et al. ([2020](https://arxiv.org/html/2605.07172#bib.bib7)); Vaswani et al. ([2017](https://arxiv.org/html/2605.07172#bib.bib30)). Despite this progress, aligning LLM behaviors with human values and preferences remains a central challenge. The dominant paradigm combines *supervised fine-tuning* (SFT) on instruction-following data with *reinforcement learning from human feedback* (RLHF) Ouyang et al. ([2022](https://arxiv.org/html/2605.07172#bib.bib22)); Bai et al. ([2022](https://arxiv.org/html/2605.07172#bib.bib3)); Christiano et al. ([2017](https://arxiv.org/html/2605.07172#bib.bib10)); Stiennon et al. ([2020](https://arxiv.org/html/2605.07172#bib.bib28)) or more recent direct preference optimization (DPO) approaches Rafailov et al. ([2024](https://arxiv.org/html/2605.07172#bib.bib24)).

Although SFT and RLHF/DPO have proven highly effective in practice, they share a key limitation: they largely ignore the *geometry* and *topology* of the internal representation space. Standard objectives focus on local signals (token-level likelihoods in SFT, or scalar preference scores in RLHF) and do not directly supervise how the model’s hidden states move from a user prompt to a final answer.

However, an LLM’s response generation process can naturally be viewed as tracing a *trajectory* through its hidden space: starting from a representation of the prompt, the model iteratively updates its internal state as it produces each token of the answer. Different answers (e.g., helpful vs. unhelpful, safe vs. unsafe) correspond to different trajectories. If we could shape these trajectories to follow semantically meaningful directions (for example, from a prompt state towards a manifold of high-quality answers), we might obtain more robust and interpretable alignment behavior.

In parallel, the field of Topological Data Analysis (TDA) studies the shape of data manifolds using tools such as persistent homology Edelsbrunner and Harer ([2010](https://arxiv.org/html/2605.07172#bib.bib12)); Carlsson ([2009](https://arxiv.org/html/2605.07172#bib.bib9)); Ghrist ([2008](https://arxiv.org/html/2605.07172#bib.bib15)). Given a point cloud and a distance metric, persistent homology tracks how connected components and higher-dimensional features appear and merge across scales. Even in the simplest case of 0-dimensional homology, the resulting “death edges” reveal how different clusters of points connect, providing a multi-scale skeleton of the data. In Euclidean space, these 0D death edges coincide with the edges of a minimum spanning forest; we use the persistent-homology view because it naturally highlights the cross-label merge events at which prompt and answer components first become connected across scales.

This paper brings these two perspectives together\. We ask:

> *Can we use topological information about hidden representations to regularize LLM alignment, by explicitly constraining semantic trajectories in hidden space?*

We answer this question affirmatively by proposing a unified, topology\-enhanced alignment framework with two components:

- At the SFT stage, we introduce a Trajectory Topology Loss (TTL). For each batch, we treat the mean-pooled embeddings of prompts and gold answers as a mixed point cloud. Using a 0D persistent homology algorithm implemented via a Union-Find structure Tarjan ([1975](https://arxiv.org/html/2605.07172#bib.bib29)), we identify “prompt–answer bridges”: edges that connect previously separate connected components. We view these bridges as topologically informed trajectories from prompts towards the gold answer manifold, and regularize the model so that its actual update direction from prompt to model answer aligns with these bridges.
- At the RLHF/DPO stage, we propose Topological Preference Optimization (TPO), which aligns the semantic improvement direction between rejected and chosen responses with topic-specific preference vectors constructed by an offline pipeline. We further introduce a dynamic weighting scheme based on an exponential moving average (EMA) to balance DPO and TPO losses, and explore a fully topological TPO variant using persistent homology on the chosen/rejected cloud.

We instantiate our methods on Qwen2.5-7B-Instruct and evaluate on UltraChat Ding et al. ([2023](https://arxiv.org/html/2605.07172#bib.bib11)) for SFT and Anthropic HH-RLHF Bai et al. ([2022](https://arxiv.org/html/2605.07172#bib.bib3)) for DPO. Our empirical findings are:

- Topology-enhanced SFT with TTL yields consistent improvements in reward-model scores and LLM-judge helpfulness ratings compared to a strong SFT baseline, with negligible increase in toxicity.
- TPO on top of DPO provides higher preference win-rates and better helpfulness/harmlessness trade-offs than plain DPO, across different hidden layers and clustering granularities.
- Ablations confirm that (i) persistent-homology-derived bridges outperform random, per-example, and nearest-neighbor prompt–answer pairings, and (ii) topic-aware preference vectors and dynamic weighting are both important for TPO’s effectiveness.

Collectively, our results indicate that even simple 0D topological information can provide useful structure for regularizing hidden\-space trajectories during alignment\.

#### Contributions\.

This paper makes the following contributions:

- We propose a *trajectory-centric* view of LLM alignment, where the update from a prompt representation to an answer representation is treated as an explicit semantic trajectory in hidden space, rather than only being supervised via token-level likelihoods or scalar rewards.
- We introduce Trajectory Topology Loss (TTL) for SFT, which uses 0D persistent homology on a mixed prompt/gold-answer point cloud to extract a sparse set of topological “bridges”. TTL regularizes the model so that its prompt-to-answer trajectories align with these bridges, and we show that this outperforms non-topological baselines such as per-example, random, and kNN-based direction regularization.
- We propose Topological Preference Optimization (TPO) for the DPO stage, which aligns hidden-space improvement directions between rejected and chosen responses with topic-aware semantic preference vectors derived from an offline clustering and templating pipeline. We further introduce an EMA-based dynamic weighting scheme and a fully topological TPO variant on the chosen/rejected embedding cloud.
- We provide an empirical study on Qwen2.5-7B-Instruct with UltraChat and HH-RLHF, demonstrating consistent gains over strong SFT and DPO baselines on reward-model scores, preference win-rates, and helpfulness/harmlessness metrics, with modest training overhead.

![Refer to caption](https://arxiv.org/html/2605.07172v1/x1.png)

Figure 2: Overview of our topology-enhanced alignment framework. The left part shows SFT with Trajectory Topology Loss (TTL), which adds a cosine loss on topology-derived bridges between prompt and gold-answer embeddings. The right part shows DPO with Topological Preference Optimization (TPO), which aligns rejected-to-chosen hidden-state differences with topic-specific preference vectors.

Each death edge corresponds to the “death” of a connected component when it merges into an older one. Collectively, these edges form a tree structure that captures how initially separate regions of the point cloud become connected as we increase the distance threshold Carlsson ([2009](https://arxiv.org/html/2605.07172#bib.bib9)).

In our setting, we exploit this structure to identify *bridges* between points of different semantic categories (e.g., prompts vs. answers, rejected vs. chosen). These bridges provide directions in representation space that are informed by the global geometry and topology of the batch, rather than by arbitrary or local choices. Intuitively, these bridges identify where the prompt manifold and the answer manifold first “touch” as we move from local neighborhoods to more global structure. Collectively, they form a sparse global skeleton that abstracts away many noisy local connections and yields more stable directions for trajectory regularization.

## 2 Related Work

#### Alignment of large language models.

Alignment methods such as RLHF Ouyang et al. ([2022](https://arxiv.org/html/2605.07172#bib.bib22)); Bai et al. ([2022](https://arxiv.org/html/2605.07172#bib.bib3)); Christiano et al. ([2017](https://arxiv.org/html/2605.07172#bib.bib10)); Stiennon et al. ([2020](https://arxiv.org/html/2605.07172#bib.bib28)) and DPO Rafailov et al. ([2024](https://arxiv.org/html/2605.07172#bib.bib24)) have become standard for controlling LLM behaviors. Subsequent work explores variations in reward modeling, off-policy optimization, and preference data curation Rafailov et al. ([2024](https://arxiv.org/html/2605.07172#bib.bib24)); Bai et al. ([2022](https://arxiv.org/html/2605.07172#bib.bib3)). Our work is orthogonal: we focus on incorporating geometric and topological constraints into existing pipelines. The foundation of ranking preferences in these models often traces back to statistical models like Bradley-Terry Bradley and Terry ([1952](https://arxiv.org/html/2605.07172#bib.bib6)) or Plackett-Luce Plackett ([1975](https://arxiv.org/html/2605.07172#bib.bib23)).

#### Representation geometry in deep learning\.

A growing body of work studies the geometry of neural representations, including manifold structure, anisotropy Ethayarajh ([2019](https://arxiv.org/html/2605.07172#bib.bib13)); Ortiz-Jiménez et al. ([2020](https://arxiv.org/html/2605.07172#bib.bib21)), and linear probes for concepts Bau et al. ([2017](https://arxiv.org/html/2605.07172#bib.bib5)). Some methods exploit representation geometry for curriculum learning or out-of-distribution detection Hendrycks and Gimpel ([2017](https://arxiv.org/html/2605.07172#bib.bib16)); Lee et al. ([2018](https://arxiv.org/html/2605.07172#bib.bib19)). Other works analyze the expressivity and disentanglement of representations Raghu et al. ([2017](https://arxiv.org/html/2605.07172#bib.bib25)); Achille and Soatto ([2018](https://arxiv.org/html/2605.07172#bib.bib1)). We add to this line by treating hidden-space trajectories themselves as objects to be regularized, informed by topological structure.

#### Topological data analysis in neural networks\.

TDA has been used to analyze the shape of feature spaces and decision boundaries in deep networks Rieck et al. ([2019](https://arxiv.org/html/2605.07172#bib.bib27)); Ballester et al. ([2024](https://arxiv.org/html/2605.07172#bib.bib4)), and to design regularizers for robustness Adams et al. ([2015](https://arxiv.org/html/2605.07172#bib.bib2)); Bubenik ([2015](https://arxiv.org/html/2605.07172#bib.bib8)); Hofer et al. ([2019](https://arxiv.org/html/2605.07172#bib.bib17)). The theoretical underpinnings rely on persistent homology and barcodes Ghrist ([2008](https://arxiv.org/html/2605.07172#bib.bib15)); Edelsbrunner and Harer ([2010](https://arxiv.org/html/2605.07172#bib.bib12)). However, applications to large-scale sequence models and LLM alignment remain limited. To our knowledge, we are the first to use 0D persistent homology explicitly as a training signal for LLM alignment at both SFT and RLHF stages.

## 3 Method

We propose a topology-enhanced alignment framework that regularizes hidden-space trajectories at both the SFT and DPO stages (Figure [2](https://arxiv.org/html/2605.07172#S1.F2)). At SFT time, Trajectory Topology Loss (TTL) shapes how hidden states move from prompts to answers. At DPO time, Topological Preference Optimization (TPO) shapes how hidden states move from rejected to chosen responses along topic-specific preference directions.

### 3.1 Notation

Let $f_{\theta}$ denote an LLM with parameters $\theta$. For an input sequence $x=(x_{1},\dots,x_{n})$ with attention mask $m\in\{0,1\}^{n}$, layer $l$ produces hidden states $H^{(l)}\in\mathbb{R}^{n\times d}$ and we mean-pool non-padding tokens:

$$h^{(l)}(x)=\frac{\sum_{i}m_{i}H^{(l)}_{i}}{\sum_{i}m_{i}}. \tag{1}$$

When the layer is clear from context we write $h(x)$ for brevity. We use $x^{\text{prompt}}$ for the dialogue history up to the last user turn, $y^{\text{gold}}$ for the ground-truth assistant answer, and $y^{\text{model}}$ for the model answer (either gold tokens under teacher forcing or sampled tokens). For DPO, $y^{\text{chosen}}$ and $y^{\text{rejected}}$ denote the preferred and dispreferred responses.
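As a concrete reading of Eq. (1), the following is a minimal, framework-free sketch of masked mean pooling. The function name `mean_pool` and the toy inputs are illustrative assumptions, not taken from the paper; a real implementation would presumably operate on batched tensors.

```python
from typing import List, Sequence


def mean_pool(hidden: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the rows of `hidden` whose mask entry is 1, as in Eq. (1)."""
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must have the same length")
    count = sum(mask)
    if count == 0:
        raise ValueError("mask selects no tokens")
    dim = len(hidden[0])
    pooled = [0.0] * dim
    for row, m in zip(hidden, mask):
        if m:  # skip padding positions
            for k in range(dim):
                pooled[k] += row[k]
    return [v / count for v in pooled]


# Three token states of width 2; the last position is padding and is ignored.
H = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
m = [1, 1, 0]
print(mean_pool(H, m))  # [2.0, 3.0]
```

The padding row does not affect the result because it contributes to neither the numerator nor the denominator.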

### 3.2 Trajectory Topology Loss for SFT

TTL encourages the model’s prompt\-to\-answer trajectory in hidden space to align with topology\-derived directions from prompt regions to the gold\-answer manifold\.

#### Point cloud construction\.

For each SFT example we split the sequence into prompt and answer tokens using the chat template and compute three representations:

- $h^{\text{prompt}}\in\mathbb{R}^{d}$: mean-pooled last-layer hidden state over prompt tokens;
- $h^{\text{model}}\in\mathbb{R}^{d}$: mean-pooled last-layer hidden state over answer tokens (teacher forcing);
- $h^{\text{gold}}\in\mathbb{R}^{d}$: mean-pooled *input embeddings* of gold-answer tokens (akin to the vector space concepts in Mikolov et al. ([2013](https://arxiv.org/html/2605.07172#bib.bib20))).

Over a batch of size $B$ we form

$$H^{\text{prompt}}=[h^{\text{prompt}}_{1},\dots,h^{\text{prompt}}_{B}]^{\top}\in\mathbb{R}^{B\times d}, \tag{2}$$

$$H^{\text{gold}}=[h^{\text{gold}}_{1},\dots,h^{\text{gold}}_{B}]^{\top}\in\mathbb{R}^{B\times d}, \tag{3}$$

and a mixed point cloud

$$Z=\begin{bmatrix}H^{\text{prompt}}\\ H^{\text{gold}}\end{bmatrix}\in\mathbb{R}^{2B\times d}, \tag{4}$$

with labels $l_{i}=0$ for prompts (rows $1{:}B$) and $l_{i}=1$ for gold answers (rows $B{+}1{:}2B$).

#### Topological bridges via 0D persistent homology\.

We compute the pairwise distance matrix $D_{ij}=\|Z_{i}-Z_{j}\|_{2}$ and run a standard 0D persistent-homology algorithm based on Union–Find Tarjan ([1975](https://arxiv.org/html/2605.07172#bib.bib29)), which processes edges in non-decreasing order of $D_{ij}$ and records the edges that merge previously disconnected components (death edges). (Algorithmic details and pseudocode are given in Appendix [D](https://arxiv.org/html/2605.07172#A4).) Let $\mathcal{P}$ be the set of death edges. We keep those that connect a prompt and a gold answer:

$$\mathcal{B}=\{(p,a)\in\mathcal{P}\mid l_{p}\neq l_{a}\}. \tag{5}$$

Each such *prompt–answer bridge* is oriented from prompt to answer (swapping indices if needed) and induces a topological direction

$$v^{\text{topo}}_{(p,a)}=Z_{a}-Z_{p}. \tag{6}$$

Compared to using each prompt’s own gold answer or nearest gold neighbor, these bridges arise from a global minimum-spanning-forest structure and capture how prompt and answer clusters connect along the global skeleton of the batch Kruskal ([1956](https://arxiv.org/html/2605.07172#bib.bib18)).
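The bridge extraction described above can be sketched in a few lines of plain Python. This is an illustrative reading of the text, not the authors' Appendix D pseudocode: the function names `death_edges` and `prompt_answer_bridges`, the toy point cloud, and the tie-breaking by index order are all assumptions.

```python
import math
from itertools import combinations


def death_edges(points):
    """Return the 0D persistent-homology death edges of a point cloud.

    Edges are visited in non-decreasing length (Kruskal's order); an edge is
    recorded whenever it merges two previously separate components.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Path halving keeps the Union-Find trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append((i, j, dist))
    return deaths


def prompt_answer_bridges(points, labels):
    """Keep cross-label death edges (Eq. 5), orient them prompt (0) -> answer (1),
    and attach the direction Z_a - Z_p (Eq. 6)."""
    bridges = []
    for i, j, _ in death_edges(points):
        if labels[i] == labels[j]:
            continue
        p, a = (i, j) if labels[i] == 0 else (j, i)
        direction = [za - zp for za, zp in zip(points[a], points[p])]
        bridges.append((p, a, direction))
    return bridges


# Toy batch with B = 2: two prompts (label 0) followed by two gold answers (label 1).
Z = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 0.0]]
labels = [0, 0, 1, 1]
print(prompt_answer_bridges(Z, labels))
# [(0, 2, [0.0, 3.0]), (1, 3, [4.0, 0.0])]
```

In this toy cloud the three death edges are exactly the minimum-spanning-tree edges; the prompt–prompt edge (0, 1) is discarded and the two cross-label edges survive as bridges.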

#### Trajectory Topology Loss\.

For each prompt we define the model-induced semantic trajectory

$$v^{\text{model}}_{i}=h^{\text{model}}_{i}-h^{\text{prompt}}_{i}. \tag{7}$$

We then define TTL as

$$\mathcal{L}_{\text{topo}}=\frac{1}{|\mathcal{B}|}\sum_{(p,a)\in\mathcal{B}}\Big[1-\cos\big(v^{\text{topo}}_{(p,a)},v^{\text{model}}_{p}\big)\Big]. \tag{8}$$

If $\mathcal{B}$ is empty we set $\mathcal{L}_{\text{topo}}=0$. The final SFT objective is

$$\mathcal{L}_{\text{SFT}}=\mathcal{L}_{\text{CE}}+\lambda_{\text{topo}}\mathcal{L}_{\text{topo}}, \tag{9}$$

where $\lambda_{\text{topo}}$ controls the strength of topological regularization. Additional analysis of $\lambda_{\text{topo}}$ and complexity considerations are given in Appendix [E](https://arxiv.org/html/2605.07172#A5).
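To make Eqs. (7) to (9) concrete, here is a small numeric sketch in plain Python. The helper names (`cosine`, `ttl_loss`, `sft_objective`), the bridge representation as `(prompt_index, direction)` pairs, and the toy vectors are assumptions for illustration; a training implementation would compute the same quantities on differentiable tensors.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small floor on the denominator."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def ttl_loss(bridges, h_prompt, h_model):
    """Eq. (8): mean of 1 - cos(v_topo, v_model_p) over bridges; 0 if there are none.

    `bridges` is a list of (p, v_topo), where p is the prompt index of the bridge.
    """
    if not bridges:
        return 0.0
    total = 0.0
    for p, v_topo in bridges:
        v_model = [hm - hp for hm, hp in zip(h_model[p], h_prompt[p])]  # Eq. (7)
        total += 1.0 - cosine(v_topo, v_model)
    return total / len(bridges)


def sft_objective(ce_loss, topo_loss, lambda_topo):
    """Eq. (9): cross-entropy plus the weighted topological term."""
    return ce_loss + lambda_topo * topo_loss


h_prompt = [[0.0, 0.0], [1.0, 0.0]]
h_model = [[0.0, 2.0], [1.0, -1.0]]
bridges = [(0, [0.0, 3.0]), (1, [4.0, 0.0])]
loss = ttl_loss(bridges, h_prompt, h_model)
print(loss)  # 0.5
```

The first trajectory is parallel to its bridge (contribution 0) and the second is orthogonal to it (contribution 1), so the mean is 0.5.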

### 3.3 Topological Preference Optimization (TPO)

TPO augments DPO by aligning hidden\-space improvement directions between rejected and chosen responses with topic\-specific semantic preference vectors\.

#### Offline topic\-aware preference vectors\.

We first construct an offline topic library on HH-RLHF prompts Bai et al. ([2022](https://arxiv.org/html/2605.07172#bib.bib3)). Prompts are embedded with a sentence transformer $\phi$ Reimers and Gurevych ([2019](https://arxiv.org/html/2605.07172#bib.bib26)), clustered with MiniBatch KMeans into $K$ clusters, and each cluster is labeled with a short topic name by a strong LLM. For each topic $t$, we instantiate several positive and negative templates (e.g., “a helpful, harmless, high-quality answer about $t$” vs. “a harmful, unhelpful, low-quality answer about $t$”), encode them with $\phi$, and average the differences $e_{\text{pos}}-e_{\text{neg}}$ to obtain a topic vector $u_{t}\in\mathbb{R}^{d_{s}}$. Thus each preference example $(x,y^{\text{ch}},y^{\text{rj}})$ is associated with a topic $t(x)$ and vector $u_{t(x)}$. Full clustering and prompting details are provided in Appendix [G](https://arxiv.org/html/2605.07172#A7).
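The averaging step for a single topic can be sketched as follows. This assumes the positive and negative templates come in matched pairs, which the text does not state explicitly, and it uses hand-written vectors as stand-ins for the encoder outputs $\phi(\cdot)$; `topic_vector` is a hypothetical helper name.

```python
def topic_vector(pos_embeddings, neg_embeddings):
    """Average e_pos - e_neg over paired template embeddings for one topic."""
    if len(pos_embeddings) != len(neg_embeddings) or not pos_embeddings:
        raise ValueError("need equally many positive and negative embeddings")
    n = len(pos_embeddings)
    dim = len(pos_embeddings[0])
    return [
        sum(pos[k] - neg[k] for pos, neg in zip(pos_embeddings, neg_embeddings)) / n
        for k in range(dim)
    ]


# Stand-ins for encoded templates; a real pipeline would call a sentence encoder here.
e_pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
e_neg = [[0.0, 1.0, 2.0], [0.0, 1.0, 0.0]]
u_t = topic_vector(e_pos, e_neg)
print(u_t)  # [2.0, -1.0, 1.0]
```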

#### Semantic improvement vectors in hidden space\.

During DPO training, for each preference pair we select an intermediate layer $l$ (e.g., $-4$ from the final layer) and compute mean-pooled hidden states $h^{\text{ch}},h^{\text{rj}}\in\mathbb{R}^{d}$ for the chosen and rejected responses. After layer normalization we define the semantic improvement vector

$$\Delta h=\text{LN}(h^{\text{ch}})-\text{LN}(h^{\text{rj}}), \tag{10}$$

which encodes how the hidden representation must change to turn a rejected answer into a chosen one for the same prompt.

#### TPO loss and dynamic weighting\.

Because the sentence-embedding space $\mathbb{R}^{d_{s}}$ and model hidden space $\mathbb{R}^{d}$ are not aligned a priori, we introduce a small trainable projection $P\in\mathbb{R}^{d\times d_{s}}$ and map topic vectors as

$$\bar{u}_{t_{i}}=P\,u_{t_{i}}. \tag{11}$$

For a batch of size $B$, the TPO loss is

$$\mathcal{L}_{\text{TPO}}=\frac{1}{B}\sum_{i=1}^{B}\Big[1-\cos\big(\Delta h_{i},\bar{u}_{t_{i}}\big)\Big]. \tag{12}$$

We combine it with the standard DPO loss as

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{DPO}}+\lambda_{\text{dyn}}\mathcal{L}_{\text{TPO}}, \tag{13}$$

where $\lambda_{\text{dyn}}$ is set by an exponential-moving-average-based scheme that balances the magnitudes of $\mathcal{L}_{\text{DPO}}$ and $\mathcal{L}_{\text{TPO}}$ over training. The exact update formulas and implementation details are given in Appendix [C](https://arxiv.org/html/2605.07172#A3).
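The sketch below works through Eqs. (10) to (13) on a toy example in plain Python. Two parts are assumptions rather than the paper's method: the layer normalization is parameter-free, and the weighting rule $\lambda_{\text{dyn}} = \text{base}\cdot\text{EMA}(\mathcal{L}_{\text{DPO}})/\text{EMA}(\mathcal{L}_{\text{TPO}})$ is only one plausible way to "balance magnitudes", since the actual update formulas live in Appendix C. All names (`tpo_loss`, `EmaBalancer`, `decay`, `base`) are hypothetical.

```python
import math


def layer_norm(h, eps=1e-5):
    """Parameter-free layer normalization of one vector (no learned gain or bias)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def cosine(u, v, eps=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def tpo_loss(h_chosen, h_rejected, topic_vecs, P):
    """Eq. (12): mean of 1 - cos(delta_h_i, P u_i) over the batch."""
    total = 0.0
    for h_ch, h_rj, u in zip(h_chosen, h_rejected, topic_vecs):
        delta_h = [a - b for a, b in zip(layer_norm(h_ch), layer_norm(h_rj))]  # Eq. (10)
        u_bar = [sum(p * x for p, x in zip(row, u)) for row in P]              # Eq. (11)
        total += 1.0 - cosine(delta_h, u_bar)
    return total / len(h_chosen)


class EmaBalancer:
    """ASSUMED scheme: lambda_dyn = base * EMA(L_DPO) / EMA(L_TPO)."""

    def __init__(self, base=1.0, decay=0.9, eps=1e-8):
        self.base, self.decay, self.eps = base, decay, eps
        self.ema_dpo = None
        self.ema_tpo = None

    def update(self, dpo_loss, tpo_loss_value):
        if self.ema_dpo is None:  # first step: initialise with the observed losses
            self.ema_dpo, self.ema_tpo = dpo_loss, tpo_loss_value
        else:
            self.ema_dpo = self.decay * self.ema_dpo + (1 - self.decay) * dpo_loss
            self.ema_tpo = self.decay * self.ema_tpo + (1 - self.decay) * tpo_loss_value
        return self.base * self.ema_dpo / max(self.ema_tpo, self.eps)


# Toy batch: d = 3 hidden units, d_s = 2 sentence-embedding dimensions.
P = [[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # trainable in practice, fixed here
h_chosen, h_rejected = [[1.0, 2.0, 3.0]], [[3.0, 2.0, 1.0]]
aligned = tpo_loss(h_chosen, h_rejected, [[1.0, 0.0]], P)   # close to 0
opposed = tpo_loss(h_chosen, h_rejected, [[-1.0, 0.0]], P)  # close to 2

balancer = EmaBalancer()
lambda_dyn = balancer.update(dpo_loss=0.6, tpo_loss_value=0.3)  # first step: 0.6 / 0.3
total = 0.6 + lambda_dyn * 0.3                                   # Eq. (13)
```

Under this assumed rule the weighted TPO term has roughly the same magnitude as the DPO term; in a gradient-based setting one would typically treat $\lambda_{\text{dyn}}$ as a constant with no gradient flowing through it.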

### 3.4 Fully Topological TPO Variant

Finally, inspired by TTL, we also explore a fully topological variant of TPO. Instead of using simple vector differences $\Delta h_{i}$, we construct a mixed point cloud of chosen and rejected embeddings for a batch, run 0D persistent homology on this cloud, and obtain cross-label bridges whose directions $v^{\text{imp}}$ describe how rejected representations connect to chosen ones along the global batch structure. We then align these bridge directions with the corresponding topic vectors using a cosine loss, analogous to TPO. This variant and its pseudocode are described in detail in Appendix [I](https://arxiv.org/html/2605.07172#A9); in the main experiments we use the lighter-weight vector-difference TPO as our default, and report fully topological results as an ablation.

## 4 Experiments

### 4.1 Experimental Setup

#### Base model and implementation.

We use Qwen2.5-7B-Instruct as the base LLM. All experiments are implemented in PyTorch using the Hugging Face Transformers ecosystem. For SFT, we apply LoRA with rank $r=16$ and target modules including attention and MLP projections. For DPO/TPO, we either fine-tune all parameters or continue tuning LoRA adapters, depending on compute (see Appendix [C](https://arxiv.org/html/2605.07172#A3)).

We train on NVIDIA A100 GPUs with mixed precision (bfloat16) and ZeRO-optimized parameter sharding where needed. Persistent homology computations are performed on CPU using a custom Union–Find implementation Tarjan ([1975](https://arxiv.org/html/2605.07172#bib.bib29)) and PyTorch’s `torch.cdist` to compute pairwise distances.

#### Datasets\.

For SFT, we use the UltraChat dataset Ding et al. ([2023](https://arxiv.org/html/2605.07172#bib.bib11)), focusing on the instruction-following split (`train_sft`) provided by Hugging Face. Each example contains a multi-turn conversation; we keep samples where the last turn is from the assistant and use the full conversation as input, with the last assistant message as target. We apply the Qwen chat template and compute labels such that prompt tokens are masked with $-100$.
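The label masking mentioned above can be illustrated with a tiny sketch. The token ids are made up and `build_labels` is a hypothetical helper; the value −100 is the one stated in the text and matches the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # value stated in the text; default ignore_index of PyTorch cross-entropy


def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer ids; supervise only the answer tokens."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Made-up token ids: three prompt tokens followed by two answer tokens.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]
```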

For RLHF/DPO, we use the Anthropic HH-RLHF dataset Bai et al. ([2022](https://arxiv.org/html/2605.07172#bib.bib3)), which consists of prompts and pairs of chosen vs. rejected responses annotated by human labelers. We normalize the formatting to `<prompt>\n\nAssistant:` followed by the answer, following prior work.

#### Evaluation protocol\.

We evaluate alignment quality along several axes:

- Reward-model score (RM) ↑: reward-model-predicted preference score on held-out data.
- Pairwise win-rate ↑: fraction of prompts where our model’s answer is preferred to a baseline model’s answer by a reward model or LLM-judge.
- Helpfulness / Harmlessness ↑: approximate dimensions evaluated with either fine-tuned classifiers or LLM-based rubrics.
- Toxicity ↓: estimated toxicity using an off-the-shelf classifier (e.g., Detoxify Gehman et al. ([2020](https://arxiv.org/html/2605.07172#bib.bib14))) on model outputs.

Unless otherwise stated, reward-model scores are computed with a fixed open-source reward model fine-tuned on human preference data. For LLM-judge-based pairwise evaluations (e.g., win-rate against a baseline), we use a deterministic comparison prompt that asks the judge to select the more helpful and harmless answer given the same user request; the full prompt template is provided in Appendix [C](https://arxiv.org/html/2605.07172#A3). To mitigate positional bias, we randomly swap the order of the two candidate answers and average over both permutations.
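One way to read the position-swapping procedure is sketched below: each pair is judged in both orders and the two verdicts are averaged. The `judge` callable, its `"A"`/`"B"` return convention, and the deliberately biased `toy_judge` are all hypothetical; the paper's actual judge prompt is in Appendix C.

```python
def debiased_win_rate(prompts, ours, baseline, judge):
    """Average the judge's verdict over both presentation orders.

    `judge(prompt, answer_a, answer_b)` must return "A" or "B".
    """
    wins = 0.0
    for prompt, a_ours, a_base in zip(prompts, ours, baseline):
        first = judge(prompt, a_ours, a_base) == "A"   # our answer shown first
        second = judge(prompt, a_base, a_ours) == "B"  # our answer shown second
        wins += (first + second) / 2
    return wins / len(prompts)


def toy_judge(prompt, answer_a, answer_b):
    """Prefers the longer answer, but picks position A on ties (a positional bias)."""
    return "A" if len(answer_a) >= len(answer_b) else "B"


prompts = ["q1", "q2"]
ours = ["a long answer", "same"]
baseline = ["short", "tied"]
print(debiased_win_rate(prompts, ours, baseline, toy_judge))  # 0.75
```

On the tied second prompt the biased judge favors whichever answer comes first, so averaging both orders yields 0.5 for that prompt instead of a spurious full win.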

For all win-rate metrics and average reward scores, we estimate 95% confidence intervals via bootstrap resampling over prompts (typically 1,000 bootstrap samples). Unless otherwise noted, improvements reported in Tables [1](https://arxiv.org/html/2605.07172#S4.T1)–[8](https://arxiv.org/html/2605.07172#S4.T8) are statistically significant at $p<0.05$ under this procedure. We sample outputs using greedy or nucleus sampling (see Appendix [C](https://arxiv.org/html/2605.07172#A3)) and use consistent generation settings across models.
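A percentile bootstrap over prompts, as described above, might look like the following sketch. The percentile method, the fixed seed, and the synthetic win indicators are assumptions; the paper does not specify which bootstrap interval it uses.

```python
import random


def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap over items."""
    rng = random.Random(seed)
    n = len(values)
    # Resample prompts with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(values) / n, lower, upper


# Synthetic per-prompt outcomes: 1 = win against the baseline, 0 = loss.
wins = [1] * 70 + [0] * 30
mean, lower, upper = bootstrap_ci(wins)
print(f"win-rate {mean:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

If the interval for a win-rate excludes 0.5, the difference from the baseline would be considered significant at the corresponding level under this procedure.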

### 4.2 Main Results: SFT with Trajectory Topology Loss

Table [1](https://arxiv.org/html/2605.07172#S4.T1) reports the main SFT results on UltraChat. We compare the base SFT model and a TTL-enhanced model with $\lambda_{\text{topo}}$ set to a moderate value.

Table 1: SFT results on UltraChat. RM: reward model score; Win: win-rate vs. Base SFT; IFEval: strict prompt-level accuracy; Tox.: toxicity.

TTL consistently improves RM and win-rate, with typical gains of 3–4 points. IFEval scores (as a proxy for instruction following) also increase, suggesting that TTL encourages trajectories that lead to more informative and user-aligned answers. Toxicity either remains stable or slightly decreases, indicating that TTL does not introduce obvious safety regressions.

We also evaluate UltraChat-SFT models on HH-RLHF prompts in a zero-shot setting (Table [2](https://arxiv.org/html/2605.07172#S4.T2)) to measure cross-dataset alignment generalization.

Table 2: UltraChat-SFT models evaluated on HH-RLHF prompts (zero-shot alignment generalization).

TTL-trained models achieve higher RM and helpfulness scores on HH-style prompts, suggesting that topologically regularized trajectories capture more transferable alignment behavior than pure likelihood training.

### 4.3 Main Results: DPO with Topological Preference Optimization

Table [3](https://arxiv.org/html/2605.07172#S4.T3) summarizes results on HH-RLHF for DPO, TPO, and the fully topological Topo-TPO variant.

Table 3: DPO and topology-enhanced variants on HH-RLHF. R-Bench: RewardBench score; Alpaca: AlpacaEval 2.0 win rate; Harm.: harmlessness rate.

Across metrics, TPO consistently outperforms DPO Rafailov et al. ([2024](https://arxiv.org/html/2605.07172#bib.bib24)): preference win-rates improve by 2–3 percentage points, and both helpfulness and harmlessness are higher. Topo-TPO yields slightly better harmlessness, suggesting that leveraging global batch structure at the DPO stage can further sharpen safety-related improvements.

### 4\.4Cross\-Backbone Generalization

To test whether the gains are specific to one model family, we additionally run TTL and TPO on Llama-3-8B-Instruct using the same datasets and evaluation protocol. Tables [4](https://arxiv.org/html/2605.07172#S4.T4) and [5](https://arxiv.org/html/2605.07172#S4.T5) show consistent improvements over the corresponding non-topological baselines on both backbones. This supports the claim that the proposed objectives depend on hidden-state geometry rather than on Qwen-specific architectural details.

Table 4: Cross-backbone SFT on UltraChat. IF denotes instruction following.

Table 5: Cross-backbone DPO on HH-RLHF. RB: RewardBench; AE2: AlpacaEval 2.0; MT: MT-Bench.

### 4.5 Ablation: Effect of Trajectory Topology Loss

![Refer to caption](https://arxiv.org/html/2605.07172v1/figs/compare.png)

Figure 3: Distribution of cosine similarities between model trajectories and topological bridges on UltraChat. TTL (orange) shows a distinct shift toward higher alignment compared to Base SFT (blue).

![Refer to caption](https://arxiv.org/html/2605.07172v1/figs/vis.png)

Figure 4: 2D projection of hidden-space trajectories illustrating the structural regularization effect of topology-enhanced training.

#### Topology vs. Baselines.

We isolate the impact of TTL by comparing it against four variants: (1) No TTL (pure CE); (2) Random Pair (random gold answer targets); (3) All Pairs (per-example alignment without PH); and (4) kNN Bridge (nearest gold-neighbor alignment).

Table [6](https://arxiv.org/html/2605.07172#S4.T6) shows that PH Bridge (ours) outperforms all baselines, including the purely geometric kNN Bridge. This suggests that the global connectivity structure captured by persistent homology yields more informative cross-manifold directions than local or random pairings. Further analysis on trajectory alignment is provided in Appendix [F](https://arxiv.org/html/2605.07172#A6).

Table 6: Ablation of Trajectory Topology Loss on UltraChat.

#### Sensitivity to $\lambda_{\text{topo}}$.

Table [7](https://arxiv.org/html/2605.07172#S4.T7) presents the impact of the topology loss weight. Moderate values ($\lambda \approx 0.2$) yield the largest gains, whereas excessive regularization ($\lambda \geq 0.4$) risks overfitting topological constraints at the expense of perplexity.

Table 7: Sensitivity to topology loss weight $\lambda_{\text{topo}}$ on UltraChat.

### 4.6 Ablation: Effect of TPO and Design Choices

We next ablate TPO components on HH\-RLHF\.

#### TPO vs\. simple cosine regularization\.

We compare:

- DPO: no TPO.
- + Global Cosine: use a single, hand-crafted global preference vector $u_{\text{global}}$ for all examples and align $\Delta h_i$ with it.
- + Learned Global Vec.: learn a single global preference direction $w$ in sentence-embedding space from chosen vs. rejected pairs, project it into the model hidden space, and align all $\Delta h_i$ with this direction.
- + TPO (no dyn): topic-aware TPO with a fixed weight $\lambda$.
- + TPO (ours): topic-aware TPO with EMA-based dynamic weighting.

Table 8: Ablation of TPO variants on HH-RLHF. We report RewardBench, AlpacaEval win rate, and harmlessness.

A single global preference vector, whether hand-crafted or learned, yields only minor gains over DPO and is outperformed by topic-aware TPO. Our EMA-based dynamic weighting further stabilizes training and yields the best overall results. We additionally analyze how TPO changes the alignment between hidden-space improvement vectors and topic preference directions, and how this relates to per-topic reward and helpfulness gains; see Appendix [F](https://arxiv.org/html/2605.07172#A6).

#### Topic\-aware vs\. topic\-agnostic\.

We explicitly compare topic-aware TPO to using a single global preference vector in Table [9](https://arxiv.org/html/2605.07172#S4.T9).

Table 9: Topic-aware vs. topic-agnostic preference vectors on HH-RLHF.

Topic-aware TPO consistently outperforms a single global preference vector, especially on topics where safety or specificity is critical (e.g., medical advice, legal questions). Additional ablations on hidden-layer choice and number of clusters $K$ are reported in Appendix [G](https://arxiv.org/html/2605.07172#A7), and efficiency results in Appendix [H](https://arxiv.org/html/2605.07172#A8).

### 4.7 Combined Effect of TTL and TPO

Table 10: End-to-end alignment pipeline on HH-RLHF, combining topology-enhanced SFT (TTL) and topology-enhanced DPO (TPO).

So far we have evaluated Trajectory Topology Loss (TTL) and Topological Preference Optimization (TPO) mostly in isolation, at the SFT and DPO stages respectively. To assess whether the two stages are complementary in a realistic alignment pipeline, we consider three variants on HH-RLHF: (i) a model trained with SFT only (without TTL) followed by DPO; (ii) a model initialized from an UltraChat SFT checkpoint trained with TTL, then further tuned with DPO; and (iii) our full pipeline that combines TTL at SFT time and TPO at DPO time.

The results indicate that the two regularizers are complementary rather than antagonistic. TTL improves the SFT initialization by steering prompt $\rightarrow$ answer trajectories before preference optimization starts, whereas TPO shapes rejected $\rightarrow$ chosen improvement directions during DPO. Empirically, initializing DPO from a TTL-trained model already improves over plain SFT+DPO, and adding TPO on top yields a further gain on all reported metrics. This stage-wise behavior differs from a KL penalty that constrains an RL policy toward a reference model at every optimization step: here the interaction is mediated through the representation geometry of the initialization and the subsequent preference updates.

### 4.8 Qualitative Analysis

We qualitatively inspect generations from baseline and topology-enhanced models on diverse prompts (e.g., coding help, ethical advice, creative writing). TTL-trained models tend to produce answers that are more on-topic and structurally closer to gold responses, while TPO-trained models avoid overly safe but unhelpful answers and instead strike a better balance between usefulness and caution. Figure [4](https://arxiv.org/html/2605.07172#S4.F4) visualizes hidden-space trajectories via a 2D projection for a small set of prompts and answers: in TTL-trained models, prompt-to-answer trajectories align along a narrower manifold closer to the gold-answer cluster, and in TPO-trained models, rejected-to-chosen improvement vectors align better with the topic preference directions.

## 5 Conclusion

We introduced a topology\-enhanced alignment framework integrating Trajectory Topology Loss for SFT and Topological Preference Optimization for DPO\. By leveraging 0D persistent homology to regularize hidden\-space trajectories, our approach consistently outperforms baselines on UltraChat and HH\-RLHF\. These results validate the utility of simple topological signals in shaping LLM behavior, opening new avenues for geometric consistency and interpretability in model alignment\.

## Limitations

Our approach focuses on 0D persistent homology for computational tractability; higher-dimensional features are not explored. We evaluate on two backbones of similar scale (Qwen2.5-7B-Instruct and Llama-3-8B-Instruct) and two English datasets; results may not directly transfer to multilingual or domain-specific settings. Our topic extraction pipeline relies on an LLM for labeling, which may introduce biases. Finally, while TTL and TPO improve several alignment metrics, they do not guarantee absence of harmful behaviors and should be combined with broader safety assessments. The quadratic cost of full pairwise distances is modest in the alignment micro-batch regime we study, but it becomes a more visible bottleneck for substantially larger batches; scaling will likely require sparsified graphs, low-dimensional projections, or landmark-based approximations.

## Ethics Statement

Aligning LLMs with human preferences is both an opportunity and a risk\. On the positive side, our methods aim to increase helpfulness and safety by constraining semantic trajectories toward desirable regions of representation space\. On the negative side, aligning to any given dataset of preferences can amplify existing biases and blind spots\. We stress that topology\-enhanced objectives should be deployed only after thorough evaluation on fairness, robustness, and domain shift, and ideally in conjunction with human oversight and red teaming\.

## Acknowledgments

We thank the action editor, area chairs, senior area chairs, and reviewers for their helpful feedback\. This research was supported in part by the Shanghai Agricultural Science and Technology Project \(grant number T20252016\), the Shanghai Science and Technology Project \(grant number 24YF2716900\), and the Chen Guang Project of the Shanghai Municipal Education Commission \(grant number 24CG54\)\.

## References

- Achille and Soatto (2018) Alessandro Achille and Stefano Soatto. 2018. Emergence of invariance and disentanglement in deep representations. *Journal of Machine Learning Research*, 19(50):1–54.
- Adams et al. (2015) Henry Adams, Trey Emerson, Martin Kirby, Christopher J. Neville, Chris Peterson, Patrick Shipman, Svetlana Chepushtanova, Mariah Hanson, Fabio Motta, and Leonard Ziegelmeier. 2015. Persistence images: A stable vector representation of persistent homology. *Journal of Machine Learning Research*, 18(8):1–35.
- Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, and 12 others. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. *arXiv preprint arXiv:2204.05862*.
- Ballester et al. (2024) Rubén Ballester, Carles Casacuberta, and Sergio Escalera. 2024. Topological data analysis for neural network analysis: A comprehensive survey. *arXiv preprint arXiv:2312.05840*.
- Bau et al. (2017) David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network dissection: Quantifying interpretability of deep visual representations. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, pages 3319–3327.
- Bradley and Terry (1952) Ralph Allan Bradley and Milton E. Terry. 1952. The rank analysis of incomplete block designs: The method of paired comparisons. *Biometrika*, 39:324–345.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. Language models are few-shot learners. *arXiv preprint arXiv:2005.14165*.
- Bubenik (2015) Peter Bubenik. 2015. Statistical topological data analysis using persistence landscapes. *Journal of Machine Learning Research*, 16(1):77–102.
- Carlsson (2009) Gunnar Carlsson. 2009. Topology and data. *Bulletin of the American Mathematical Society*, 46(2):255–308.
- Christiano et al. (2017) Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. *arXiv preprint arXiv:1706.03741*.
- Ding et al. (2023) Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 3029–3051.
- Edelsbrunner and Harer (2010) Herbert Edelsbrunner and John Harer. 2010. *Computational topology: An introduction*. American Mathematical Society.
- Ethayarajh (2019) Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In *Proceedings of EMNLP-IJCNLP*, pages 55–65.
- Gehman et al. (2020) Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 3356–3369.
- Ghrist (2008) Robert Ghrist. 2008. Barcodes: the persistent topology of data. *Bulletin of the American Mathematical Society*, 45(1):61–75.
- Hendrycks and Gimpel (2017) Dan Hendrycks and Kevin Gimpel. 2017. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In *International Conference on Learning Representations*.
- Hofer et al. (2019) Christoph Hofer and 1 others. 2019. Deep learning with topological signatures. In *NeurIPS*.
- Kruskal (1956) Joseph Kruskal. 1956. On the shortest spanning subtree of a graph. *Proc. AMS*.
- Lee et al. (2018) Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. 2018. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In *Advances in Neural Information Processing Systems*, volume 31, pages 7167–7177.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In *Proceedings of the ICLR 2013 Workshop on Representation Learning for NLP*.
- Ortiz-Jiménez et al. (2020) Guillermo Ortiz-Jiménez, Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. 2020. Neural anisotropy directions. *arXiv preprint arXiv:2006.09717*.
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In *Advances in Neural Information Processing Systems*, volume 35, pages 27730–27744.
- Plackett (1975) Robin L. Plackett. 1975. The analysis of permutations. *Journal of the Royal Statistical Society. Series C (Applied Statistics)*, 24(2):193–202.
- Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. *arXiv preprint arXiv:2305.18290*.
- Raghu et al. (2017) Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. 2017. On the expressive power of neural networks. In *Proceedings of the 34th International Conference on Machine Learning*, pages 2847–2854.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In *Proceedings of EMNLP-IJCNLP*, pages 3982–3992.
- Rieck et al. (2019) K. Rieck, H. Leitte, C. Höfer, and H. Wagner. 2019. Neural persistence: A complexity measure for deep neural networks using algebraic topology. In *International Conference on Learning Representations*.
- Stiennon et al. (2020) Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. In *Advances in Neural Information Processing Systems*, volume 33, pages 3008–3021.
- Tarjan (1975) Robert Endre Tarjan. 1975. Efficiency of a good but not linear set union algorithm. *Journal of the ACM*, 22(2):215–225.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In *Advances in Neural Information Processing Systems*, volume 30, pages 5998–6008.

## Appendix A Reproducibility and Resources

We train all models with fixed random seeds. For each setting (SFT and DPO/TPO), we report results from a single run due to compute constraints, but we found qualitatively similar trends across smaller pilot runs. All hyperparameters (learning rates, batch sizes, LoRA ranks, and topology-related coefficients) are documented in Appendix [C](https://arxiv.org/html/2605.07172#A3). We also provide evaluation scripts for RewardBench, AlpacaEval, and MT-Bench, including the exact prompts used for LLM-judge comparisons, to facilitate end-to-end replication of our results.

## Appendix B Discussion

Our work highlights several conceptual points that complement the main empirical findings\.

#### Trajectory\-centric perspective\.

Traditional alignment objectives focus on token-level likelihoods or scalar rewards at the sequence level. By focusing on *trajectories* in hidden space (prompt $\rightarrow$ answer and rejected $\rightarrow$ chosen), we gain a structured view of how the model moves from inputs to outputs, which can be regularized directly. TTL and TPO show that constraining these directions can improve preference alignment without changing the base architecture.

#### Topological signals as global structure\.

Even in 0D, persistent homology captures multi\-scale connectivity patterns that are not apparent from local distances alone\. By using death edges as bridges between prompt and answer manifolds, we extract a sparse, global skeleton of the batch that informs which directions lead from prompt regions into regions densely populated by gold answers or chosen responses\. Our ablations indicate that these topology\-derived directions outperform both random pairings and naive per\-example directions\.

#### Topic\-aware semantic directions\.

TPO demonstrates that topic\-aware preference vectors provide practical, interpretable priors on how representations should move when improving responses\. Compared to a single global preference vector, topic\-specific vectors better capture differences between, for example, technical questions and ethical dilemmas, leading to improved helpfulness/harmlessness trade\-offs\.

#### Connection to minimum spanning trees\.

Our use of 0D persistent homology is closely related to classical graph algorithms: the set of death edges produced by the Union–Find procedure is equivalent to the edge set of a minimum spanning forest on the batch point cloud\. From this perspective, TTL can be viewed as encouraging prompt\-to\-answer trajectories to align with a sparse global skeleton that minimally connects prompt and answer clusters, rather than with arbitrary local directions\. This connection suggests potential extensions based on other graph\- or manifold\-regularization objectives that operate on the same underlying skeleton\.

#### Limitations and future directions\.

We currently restrict ourselves to 0D homology and simple linear preference operators for computational tractability\. Exploring higher\-dimensional topology \(e\.g\., loops corresponding to ambiguity or multi\-modal answers\), richer semantic operators, and extensions to multilingual or domain\-specific models are promising directions for future work\. An additional direction is to pair the same trajectory\-based regularization ideas with other preference\-optimization algorithms, such as PPO or GRPO: TTL is agnostic to the downstream optimizer, and TPO\-style directional constraints could in principle be applied to policy updates generated by those algorithms, provided that their own trust\-region or policy\-constraint mechanisms are preserved\. Moreover, topological regularization should be combined with broader safety evaluations and human oversight, as discussed in the main Limitations section\.

## Appendix C Implementation Details

We fine-tune the Qwen2.5-7B-Instruct model using LoRA. For both SFT and DPO, we set the LoRA rank $r=16$ and LoRA alpha $\alpha=32$, and apply adapters to q\_proj, k\_proj, v\_proj, o\_proj, gate\_proj, up\_proj, and down\_proj. We use the AdamW optimizer with $\beta_1=0.9$ and $\beta_2=0.95$. The learning rate is $2\times 10^{-5}$ for SFT and $5\times 10^{-6}$ for DPO, with a cosine decay schedule and 3% warmup steps. Training is performed on 8 NVIDIA A100-80GB GPUs. Global batch size is set to 128 via gradient accumulation. For TPO, the dynamic weighting parameters (distinct from the LoRA alpha) are set to $\alpha=0.5$ and $\epsilon=1\times 10^{-6}$. We use $K=50$ clusters for topic extraction on HH-RLHF. Generation uses temperature $0.7$ and top-$p=0.9$ unless otherwise stated.
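As a rough illustration of the learning-rate schedule described above, the following minimal sketch computes a cosine decay with a 3% warmup. The linear warmup shape, the decay to zero, and the function name `lr_at` are assumptions made for illustration; the text only states "cosine decay schedule and 3% warmup steps".

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_frac=0.03):
    """Learning rate at a 0-indexed optimizer step.

    Assumptions (not stated in the text): warmup is linear, and the
    cosine phase decays from peak_lr to zero.
    """
    warmup = max(1, round(warmup_frac * total_steps))
    if step < warmup:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# SFT uses a peak of 2e-5 and DPO a peak of 5e-6 according to the text.
sft_schedule = [lr_at(s, 1000, 2e-5) for s in range(1000)]
```

With 1000 total steps this gives 30 warmup steps, after which the rate decays smoothly toward zero.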

#### EMA\-based dynamic weighting for TPO\.

Let $\ell_{\text{DPO}}^{(t)}$ and $\ell_{\text{TPO}}^{(t)}$ be the micro-batch DPO and TPO losses at step $t$. We maintain exponential moving averages

$$\hat{\ell}_{\text{DPO}}^{(t)} = \gamma\,\hat{\ell}_{\text{DPO}}^{(t-1)} + (1-\gamma)\,\ell_{\text{DPO}}^{(t)}, \tag{14}$$

$$\hat{\ell}_{\text{TPO}}^{(t)} = \gamma\,\hat{\ell}_{\text{TPO}}^{(t-1)} + (1-\gamma)\,\ell_{\text{TPO}}^{(t)}, \tag{15}$$

with decay $\gamma \in [0,1)$. After a short warmup, we set

$$r^{(t)} = \frac{|\hat{\ell}_{\text{DPO}}^{(t)}| + \epsilon}{|\hat{\ell}_{\text{TPO}}^{(t)}| + \epsilon}, \qquad \lambda_{\text{dyn}}^{(t)} = \alpha \cdot \tanh\big(r^{(t)}\big), \tag{16}$$

where $\alpha$ is a base coefficient and $\epsilon$ a small constant. In our experiments we use $\gamma=0.95$, $\alpha=0.5$, and $\epsilon=10^{-6}$.
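The update in Eqs. (14)–(16) can be sketched in a few lines of plain Python. The class name `DynamicTPOWeight`, the initialization of the moving averages with the first observed losses, and the behavior during warmup (returning the base coefficient) are assumptions for illustration; the text specifies only the EMA recursion, the ratio, and the tanh squashing.

```python
import math


class DynamicTPOWeight:
    """EMA-based weight: lambda_dyn = alpha * tanh(r), with
    r = (|ema_dpo| + eps) / (|ema_tpo| + eps)."""

    def __init__(self, gamma=0.95, alpha=0.5, eps=1e-6, warmup_steps=10):
        self.gamma, self.alpha, self.eps = gamma, alpha, eps
        self.warmup_steps = warmup_steps  # assumption: length of the "short warmup"
        self.ema_dpo = None
        self.ema_tpo = None
        self.step = 0

    def update(self, loss_dpo, loss_tpo):
        """Consume detached scalar losses and return the current weight."""
        if self.ema_dpo is None:
            # Assumption: initialise both EMAs with the first observation.
            self.ema_dpo, self.ema_tpo = loss_dpo, loss_tpo
        else:
            g = self.gamma
            self.ema_dpo = g * self.ema_dpo + (1.0 - g) * loss_dpo
            self.ema_tpo = g * self.ema_tpo + (1.0 - g) * loss_tpo
        self.step += 1
        if self.step <= self.warmup_steps:
            # Assumption: use the fixed base coefficient during warmup.
            return self.alpha
        r = (abs(self.ema_dpo) + self.eps) / (abs(self.ema_tpo) + self.eps)
        return self.alpha * math.tanh(r)
```

Because tanh is bounded by 1, the returned weight never exceeds the base coefficient, so the TPO term cannot dominate even when its loss becomes very small relative to the DPO loss.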

#### Persistent homology implementation\.

We compute pairwise distances with torch.cdist in bfloat16, transfer the resulting matrix to CPU, and run a custom Union–Find implementation. We detach gradients from $Z$ before computing the distance matrix so that topology extraction does not backpropagate through distances.

## Appendix D 0D Persistent Homology Algorithm

Given a point cloud $Z=\{z_i\}_{i=1}^{N}$ with distances $D_{ij}=\|z_i-z_j\|_2$, we consider all unordered pairs $(i,j)$ as edges with weights $D_{ij}$ and sort them in non-decreasing order. We maintain a disjoint-set (Union–Find) structure over vertices $\{1,\dots,N\}$ and process edges in order:

1. Initialize UF so that each vertex is its own component.
2. Sort all edges $\mathcal{E}=\{(i,j)\mid i<j\}$ by $D_{ij}$.
3. Initialize an empty list $\mathcal{P}$ of death edges.
4. For each $(i,j)\in\mathcal{E}$ in order:
   - If $\text{UF.find}(i)\neq\text{UF.find}(j)$, append $(i,j)$ to $\mathcal{P}$ and call $\text{UF.union}(i,j)$.

Each recorded edge $(i,j)\in\mathcal{P}$ corresponds to a merge event where two previously disconnected components become connected; its weight is the *death time* of the younger component. The set $\mathcal{P}$ is equivalent to the edge set of a minimum spanning forest and is used to select cross-label bridges in both TTL and the fully topological TPO variant.
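The procedure above is essentially Kruskal's algorithm, and a minimal sketch in plain Python follows. The function names `death_edges` and `cross_label_bridges`, the 0-indexed vertices, and the path-halving variant of Union–Find are illustrative choices, not the paper's custom implementation (which operates on a `torch.cdist` matrix moved to CPU).

```python
import math
from itertools import combinations


def death_edges(points):
    """Return the 0D merge (death) edges as (weight, i, j) tuples."""
    n = len(points)
    parent = list(range(n))  # each vertex starts as its own component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # All unordered pairs, sorted by Euclidean distance (ties by index).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components
            parent[ri] = rj
            deaths.append((w, i, j))
    return deaths


def cross_label_bridges(deaths, labels):
    """Keep only death edges whose endpoints carry different labels."""
    return [(w, i, j) for w, i, j in deaths if labels[i] != labels[j]]


# Toy cloud: two "prompt" points (label 0) and two "answer" points (label 1).
pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
labels = [0, 0, 1, 1]
bridges = cross_label_bridges(death_edges(pts), labels)
```

On this toy cloud the three death edges are the two within-cluster edges of length 1 and the single edge of length 4 between points 1 and 2; only the last one connects different labels, so it is the sole bridge.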

## Appendix E Additional TTL Ablations and Complexity

### E.1 TPO Variants and Hyperparameter Sensitivity

#### Vector TPO vs\. fully topological TPO\.

Besides the vector-difference formulation of TPO, we also evaluate the fully topological variant Topo-TPO introduced in Section [3.4](https://arxiv.org/html/2605.07172#S3.SS4). Table [11](https://arxiv.org/html/2605.07172#A5.T11) shows that Topo-TPO yields slightly higher reward-model scores, win-rates, and harmlessness than the vector version, indicating that leveraging global batch structure at the DPO stage can further sharpen safety-related improvements.

Table 11: Vector-difference TPO vs. fully topological TPO on HH-RLHF.

#### Hidden-layer choice.

We also vary the hidden layer $l$ from which we extract representations for TPO (Table [12](https://arxiv.org/html/2605.07172#A5.T12)). Intermediate layers ($-2$ to $-4$ from the final layer) work best, while very early layers and the final layer are slightly worse, suggesting that TPO benefits from representations that are already task-aware but not yet dominated by token-level logits.

Table 12: Effect of hidden layer choice for TPO on HH-RLHF.

We additionally study the effect of the number of clusters $K$ in the offline topic extraction and the training-time overhead of TTL/TPO (tokens per second and memory). Moderate $K$ (around 50) works best, and TTL/TPO introduce a modest 5–10% slowdown. Full results are reported in Appendices [G](https://arxiv.org/html/2605.07172#A7) and [H](https://arxiv.org/html/2605.07172#A8).

#### Effect of $\lambda_{\text{topo}}$.

Table [7](https://arxiv.org/html/2605.07172#S4.T7) in the main text reports a sweep over the TTL weight on UltraChat. Small to moderate $\lambda_{\text{topo}}$ values lead to monotonic or near-monotonic gains; very large values can slightly hurt perplexity and cause the model to overfit topological constraints.

#### Complexity\.

For a batch of size $B$, computing all pairwise distances is $O(B^2 d)$ and sorting edges is $O(B^2 \log B)$. With $B \leq 32$ and $d \approx 4096$, this overhead is negligible compared to a forward/backward pass through a 7B model (we observe $\approx 5\%$ slower training for TTL). We further reduce overhead by applying a low-dimensional projection to $Z$ before computing distances.

## Appendix F Additional Trajectory and Topology Analyses

### F.1 Alignment of Prompt–Answer Trajectories with Topological Bridges

To more directly validate that TTL shapes hidden\-space trajectories in the intended way, we measure the cosine similarity between model\-induced update directions and topological bridge directions\.

For a held-out UltraChat validation split, we compute for each example the model trajectory vector $v_i^{\text{model}} = h_i^{\text{model}} - h_i^{\text{prompt}}$ and, when available, the corresponding topological bridge direction $v^{\text{topo}}_{(p_i,a_i)}$. We then form the cosine similarity

$$\rho_i = \cos\big(v_i^{\text{model}},\; v^{\text{topo}}_{(p_i,a_i)}\big), \tag{17}$$

and compare the empirical distribution of $\{\rho_i\}$ between the base SFT model and the SFT+TTL model. These distributions show how TTL increases the concentration of cosine similarities near 1.0, indicating that prompt-to-answer trajectories are more strongly aligned with topologically derived bridges.
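A minimal sketch of this measurement, using plain Python lists in place of hidden-state tensors, is given below. The function names and the dictionary layout for bridges (example index mapped to a bridge direction, with missing keys meaning no bridge is available) are assumptions for illustration.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def trajectory_bridge_cosines(h_prompt, h_model, bridges):
    """Compute rho_i for every example that has a topological bridge.

    h_prompt, h_model: lists of pooled hidden vectors, one per example.
    bridges: dict mapping example index -> bridge direction (hypothetical
    layout); examples without a bridge are skipped.
    """
    rhos = {}
    for i, v_topo in bridges.items():
        v_model = [m - p for m, p in zip(h_model[i], h_prompt[i])]
        rhos[i] = cosine(v_model, v_topo)
    return rhos


# Toy example: example 0 moves along its bridge, example 1 orthogonally to it.
h_prompt = [[0.0, 0.0], [1.0, 1.0]]
h_model = [[1.0, 0.0], [1.0, 3.0]]
rhos = trajectory_bridge_cosines(h_prompt, h_model, {0: [2.0, 0.0], 1: [1.0, 0.0]})
```

Collecting these values for the base SFT model and for the SFT+TTL model on the same examples would give the two distributions being compared.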

### F.2 Alignment of Improvement Vectors with Topic Preference Directions

We perform a similar analysis for TPO. For each HH-RLHF preference pair $(x_i, y_i^{\text{ch}}, y_i^{\text{rj}})$ and selected layer $l$, we compute the normalized improvement vector

$$\Delta h_i = \text{LN}\big(h_i^{\text{ch}}\big) - \text{LN}\big(h_i^{\text{rj}}\big), \tag{18}$$

and the projected topic preference vector $\bar{u}_{t_i} = P u_{t_i}$. We then compute the cosine similarity

$$\sigma_i = \cos\big(\Delta h_i,\; \bar{u}_{t_i}\big), \tag{19}$$

and compare the distributions between a pure DPO model and a DPO+TPO model.
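Eqs. (18) and (19) can be illustrated with a small self-contained sketch. It assumes that LN is a layer normalization without learned scale or bias and that $P$ is a dense matrix mapping the sentence-embedding space into the hidden space; both are assumptions, as are all function names and the toy values.

```python
import math


def layer_norm(h, eps=1e-5):
    """Non-affine layer normalization (assumption: no learned gain/bias)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def matvec(P, u):
    """Multiply a matrix given as a list of rows by a vector."""
    return [sum(p * x for p, x in zip(row, u)) for row in P]


def cosine(u, v, eps=1e-12):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def tpo_alignment(h_ch, h_rj, P, u_topic):
    """sigma_i = cos(LN(h_ch) - LN(h_rj), P u_topic)."""
    delta_h = [a - b for a, b in zip(layer_norm(h_ch), layer_norm(h_rj))]
    return cosine(delta_h, matvec(P, u_topic))


# Hypothetical toy values: hidden size 4, sentence-embedding size 2.
P = [[-1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
u_topic = [1.0, 0.0]
sigma = tpo_alignment([1.0, 2.0, 3.0, 6.0], [6.0, 3.0, 2.0, 1.0], P, u_topic)
```

Swapping the chosen and rejected vectors flips the sign of the result, which is the property the comparison between DPO and DPO+TPO relies on.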

We additionally aggregate these cosines by topic and study their relationship to per-topic alignment gains. For each topic $t$, we compute

$$\overline{\sigma}_t = \frac{1}{|I_t|}\sum_{i\in I_t}\sigma_i, \tag{20}$$

$$\Delta\text{RM}_t = \text{RM}_t^{\text{TPO}} - \text{RM}_t^{\text{DPO}}, \tag{21}$$

$$\Delta\text{Help}_t = \text{Help}_t^{\text{TPO}} - \text{Help}_t^{\text{DPO}}, \tag{22}$$

where $I_t$ is the set of examples assigned to topic $t$, and $\text{RM}_t$, $\text{Help}_t$ are average reward-model and helpfulness scores on that topic for the corresponding model.
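The per-topic aggregation in Eqs. (20)–(22) amounts to a group-by over topic labels. The sketch below shows it for the cosine and reward-model columns; the helpfulness difference follows the same pattern. The function name, argument layout, and toy numbers are hypothetical.

```python
from collections import defaultdict


def per_topic_stats(topics, sigmas, rm_tpo, rm_dpo):
    """Return {topic: (mean sigma, delta RM)} from per-example values.

    topics: topic label per example.
    sigmas: cosine sigma_i per example.
    rm_tpo, rm_dpo: per-example reward-model scores for the two models.
    """
    groups = defaultdict(list)
    for idx, topic in enumerate(topics):
        groups[topic].append(idx)

    stats = {}
    for topic, idxs in groups.items():
        n = len(idxs)
        mean_sigma = sum(sigmas[i] for i in idxs) / n
        # Difference of per-topic averages: RM_t^TPO - RM_t^DPO.
        delta_rm = sum(rm_tpo[i] for i in idxs) / n - sum(rm_dpo[i] for i in idxs) / n
        stats[topic] = (mean_sigma, delta_rm)
    return stats


# Hypothetical toy values, not results from the paper.
stats = per_topic_stats(
    topics=["code", "code", "health"],
    sigmas=[0.2, 0.4, 0.1],
    rm_tpo=[1.0, 2.0, 0.5],
    rm_dpo=[0.5, 1.5, 0.5],
)
```

Pairs of mean cosine and score change, one per topic, are the inputs one would need for a rank or linear correlation across topics.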

Table 13: Per-topic average cosine similarity between $\Delta h$ and topic preference vectors, and corresponding changes in reward-model and helpfulness scores when adding TPO on top of DPO.

These statistics can be used to test whether topics with stronger alignment between $\Delta h$ and $\bar{u}_t$ tend to exhibit larger per-topic gains in reward-model and helpfulness scores.

### F.3 Structure of Topological Bridges

To better understand the structure of the bridges extracted by 0D persistent homology, we analyze their lengths and counts on held\-out batches\.

For each batch, we record the number of cross-label bridges $|\mathcal{B}|$ and the distribution of bridge lengths $\|v^{\text{topo}}_{(p,a)}\|_2$. We compare these quantities to those obtained from a k-nearest-neighbor baseline, where each prompt is connected to its nearest gold-answer neighbor in the batch.

These statistics provide a complementary view of how persistent-homology bridges differ from purely local nearest-neighbor connections, and help explain why PH-based bridges can yield stronger regularization than kNN-based directions in Table [6](https://arxiv.org/html/2605.07172#S4.T6).

### F.4 Qualitative Failure Cases

While topology\-enhanced objectives improve average alignment metrics, they are not universally beneficial\. We manually inspect a small number of prompts where TTL or TPO underperform their respective baselines\.

- Out-of-domain prompts. For certain highly out-of-distribution requests, we observe cases where TTL pulls the model's answer towards a frequent in-domain answer cluster, leading to less specific or partially off-topic responses.
- Noisy or ambiguous topics. For some long-tail HH-RLHF clusters with heterogeneous prompts, the automatically constructed topic description and preference vector may be noisy. In such cases, TPO can occasionally over-regularize answers toward overly generic or cautious responses.
- Over-smoothing trajectories. In rare cases, we see that strong topological regularization can make the answer style more uniform across diverse prompts, slightly reducing diversity in phrasing or creativity.

These examples highlight that topology\-enhanced objectives, like other alignment techniques, can introduce trade\-offs and should be applied in conjunction with careful evaluation and, where appropriate, human oversight\.

## Appendix G Additional TPO Details and Ablations

### G.1 Offline Topic Extraction Details

We use a sentence transformer as $\phi$ and set the number of clusters to $K=50$ unless otherwise noted. Prompts are clustered with MiniBatch KMeans on $\phi(x)$. For each cluster we randomly sample up to $M=32$ prompts and query a strong LLM with the instruction shown in the main text to obtain a 1–3 word topic label (e.g., *Python programming*, *Health advice*, *Creative writing*).

For each topictt, we construct several positive and negative templates; examples include:

- positive: “a helpful, harmless, and high-quality answer about $t$”;
- negative: “a harmful, unhelpful, and low-quality answer about $t$”;
- positive: “a clear, precise, and correct explanation regarding $t$”;
- negative: “a vague, confusing, and incorrect explanation regarding $t$”.

We encode all template sentences with $\phi$ and form candidate preference vectors as differences $e_{\text{pos}} - e_{\text{neg}}$, then average them to obtain $u_{t} \in \mathbb{R}^{d_{s}}$. Topics with fewer than 50 examples are merged into a generic "other" category.
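The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `toy_encode` is a hypothetical stand-in for the sentence transformer $\phi$ (it hashes character trigrams), and the names `TEMPLATES`, `preference_vector`, and `merge_rare_topics` are introduced here for illustration only. Only the two template pairs quoted above are used.

```python
import math
import zlib
from collections import Counter

def toy_encode(text, dim=16):
    """Hypothetical stand-in for the sentence encoder phi (NOT a real model):
    hashes character trigrams into a unit-norm bag-of-features vector."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        vec[zlib.crc32(text[i:i + 3].encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# (positive, negative) template pairs quoted in the text.
TEMPLATES = [
    ("a helpful, harmless, and high-quality answer about {t}",
     "a harmful, unhelpful, and low-quality answer about {t}"),
    ("a clear, precise, and correct explanation regarding {t}",
     "a vague, confusing, and incorrect explanation regarding {t}"),
]

def preference_vector(topic, encode=toy_encode):
    """Average of e_pos - e_neg over all template pairs, giving u_t."""
    diffs = []
    for pos, neg in TEMPLATES:
        e_pos = encode(pos.format(t=topic))
        e_neg = encode(neg.format(t=topic))
        diffs.append([p - n for p, n in zip(e_pos, e_neg)])
    return [sum(col) / len(diffs) for col in zip(*diffs)]

def merge_rare_topics(labels, min_count=50, other="other"):
    """Relabel topics with fewer than min_count examples as 'other'."""
    counts = Counter(labels)
    return [t if counts[t] >= min_count else other for t in labels]

u_t = preference_vector("Python programming")
print(len(u_t))
```

In practice `encode` would wrap the actual sentence transformer; passing it as an argument keeps the averaging logic independent of the encoder.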

### G.2 Topic-Aware vs. Topic-Agnostic Vectors

Table 14: Topic-aware vs. topic-agnostic preference vectors on HH-RLHF.

### G.3 Vector TPO vs. Fully Topological TPO

Table 15: Vector-difference TPO vs. fully topological TPO.

### G.4 Hidden-Layer and Cluster-Number Sensitivity

Table 16: Effect of hidden layer choice for TPO on HH-RLHF.

Table 17: Effect of number of clusters $K$ in topic extraction.

## Appendix H Efficiency and Overhead

Table 18: Training efficiency and overhead of topology-enhanced methods.

TTL and TPO incur a modest overhead dominated by pairwise distance computation and sorting; in our setup the slowdown is within 5–10%.

### H.1 Scalability with Micro-Batch Size

Table 19: Scalability of 0D persistent-homology extraction on Qwen2.5-7B-Instruct.

Table [19](https://arxiv.org/html/2605.07172#A8.T19) makes the trade-off explicit. For the micro-batch sizes used in our alignment experiments ($B \leq 32$), the additional wall-clock cost remains modest. At larger $B$, the $O(B^{2})$ distance computation and edge sorting become more noticeable, which motivates future work on kNN sparsification or low-dimensional projections before topology extraction.
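To make the quadratic growth concrete, the short sketch below counts the candidate edges that must be computed and sorted. It assumes a point cloud of $2B$ points per micro-batch, as in the stacked matrix of Eq. (23); the function name `rips_edge_count` is introduced here for illustration, and the printed numbers are plain combinatorics, not measurements from the paper.

```python
def rips_edge_count(n_points):
    """Number of pairwise distances (complete-graph edges) on n_points."""
    return n_points * (n_points - 1) // 2

for B in (8, 32, 128, 512):
    n = 2 * B  # assumption: two embeddings per example in the micro-batch
    print(f"B={B:4d}  points={n:5d}  edges to sort={rips_edge_count(n)}")
```

Quadrupling $B$ multiplies the edge count by roughly 16, which is why sparsifying the graph (for example, keeping only kNN edges) is a natural way to reduce the cost at larger batch sizes.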

## Appendix I Fully Topological TPO

Here we give details of the fully topological variant of TPO introduced in Section [3.4](https://arxiv.org/html/2605.07172#S3.SS4).

Given a batch of size $B$, we compute mean-pooled embeddings $h^{\text{ch}}_{i}, h^{\text{rj}}_{i} \in \mathbb{R}^{d}$ and form

$$Z^{\text{RL}} = \begin{bmatrix} H^{\text{rj}} \\ H^{\text{ch}} \end{bmatrix} \in \mathbb{R}^{2B \times d}, \tag{23}$$

with labels $l^{\text{RL}}_{i} = 0$ for rejected and $1$ for chosen. We compute the pairwise distance matrix on $Z^{\text{RL}}$ and run the 0D persistent-homology algorithm from Appendix [D](https://arxiv.org/html/2605.07172#A4), obtaining a set of death edges $\mathcal{P}^{\text{RL}}$. We retain only cross-label edges

$$\mathcal{B}^{\text{RL}} = \{(u,v) \in \mathcal{P}^{\text{RL}} \mid l^{\text{RL}}_{u} \neq l^{\text{RL}}_{v}\}, \tag{24}$$

and orient each such edge from rejected to chosen (swapping indices if necessary). Each bridge $(r,c) \in \mathcal{B}^{\text{RL}}$ induces an improvement direction

$$v^{\text{imp}}_{(r,c)} = Z^{\text{RL}}_{c} - Z^{\text{RL}}_{r}. \tag{25}$$

We associate each $r$ with its original example index $i(r)$ and topic $t_{i(r)}$, and compute the cosine loss with the projected topic vector $\bar{u}_{t_{i(r)}}$:

$$\mathcal{L}_{\text{Topo-TPO}} = \frac{1}{|\mathcal{B}^{\text{RL}}|} \sum_{(r,c) \in \mathcal{B}^{\text{RL}}} \Big[ 1 - \cos\big( v^{\text{imp}}_{(r,c)}, \bar{u}_{t_{i(r)}} \big) \Big]. \tag{26}$$

This loss can either replace the vector-difference TPO loss or be added to it with a small coefficient.
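The steps in Eqs. (23)–(26) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: Appendix D is not reproduced here, so `death_edges` uses the standard fact that the 0D death edges of a distance filtration are the edges accepted by Kruskal's algorithm with union-find (that is, the minimum spanning tree edges). The sketch also assumes Euclidean distance, that `topic_vecs[i]` already holds the projected topic vector $\bar{u}_{t_i}$ for example $i$, and that $i(r) = r$ because rejected rows are stacked first in Eq. (23). All function names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def death_edges(points):
    """0D persistence via Kruskal + union-find: every edge that merges two
    connected components is a death edge (together they form an MST)."""
    n = len(points)
    edges = sorted((euclidean(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merged = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merged.append((i, j))
    return merged

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)  # small epsilon avoids division by zero

def topo_tpo_loss(h_rej, h_ch, topic_vecs):
    B = len(h_rej)
    Z = h_rej + h_ch                     # Eq. (23): rejected rows, then chosen
    labels = [0] * B + [1] * B           # 0 = rejected, 1 = chosen
    bridges = []
    for u, v in death_edges(Z):
        if labels[u] != labels[v]:       # Eq. (24): keep cross-label edges
            bridges.append((u, v) if labels[u] == 0 else (v, u))
    if not bridges:
        return 0.0
    total = 0.0
    for r, c in bridges:
        v_imp = [zc - zr for zc, zr in zip(Z[c], Z[r])]   # Eq. (25)
        total += 1.0 - cosine(v_imp, topic_vecs[r])       # assumes i(r) = r
    return total / len(bridges)                           # Eq. (26)

h_rej = [[0.0, 0.0], [10.0, 0.0]]
h_ch = [[0.0, 1.0], [10.0, 1.0]]
print(topo_tpo_loss(h_rej, h_ch, [[0.0, 1.0], [0.0, 1.0]]))
```

In the toy example, each rejected point is bridged to its nearby chosen point, and both improvement directions point along the topic vector, so the loss is approximately zero. In training, the same computation would be carried out on tensors so that gradients flow through the improvement directions; the bridge selection itself is a discrete step.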