Linguistics-Aware Non-Distortionary LLM Watermarking
Summary
Introduces LUNA, a linguistics-aware LLM watermarking method that achieves non-distortionary embedding and model-free detection across multiple languages, significantly improving AUROC and perplexity preservation.
# Linguistics-Aware Non-Distortionary LLM Watermarking
Source: [https://arxiv.org/html/2606.00613](https://arxiv.org/html/2606.00613)
Shinwoo Park¹, Hyejin Park², Hyeseon An¹, Yo-Sub Han¹,†

¹Yonsei University, Seoul, Republic of Korea, {pshkhh, hsan, emmous}@yonsei.ac.kr

²Rensselaer Polytechnic Institute, Troy, NY, USA, parkh12@rpi.edu
###### Abstract
Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model-free detection with single-token non-distortion under the standard random-key model. LUNA estimates normalized next-tag entropy from part-of-speech contexts in an external corpus and uses it to set the depth of a non-distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains AUROC 0.9959 and the lowest mean absolute median perplexity shift, 0.045, across the twelve settings; its 95% bootstrap interval [0.022, 0.073] lies below all baseline intervals. LUNA also records the lowest mean on Self-BLEU, Distinct-1, surprisal, and entropy shifts; it is the only method that simultaneously achieves AUROC $> 0.99$ and $|\Delta\mathrm{PPL}_{\mathrm{med}}| < 0.1$ in a majority of settings, reaching this regime in 9 of the 12 settings while no baseline reaches it in more than 2. Our code is available at [https://github.com/Shinwoo-Park/luna_watermark](https://github.com/Shinwoo-Park/luna_watermark).
† Corresponding author.

## 1 Introduction
Large language models now generate fluent text at scale, creating practical needs for provenance, attribution, and disinformation control (Liu et al., [2024](https://arxiv.org/html/2606.00613#bib.bib30); Lalai et al., [2025](https://arxiv.org/html/2606.00613#bib.bib31); European Parliament and Council of the European Union, [2024](https://arxiv.org/html/2606.00613#bib.bib29)). Decoding-time watermarking addresses these needs by embedding a statistical signal during generation and testing for it after deployment (Kirchenbauer et al., [2023](https://arxiv.org/html/2606.00613#bib.bib34); Dathathri et al., [2024](https://arxiv.org/html/2606.00613#bib.bib40)). A deployment-ready watermark should satisfy three properties together: *single-token non-distortion*, where the next-token distribution equals the base distribution after marginalizing over watermark randomness (Aaronson and Kirchner, [2022](https://arxiv.org/html/2606.00613#bib.bib14); Kuditipudi et al., [2024](https://arxiv.org/html/2606.00613#bib.bib39); Dathathri et al., [2024](https://arxiv.org/html/2606.00613#bib.bib40)); *model-free detection*, so platforms and third-party auditors can verify provenance without querying the original model or a surrogate (Kirchenbauer et al., [2023](https://arxiv.org/html/2606.00613#bib.bib34); Park et al., [2026](https://arxiv.org/html/2606.00613#bib.bib11)); and *adaptivity*, since different contexts provide different amounts of reliable capacity (Lu et al., [2024](https://arxiv.org/html/2606.00613#bib.bib36); Wang et al., [2025](https://arxiv.org/html/2606.00613#bib.bib37); Park et al., [2026](https://arxiv.org/html/2606.00613#bib.bib11)). Prior work has not, to our knowledge, combined all three; recent adaptive non-distortionary designs draw adaptivity from model-side uncertainty, which ties detection to logits or surrogate forward passes and weakens public verifiability.
The central observation behind LUNA is linguistic. Languages differ systematically in how much grammatical choice a position permits. For example, after the part-of-speech context `DET ADJ` in English (e.g., “a quiet …”), the next tag is almost always `NOUN`, leaving little grammatical choice; after the Korean morpheme sequence `NNG JKO` (object marker), the next slot can be a verb, adverbial, or adnominal modifier, spreading probability over several tags. The first context yields a low normalized next-tag entropy, the second a high one. Such variation reflects the language and its analysis pipeline rather than any particular language model, so a part-of-speech-tagged corpus can estimate a reusable signal of local syntactic uncertainty (Comrie, [1989](https://arxiv.org/html/2606.00613#bib.bib99); Greenberg and others, [1963](https://arxiv.org/html/2606.00613#bib.bib100); Haspelmath, [2005](https://arxiv.org/html/2606.00613#bib.bib101)). Paired with a prefix-measurable non-distortionary sampler, this signal guides watermark capacity toward positions with greater grammatical choice while preserving the one-step marginal distribution, and it enables detection from the tokenizer, a tagger, and the secret key without model logits.
We introduce LUNA (Linguistics-Aware Non-Distortionary LLM Watermarking). LUNA estimates normalized next-tag entropy for part-of-speech contexts from an external corpus, reconstructs the current context $c_t$ from the prefix, retrieves $\lambda(c_t) \in [0,1]$, and maps it to a depth $m_t$ for a binary tournament sampler (Dathathri et al., [2024](https://arxiv.org/html/2606.00613#bib.bib40)). The schedule is prefix-measurable because $m_t$ is fixed before sampling $x_t$, which preserves single-token marginals under the random-key model and allows the detector to reconstruct the same depth sequence from text alone. We evaluate LUNA on a compact, typology-aware six-language grid spanning analytic English (Quirk et al., [1985](https://arxiv.org/html/2606.00613#bib.bib104); Marcus et al., [1993](https://arxiv.org/html/2606.00613#bib.bib103)), isolating Chinese (Li and Thompson, [1981](https://arxiv.org/html/2606.00613#bib.bib79); Xue et al., [2005](https://arxiv.org/html/2606.00613#bib.bib106)), agglutinative Korean (Sohn, [2001](https://arxiv.org/html/2606.00613#bib.bib80); Kim et al., [2024](https://arxiv.org/html/2606.00613#bib.bib71)) and Japanese (Tsujimura, [2013](https://arxiv.org/html/2606.00613#bib.bib107); Kuno, [1973](https://arxiv.org/html/2606.00613#bib.bib109)), fusional German (Haider, [2010](https://arxiv.org/html/2606.00613#bib.bib113); Vikner, [1995](https://arxiv.org/html/2606.00613#bib.bib112)), and templatic Semitic Arabic (McCarthy, [1981](https://arxiv.org/html/2606.00613#bib.bib117); Watson, [2002](https://arxiv.org/html/2606.00613#bib.bib118); Ryding, [2005](https://arxiv.org/html/2606.00613#bib.bib119)). Empirically, LUNA reaches AUROC 0.9959 and TPR at 5% FPR of 0.9868, within 0.011 of the strongest baseline, and records the lowest mean shift on each of the five quality metrics across the twelve settings.
## 2 Related Work

Table 1: Operational taxonomy of the primary baselines and LUNA. Column definitions appear in Section [2.4](https://arxiv.org/html/2606.00613#S2.SS4). The dagger (†) marks the diversified GumbelSoft variant, which softens the deterministic Gumbel-max decoding and therefore does not inherit the exact single-token distribution-preservation guarantee of EXP or SynthID-Text.

### 2.1 Distribution-Shifting and Adaptive Watermarks
A first family of language-model watermarks embeds detectable evidence by modifying the next-token distribution during decoding. KGW (Kirchenbauer et al., [2023](https://arxiv.org/html/2606.00613#bib.bib34)) partitions the vocabulary into keyed green and red lists, biases green-list logits before sampling, and detects the watermark through a one-proportion test on the observed green-token count. This design enables efficient model-free detection because the detector needs the text, key, and tokenizer rather than target-model logits. The same mechanism makes KGW single-token distortionary, since the sampler explicitly changes probability mass assigned to green-list tokens.

Adaptive variants change insertion or detection across positions. SWEET (Lee et al., [2024](https://arxiv.org/html/2606.00613#bib.bib35)) targets code generation and applies KGW-style bias only at positions whose model entropy exceeds a threshold; its detector reuses the same threshold. EWD (Lu et al., [2024](https://arxiv.org/html/2606.00613#bib.bib36)) leaves KGW-style generation unchanged and instead weights detected tokens by model-side entropy. MorphMark (Wang et al., [2025](https://arxiv.org/html/2606.00613#bib.bib37)) adapts insertion strength according to the cumulative probability mass of green-list tokens and keeps KGW-style detection. STELA (Park et al., [2026](https://arxiv.org/html/2606.00613#bib.bib11)) estimates part-of-speech context indeterminacy from a corpus and uses that signal to modulate both green-list bias and detection weighting. These methods show that context-dependent allocation can improve watermarking, while their operational requirements differ: SWEET and EWD require model-side entropy at detection time, MorphMark preserves KGW-style model-free detection, and STELA obtains model-free linguistic adaptivity through a tagger rather than logits.
### 2.2 Distribution-Preserving and Gumbel-Based Watermarks

A second family seeks watermark evidence while preserving the base decoding distribution under explicit randomness assumptions. Aaronson-style exponential-minimum sampling (Aaronson and Kirchner, [2022](https://arxiv.org/html/2606.00613#bib.bib14)) and the framework of Kuditipudi et al. ([2024](https://arxiv.org/html/2606.00613#bib.bib39)) instantiate this principle through keyed sampling schemes such as inverse-transform and exponential-minimum sampling. SynthID-Text (Dathathri et al., [2024](https://arxiv.org/html/2606.00613#bib.bib40)) introduces tournament sampling and supports a single-token non-distortionary configuration with binary tournaments; its detector computes keyed scores without using the language model at detection time. Although DAWA (He et al., [2025](https://arxiv.org/html/2606.00613#bib.bib82)) jointly optimizes generation and detection under explicit distortion constraints, its adaptive mechanism is derived from the model distribution and a surrogate model rather than from external linguistic signals. GumbelSoft (Fu et al., [2024](https://arxiv.org/html/2606.00613#bib.bib12)) addresses generation diversity in Gumbel-keyed watermarking. It replaces deterministic decoding with a softmax variant of Logits-Addition, sampling from $\mathrm{softmax}((\ell_t + \xi_t)/\tau)$, and detects by aggregating keyed scores $\xi_t[x_t]$ for observed tokens. This makes GumbelSoft a strong model-free baseline, although the paper does not establish the exact one-step distribution-preservation guarantee that we assign to EXP (Aaronson and Kirchner, [2022](https://arxiv.org/html/2606.00613#bib.bib14)) and the non-distortionary SynthID-Text configuration in Table [1](https://arxiv.org/html/2606.00613#S2.T1).
### 2.3 Multilingual and Cross-Lingual Watermarks

Multilingual and cross-lingual settings expose difficulties that English-only evaluations can hide: translation, segmentation, morphology, and script can alter the evidence available to a detector. Prior work examines watermark survival under translation, cross-lingual manipulation, and back-translation robustness (He et al., [2024](https://arxiv.org/html/2606.00613#bib.bib58); Al Ghanim et al., [2025](https://arxiv.org/html/2606.00613#bib.bib97); Mohamed and Gubri, [2025](https://arxiv.org/html/2606.00613#bib.bib98)), and robustness benchmarks show that paraphrasing, editing, and other transformations can substantially change watermark evidence (Rastogi and Pruthi, [2024](https://arxiv.org/html/2606.00613#bib.bib92); Tu et al., [2024](https://arxiv.org/html/2606.00613#bib.bib93); Liang et al., [2025](https://arxiv.org/html/2606.00613#bib.bib96)).
This line of work primarily asks whether watermark evidence remains detectable after text has been transformed across languages, domains, or surface forms\. LUNA addresses a complementary question at generation time: where should watermark capacity enter the text when languages differ in morphology, segmentation, word order, and script? Its schedule conditions tournament depth on language\-specific part\-of\-speech context entropy, making the source of watermark evidence measurable before any downstream transformation occurs\.
### 2.4 Operational Taxonomy

Table [1](https://arxiv.org/html/2606.00613#S2.T1) summarizes the primary baselines and LUNA. *Single-token Non-distortion* denotes one-step marginal preservation under the stated sampling assumptions; *Adaptive Insertion* and *Adaptive Detection* denote context-dependent signal allocation during generation and detection; *Model-free Detection* denotes detection without target or surrogate language-model forward passes; and *Linguistic Signal* denotes whether the adaptive signal is derived from corpus-estimated linguistic structure rather than model logits.
Green\-list methods obtain evidence through logit bias and sacrifice single\-token non\-distortion\. Distribution\-preserving methods preserve one\-step marginals under their sampling assumptions, yet they do not use an interpretable linguistic signal\. Adaptive methods split across insertion and detection, with some relying on model\-side entropy\. LUNA occupies the missing operational point: it inherits a non\-distortionary tournament backbone, replaces fixed schedules with part\-of\-speech context uncertainty, adapts both insertion and detection through the same signal, and supports detection without target or surrogate model access\.
## 3 Background

### 3.1 Typological Stress Test

LUNA assumes that watermark capacity should track how much grammatical choice a position affords; this depends on the morphological and syntactic profile of the language. The evaluation uses six languages that stress distinct interactions among morphology, word order, spacing, and script: analytic English and isolating Chinese (low-inflection SVO with different writing systems), agglutinative Korean and Japanese (particles and endings creating fine-grained POS sequences), fusional German (verb-second syntax with case and agreement), and templatic Arabic (Semitic root-and-pattern morphology with an abjad script). Table [2](https://arxiv.org/html/2606.00613#S3.T2) summarizes the stress points.

Table 2: Typological stress test used by the evaluation.
### 3.2 Tournament Sampling and Detection

SynthID-Text is a generative watermarking scheme built from three components: a random seed generator, a sampling algorithm, and a scoring function. Let $\mathcal{V}$ denote the vocabulary, $x_{<t}$ the prefix before position $t$, and

$$p_t(v) = \Pr_{\mathrm{base}}(x_t = v \mid x_{<t})$$

the next-token distribution passed to the sampling layer. Given a seed $r_t$ derived from the recent context and a watermarking key, SynthID-Text defines layer-wise keyed functions $g_1, \ldots, g_m$. For the binary configuration used in the non-distortionary setting, each $g_\ell(v, r_t)$ assigns a value in $\{0,1\}$ to candidate token $v$.

At a fixed depth $m$, tournament sampling first draws $2^m$ candidate tokens from $p_t$, with repetitions allowed. It then runs an $m$-layer knockout tournament: layer $\ell$ compares paired candidates with $g_\ell(\cdot, r_t)$, breaks ties randomly, and passes winners to the next layer until one token remains. SynthID-Text also admits a distortionary configuration with more than two competitors per match, which strengthens the watermark at the cost of token-level distortion. This subsection uses only the fixed-depth binary configuration; Section [4](https://arxiv.org/html/2606.00613#S4) introduces the adaptive depth schedule used by LUNA.
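The knockout procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SynthID-Text implementation: the hash-based `g_value` function, the string-valued `key` and `seed`, and the dictionary representation of $p_t$ are assumptions made for the example.

```python
import hashlib
import random


def g_value(key, seed, layer, token):
    """Hypothetical keyed binary function g_l(v, r_t): one pseudorandom bit from a hash."""
    digest = hashlib.sha256(f"{key}|{seed}|{layer}|{token}".encode()).digest()
    return digest[0] & 1


def tournament_sample(probs, depth, key, seed, rng):
    """Draw 2**depth candidates from `probs` and run a `depth`-layer binary knockout."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    # Candidates are drawn i.i.d. from p_t, so repetitions are allowed.
    candidates = rng.choices(tokens, weights=weights, k=2 ** depth)
    for layer in range(1, depth + 1):
        winners = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            ga = g_value(key, seed, layer, a)
            gb = g_value(key, seed, layer, b)
            if ga == gb:
                winners.append(rng.choice((a, b)))  # tie broken uniformly at random
            else:
                winners.append(a if ga > gb else b)
        candidates = winners
    return candidates[0]


if __name__ == "__main__":
    p_t = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
    print(tournament_sample(p_t, depth=3, key="secret", seed="ctx", rng=random.Random(0)))
```

Because the candidate pool holds $2^m$ entries, this literal form is practical only for small depths.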
For detection, SynthID-Text recomputes the same keyed scores on an observed sequence and aggregates them into a text-level statistic. For fixed depth $m$, a simplified score over valid positions $\mathcal{I}$ is

$$\operatorname{Score}_m(x) = \frac{1}{m|\mathcal{I}|} \sum_{t \in \mathcal{I}} \sum_{\ell=1}^{m} g_\ell(x_t, r_t). \tag{1}$$

Watermarked text tends to receive higher keyed scores because tournament sampling favors candidates with larger layer values. This score depends on the observed text, the key, and the seed generator; it does not require a forward pass through the language model at detection time.

Figure 1: Illustrative cross-language LUNA depth schedules for translations of the same semantic sentence. Each colored cell shows the tournament-depth tier selected from normalized next-tag entropy $\lambda(c_t)$: shallow uses $m_t = 5$, mid uses $m_t = 15$, and deep uses $m_t = 30$.
## 4 Method

### 4.1 Linguistic Depth Scheduling

LUNA modulates the fixed-depth SynthID-Text backbone by choosing the tournament depth from a linguistic signal. For language $L$, let $Q'$ denote the next fine-grained part-of-speech tag after context $c$, $\mathcal{S}_{L,c}$ the observed support of next tags in an external calibration corpus, and $K_{L,c} = |\mathcal{S}_{L,c}|$. With empirical probabilities $\hat{P}_L(q' \mid c)$, define

$$H_L(c) = -\sum_{q' \in \mathcal{S}_{L,c}} \hat{P}_L(q' \mid c) \log_2 \hat{P}_L(q' \mid c), \tag{2}$$

$$\lambda_L(c) = \begin{cases} 0, & K_{L,c} \leq 1, \\ \dfrac{H_L(c)}{\log_2 K_{L,c}}, & K_{L,c} > 1. \end{cases} \tag{3}$$

Thus $\lambda_L(c) \in [0,1]$ measures how diffuse the observed next-tag distribution is relative to its support. LUNA estimates these tables on CulturaX (Nguyen et al., [2024](https://arxiv.org/html/2606.00613#bib.bib5)), separate from evaluation data. At generation and detection time, lookup backs off from the primary order to lower-order contexts and returns $\lambda_{\mathrm{def}} = 0.5$ when no supported context is available.
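The table construction and backoff lookup can be illustrated with a minimal sketch. It assumes the calibration corpus has already been tagged and is available as lists of tag strings; the function names, the `tables` dictionary keyed by context order, and the toy tag sequences are hypothetical and not taken from the released code.

```python
import math
from collections import Counter, defaultdict


def build_lambda_table(tag_sequences, order=2):
    """Map each POS context (tuple of `order` tags, order >= 1) to normalized next-tag entropy."""
    next_counts = defaultdict(Counter)
    for tags in tag_sequences:
        for i in range(order, len(tags)):
            next_counts[tuple(tags[i - order:i])][tags[i]] += 1

    table = {}
    for context, counts in next_counts.items():
        k = len(counts)  # K_{L,c}: size of the observed next-tag support
        if k <= 1:
            table[context] = 0.0  # first case of Eq. (3)
            continue
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        table[context] = entropy / math.log2(k)  # second case of Eq. (3)
    return table


def lookup_lambda(tables, tags_prefix, default=0.5):
    """Back off from the highest available order to lower ones; `tables` maps order -> table."""
    for order in sorted(tables, reverse=True):
        if len(tags_prefix) >= order:
            context = tuple(tags_prefix[-order:])
            if context in tables[order]:
                return tables[order][context]
    return default  # lambda_def when no supported context exists


if __name__ == "__main__":
    corpus = [["DET", "ADJ", "NOUN"], ["DET", "ADJ", "NOUN"],
              ["NOUN", "VERB", "ADV"], ["NOUN", "VERB", "NOUN"]]
    tables = {2: build_lambda_table(corpus, 2), 1: build_lambda_table(corpus, 1)}
    print(lookup_lambda(tables, ["DET", "ADJ"]), lookup_lambda(tables, ["NOUN", "VERB"]))
```

In the toy corpus, the context `DET ADJ` is always followed by `NOUN` and receives $\lambda = 0$, while `NOUN VERB` splits evenly between two tags and receives $\lambda = 1$.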
LUNA maps $\lambda(c_t)$ to a three-tier depth schedule,

$$m_t = \begin{cases} m_{\min}, & \lambda(c_t) < \tau_1, \\ m_{\mathrm{mid}}, & \tau_1 \leq \lambda(c_t) < \tau_2, \\ m_{\max}, & \lambda(c_t) \geq \tau_2, \end{cases} \tag{4}$$

with the default schedule $(m_{\min}, m_{\mathrm{mid}}, m_{\max}) = (5, 15, 30)$. Thresholds $\tau_1$ and $\tau_2$ are frequency-weighted 25th and 75th percentiles of $\lambda_L$ on the calibration table. We adopt a three-tier discretization as a simple and auditable instantiation of depth scheduling: the schedule has only two free thresholds that are calibrated from the same corpus used for $\lambda$, and tier identities are easy to inspect during error analysis. Finer discretizations or a continuous mapping $m_t = f(\lambda(c_t))$ are natural extensions. The schedule is prefix-measurable because $c_t$, $\lambda(c_t)$, and $m_t$ are all determined before sampling the current token.
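A minimal sketch of the threshold calibration and tier mapping follows. The weighted-percentile convention used here (the smallest value whose cumulative weight share reaches the target) is an assumption, since the text does not specify one; the function names are likewise hypothetical.

```python
def weighted_percentile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1).

    Assumed convention: `weights` are context frequencies in the calibration corpus.
    """
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running / total >= q:
            return value
    return pairs[-1][0]


def depth_for(lam, tau1, tau2, depths=(5, 15, 30)):
    """Three-tier schedule of Eq. (4) with the default ladder (5, 15, 30)."""
    m_min, m_mid, m_max = depths
    if lam < tau1:
        return m_min
    if lam < tau2:
        return m_mid
    return m_max


if __name__ == "__main__":
    lambdas = [0.1, 0.4, 0.6, 0.9]   # toy lambda values, one per context
    counts = [1, 1, 1, 1]            # toy context frequencies
    tau1 = weighted_percentile(lambdas, counts, 0.25)
    tau2 = weighted_percentile(lambdas, counts, 0.75)
    print(tau1, tau2, [depth_for(x, tau1, tau2) for x in lambdas])
```

Because both thresholds come from the calibration table alone, a detector holding the same table recovers the same tier for every context.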
Figure [1](https://arxiv.org/html/2606.00613#S3.F1) illustrates the typological motivation: the same semantic content induces different LUNA depth schedules across the six evaluation languages.

### 4.2 Variable-Depth Generation and Model-Free Detection

LUNA extends the fixed-depth binary tournament in Section [3.2](https://arxiv.org/html/2606.00613#S3.SS2) by replacing the constant depth $m$ with the prefix-measurable depth $m_t$. Conditioned on a prefix and its depth, the current sampling step applies the same binary tournament layers used by SynthID-Text. For notation and implementation, we write the binary tournament in its probability-rescaling form. Let $G_{t,v}^{(\ell)} \in \{0,1\}$ denote the value assigned to candidate token $v$ at layer $\ell$ for position $t$. Starting from $q_t^{(0)} = p_t$, LUNA applies

$$\mu_t^{(\ell)} = \sum_{u \in \mathcal{V}} q_t^{(\ell-1)}(u)\, G_{t,u}^{(\ell)}, \tag{5}$$

$$q_t^{(\ell)}(v) = q_t^{(\ell-1)}(v) \bigl(1 + G_{t,v}^{(\ell)} - \mu_t^{(\ell)}\bigr) \tag{6}$$

for $\ell = 1, \ldots, m_t$, and then samples

$$x_t \sim q_t^{(m_t)}. \tag{7}$$

A repeated-context safeguard leaves the base distribution unchanged when the current hash context repeats in the recent history; the detector skips the same positions. Figure [2](https://arxiv.org/html/2606.00613#S4.F2) illustrates the generation-time operation of LUNA.
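The probability-rescaling update of Equations (5) to (7) can be sketched as follows. The dictionary representation of distributions and the explicit list of per-layer bit assignments `g_layers` are assumptions for illustration; in practice the bits would come from the keyed function, and the repeated-context safeguard is omitted here.

```python
import random


def rescale_layer(q, g):
    """One layer: q'(v) = q(v) * (1 + g[v] - mu), with mu = sum_u q(u) * g[u]."""
    mu = sum(q[v] * g[v] for v in q)
    return {v: q[v] * (1 + g[v] - mu) for v in q}


def tournament_distribution(p, g_layers):
    """Apply len(g_layers) rescaling layers to the base distribution p."""
    q = dict(p)
    for g in g_layers:
        q = rescale_layer(q, g)
    return q


def sample_token(p, g_layers, rng):
    """Sample x_t from the rescaled distribution q^(m_t)."""
    q = tournament_distribution(p, g_layers)
    tokens = list(q)
    return rng.choices(tokens, weights=[q[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    p_t = {"a": 0.5, "b": 0.3, "c": 0.2}
    layer = {"a": 1, "b": 0, "c": 0}          # toy bit assignment for one layer
    print(tournament_distribution(p_t, [layer]))
    print(sample_token(p_t, [layer, layer], random.Random(0)))
```

With a single layer that marks only token `a`, the mass of `a` rises from 0.5 to 0.75, which equals the chance that `a` wins a two-candidate match: $0.5^2 + 2 \cdot 0.5 \cdot 0.5$. Each layer keeps the distribution normalized, since $\sum_v q(v)(1 + G_v - \mu) = 1 + \mu - \mu = 1$.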
Figure 2: Generation-time operation of LUNA. For each prefix $x_{<t}$, the base language model supplies the next-token distribution $p_t(v)$, while the linguistic branch reconstructs the POS context $c_t$, looks up the precomputed normalized next-tag entropy $\lambda(c_t)$, and maps it to a tournament depth $m_t$. LUNA then applies an $m_t$-layer binary tournament that reweights $p_t(v)$ before sampling $x_t$.

Detection uses the text, tokenizer, part-of-speech tagger, linguistic signal ($\lambda$) table, and secret key. It does not access logits or forward passes of the original generation model, nor does it run a surrogate model. The detector aligns tag spans to token positions, reconstructs $c_t$, $\lambda(c_t)$, and $m_t$ at every valid position, and computes

$$S_t = \sum_{\ell=1}^{m_t} \left(G_{t,x_t}^{(\ell)} - \frac{1}{2}\right), \tag{8}$$

$$Z = \frac{\sum_{t \in \mathcal{I}} \omega_t S_t}{\sqrt{\frac{1}{4} \sum_{t \in \mathcal{I}} m_t \omega_t^2}}, \tag{9}$$

where $\mathcal{I}$ is the set of valid positions and $\omega_t = \lambda(c_t)$. Under the random-key null, each centered value $G_{t,x_t}^{(\ell)} - 1/2$ has variance $1/4$, so the denominator standardizes the weighted sum and $Z$ is comparable to a standard normal score. Appendix [A](https://arxiv.org/html/2606.00613#A1) gives full pseudocode for lookup, generation, and detection.
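The detection statistic of Equations (8) and (9) can be sketched as below. The sketch assumes the detector has already reconstructed, for every valid position, the $m_t$ layer bits of the observed token and the weight $\omega_t = \lambda(c_t)$; the input format and the function name are hypothetical.

```python
import math


def detection_z(positions):
    """Weighted, standardized score Z.

    `positions` is an iterable of (g_bits, weight) pairs, where g_bits holds the
    m_t layer values G^(l)_{t, x_t} of the observed token and weight is lambda(c_t).
    """
    numerator = 0.0
    variance = 0.0
    for g_bits, weight in positions:
        s_t = sum(bit - 0.5 for bit in g_bits)            # Eq. (8)
        numerator += weight * s_t
        variance += 0.25 * len(g_bits) * weight ** 2      # null variance of weight * S_t
    if variance == 0.0:
        return 0.0  # no usable evidence (no positions, or all weights zero)
    return numerator / math.sqrt(variance)                # Eq. (9)


if __name__ == "__main__":
    toy = [([1, 0], 0.5), ([1, 1], 1.0)]
    print(detection_z(toy))
```

Positions with $\lambda(c_t) = 0$ contribute to neither the numerator nor the denominator, so fully determined contexts are effectively ignored by the detector.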
### 4.3 Single-Token Marginal Preservation

###### Theorem 1 (Single-token marginal preservation).

Fix a prefix $x_{<t}$ and let $p_t$ be the base distribution passed to the sampler. Assume that $m_t = m(x_{<t})$ is prefix-measurable and independent of the layer-wise watermark randomness at position $t$. Under the standard random-key model (Aaronson and Kirchner, [2022](https://arxiv.org/html/2606.00613#bib.bib14); Kuditipudi et al., [2024](https://arxiv.org/html/2606.00613#bib.bib39); Dathathri et al., [2024](https://arxiv.org/html/2606.00613#bib.bib40)), in which $G_{t,v}^{(\ell)} \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(1/2)$ across the index tuples $(t, \ell, v)$, the tournament update of Equations [5](https://arxiv.org/html/2606.00613#S4.E5) and [6](https://arxiv.org/html/2606.00613#S4.E6) satisfies

$$\mathbb{E}_G\left[\Pr_{\mathrm{LUNA}}(x_t = v \mid x_{<t}, G)\right] = p_t(v)$$

for every $v \in \mathcal{V}$.

Theorem [1](https://arxiv.org/html/2606.00613#Thmtheorem1) establishes a one-step marginal result under the random-key model. It does not claim equality of the realized fixed-key distribution at a single step, nor equality of the full joint distribution over sequences. The proof follows by conditioning on the prefix so that $m_t$ is fixed, applying the fixed-depth tournament expectation layer by layer, and using $\mathbb{E}[G_{t,v}^{(\ell)}] = 1/2$. Appendix [A.3](https://arxiv.org/html/2606.00613#A1.SS3) provides the full proof and implementation-level details.
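The one-step claim can be checked numerically on a toy example by enumerating every bit assignment rather than sampling. The sketch below is a sanity check under the i.i.d. Bernoulli(1/2) assumption, not a substitute for the proof; the three-token vocabulary and the helper names are hypothetical.

```python
from itertools import product


def tournament_distribution(p, g_layers):
    """Probability-rescaling form of the binary tournament (Eqs. 5 and 6)."""
    q = dict(p)
    for g in g_layers:
        mu = sum(q[v] * g[v] for v in q)
        q = {v: q[v] * (1 + g[v] - mu) for v in q}
    return q


def expected_distribution(p, depth):
    """Average q^(depth) over all equally likely bit assignments G."""
    tokens = list(p)
    n_bits = depth * len(tokens)
    expected = {v: 0.0 for v in tokens}
    for bits in product((0, 1), repeat=n_bits):
        g_layers = [
            dict(zip(tokens, bits[l * len(tokens):(l + 1) * len(tokens)]))
            for l in range(depth)
        ]
        q = tournament_distribution(p, g_layers)
        for v in tokens:
            expected[v] += q[v] / 2 ** n_bits
    return expected


if __name__ == "__main__":
    p_t = {"a": 0.5, "b": 0.3, "c": 0.2}
    for depth in (1, 2, 3):
        print(depth, expected_distribution(p_t, depth))
```

The averaged distribution matches $p_t$ up to floating-point error at every depth tried, while any single fixed assignment of bits generally does not; this is the distinction the theorem draws between the random-key marginal and the realized fixed-key distribution.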
Table 3: Evaluation languages, generation models, and part-of-speech pipelines.

## 5 Experimental Settings

### 5.1 Languages and Models

The evaluation covers six languages and two domains (Wikipedia, news), yielding 12 language-by-domain settings. Each language uses an instruction-tuned generation model that natively supports it, alongside a language-specific part-of-speech (POS) pipeline. Table [3](https://arxiv.org/html/2606.00613#S4.T3) summarizes the main experimental setup. Appendix [B](https://arxiv.org/html/2606.00613#A2) gives full model identifiers, part-of-speech backends, tagsets, and selected context orders. For perplexity-based quality evaluation, we use Qwen2.5-1.5B (Yang et al., [2024](https://arxiv.org/html/2606.00613#bib.bib59)) as a shared reference model across languages.
### 5.2 Datasets

LUNA estimates $\lambda$ tables from CulturaX (Nguyen et al., [2024](https://arxiv.org/html/2606.00613#bib.bib5)), using 20,000 held-out records per language with a length filter of 300 to 4000 characters. The same held-out corpus supplies calibration for STELA. No evaluation prompt or generated output enters the calibration corpus. We use two dataset families: Wikipedia continuations for all six languages (Foundation, [2023](https://arxiv.org/html/2606.00613#bib.bib6)), and news continuations from XL-Sum (Hasan et al., [2021](https://arxiv.org/html/2606.00613#bib.bib7)) for English, Chinese, Korean, Japanese, and Arabic, plus MLSum (Scialom et al., [2020](https://arxiv.org/html/2606.00613#bib.bib8)) for German. Each language-by-domain setting contains 500 records, so each algorithm runs on 6,000 evaluation records.
### 5.3 Baselines and Generation Protocol

We compare LUNA with eight baselines: KGW, EWD, SWEET, MorphMark, STELA, GumbelSoft, EXP, and SynthID-Text. SynthID-Text is configured to match the expected tournament budget $B = \mathbb{E}[2^{m_t}]$ induced by the LUNA depth ladder, equalizing the average per-token distortion budget across the two methods; the matching formula appears in Appendix [B.2](https://arxiv.org/html/2606.00613#A2.SS2). All methods sample with temperature 0.7, nucleus probability 0.95, no top-$k$ cap, and 200–256 new tokens; Qwen2.5-0.5B uses repetition penalty 1.1, others 1.0. Watermarked, unwatermarked, and human-reference texts are truncated to at most 256 generation-tokenizer tokens before detection so that model-aware detectors fit within GPU memory at equal evidence length. Detailed seeds, context orders, and method-specific hyperparameters appear in Appendix [B.3](https://arxiv.org/html/2606.00613#A2.SS3); experiments run on a single NVIDIA RTX 3090 GPU with 24 GB of memory.

Table 4: Main detection and quality preservation results, 12-setting mean. SBleu, $\mathrm{Dist}_1$, Surp, and Entr abbreviate Self-BLEU, Distinct-1, surprisal, and entropy.
### 5.4 Evaluation Metrics

#### Detection metrics.

We use AUROC and TPR at 5% FPR. Both compare watermarked outputs with unwatermarked outputs generated by the same base model from the same prompts. AUROC summarizes the full ROC curve; TPR at 5% FPR fixes a deployment-relevant operating point.
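Both detection metrics can be computed directly from per-text detector scores, as in the sketch below. The pairwise AUROC formula is standard; the thresholding convention for TPR at a fixed FPR (allow at most $\lfloor 0.05 \cdot n \rfloor$ unwatermarked texts above the threshold) is an assumption, since the text does not state which convention the evaluation uses.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a watermarked score exceeds an unwatermarked one (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def tpr_at_fpr(pos_scores, neg_scores, fpr=0.05):
    """TPR at the strictest threshold whose empirical FPR stays at or below `fpr`."""
    ranked = sorted(neg_scores, reverse=True)
    allowed = int(fpr * len(ranked))  # negatives permitted above the threshold
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    # A text is flagged when its score is strictly above the threshold.
    return sum(p > threshold for p in pos_scores) / len(pos_scores)


if __name__ == "__main__":
    unwatermarked = list(range(20))             # toy detector scores
    watermarked = [17, 18.5, 19.5, 25]
    print(auroc(watermarked, unwatermarked), tpr_at_fpr(watermarked, unwatermarked))
```

The quadratic AUROC loop is adequate for a few hundred texts per setting; a rank-based formulation would be preferable for larger samples.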
#### Quality metrics.

For each text-level quality statistic $Q$, we form the absolute setting-level shift $|\Delta Q| = |Q_{\mathrm{w}} - Q_{\mathrm{u}}|$, where $Q_{\mathrm{w}}$ is computed on the watermarked outputs of a setting and $Q_{\mathrm{u}}$ on the unwatermarked outputs of the same setting and prompts. We define five statistics covering complementary notions of distortion. $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ uses median perplexity under Qwen2.5-1.5B and captures the likelihood of the generated text under the reference model. $|\Delta\text{Self-BLEU}|$ uses corpus-level Self-BLEU for intra-output lexical repetition. $|\Delta\text{Distinct-1}|$ uses the Distinct-1 ratio for unigram diversity at the surface level. $|\Delta\mathrm{Surprisal}|$ and $|\Delta\mathrm{Entropy}|$ use the mean token-level surprisal and predictive entropy under the same reference model, capturing distortion at the next-token-distribution level.
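As an illustration of the shift computation, the sketch below derives $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ for one setting from per-token natural-log probabilities. It assumes those log-probabilities have already been obtained from the reference model; the input format and function names are hypothetical.

```python
import math
import statistics


def perplexity(token_logprobs):
    """Perplexity of one text from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def median_ppl_shift(watermarked_logprobs, unwatermarked_logprobs):
    """|Q_w - Q_u| where Q is the median per-text perplexity within a setting."""
    q_w = statistics.median([perplexity(lp) for lp in watermarked_logprobs])
    q_u = statistics.median([perplexity(lp) for lp in unwatermarked_logprobs])
    return abs(q_w - q_u)


if __name__ == "__main__":
    wm = [[math.log(0.25)] * 4, [math.log(0.25)] * 2]   # toy texts with perplexity 4
    base = [[math.log(0.5)] * 3]                        # toy text with perplexity 2
    print(median_ppl_shift(wm, base))
```

The same pattern applies to the other four statistics: compute $Q$ separately on the watermarked and unwatermarked outputs of a setting, then take the absolute difference.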
#### Aggregation and confidence intervals.

All statistics are aggregated at the setting level: we first compute each statistic within each of the 12 language-by-domain settings and then report the mean over settings. Bootstrap 95% confidence intervals resample the 12 settings with replacement over 1000 iterations. Section [6.1](https://arxiv.org/html/2606.00613#S6.SS1) reports both the mean and the bootstrap interval for $|\Delta\mathrm{PPL}_{\mathrm{med}}|$; intervals for the other four quality metrics appear in Appendix [C](https://arxiv.org/html/2606.00613#A3).
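The setting-level bootstrap can be sketched as follows. The percentile method for reading off the interval endpoints and the fixed random seed are assumptions; the text specifies only that the 12 settings are resampled with replacement over 1000 iterations. The input values in the usage example are placeholders, not results from the paper.

```python
import random


def bootstrap_ci(setting_values, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean over settings."""
    rng = random.Random(seed)
    n = len(setting_values)
    means = sorted(
        sum(rng.choices(setting_values, k=n)) / n  # resample settings with replacement
        for _ in range(iterations)
    )
    lower = means[int((alpha / 2) * iterations)]
    upper = means[int((1 - alpha / 2) * iterations) - 1]
    return lower, upper


if __name__ == "__main__":
    # Placeholder per-setting shifts for 12 settings (illustrative values only).
    shifts = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.02, 0.03, 0.08, 0.10, 0.01, 0.09]
    print(bootstrap_ci(shifts))
```

Resampling settings rather than individual texts treats each language-by-domain cell as the unit of analysis, so the interval reflects variation across settings rather than within them.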
## 6 Experimental Results

### 6.1 Main Detection-Quality Results

Table [4](https://arxiv.org/html/2606.00613#S5.T4) reports the experimental results. For every method that exposes a part-of-speech context order as a hyperparameter, namely LUNA and STELA, results use the per-algorithm per-setting best context order from Table [9](https://arxiv.org/html/2606.00613#A2.T9) (Appendix [B.4](https://arxiv.org/html/2606.00613#A2.SS4)). Bold values mark the best entry per column.
#### Detection saturation.

Six methods achieve AUROC above 0.995: EWD, SWEET, KGW, STELA, SynthID-Text, and LUNA. Within this regime, the AUROC gap between EWD and LUNA is only 0.0031, while the TPR-at-5%-FPR gap is 0.0104. Both gaps are small in absolute terms and fall within the bootstrap variability reported in Section [6.1](https://arxiv.org/html/2606.00613#S6.SS1) and Appendix [C](https://arxiv.org/html/2606.00613#A3), so the detection ranking at this level no longer reflects a deployment-meaningful performance separation. Furthermore, EWD and SWEET require language-model forward passes at detection time, while KGW, STELA, SynthID-Text, and LUNA detect from text, tokenizer, tagger, and secret key alone; LUNA therefore matches the strongest model-based detector within these margins without requiring the language model at verification.

Table 5: Controlled comparisons against LUNA, averaged over the 12 settings. Detection columns report LUNA minus the control; quality columns report the control divided by LUNA, so factors above 1 indicate that LUNA changes the metric less.
#### Dominant multi\-metric quality preservation\.
LUNA records the lowest mean shift on every one of the five quality metrics. Relative to the closest baseline (MorphMark across all five metrics), LUNA achieves a $9.5\times$ reduction on $|\Delta\mathrm{PPL}_{\mathrm{med}}|$, $1.5\times$ on $|\Delta\text{Self-BLEU}|$, $1.8\times$ on $|\Delta\text{Distinct-1}|$, $8.1\times$ on $|\Delta\mathrm{Surprisal}|$, and $2.4\times$ on $|\Delta\mathrm{Entropy}|$. The dominance covers complementary aspects of distortion at once: the language-model probability of the generated text ($|\Delta\mathrm{PPL}_{\mathrm{med}}|$), its lexical structure ($|\Delta\text{Self-BLEU}|$, $|\Delta\text{Distinct-1}|$), and the realized next-token-distribution statistics ($|\Delta\mathrm{Surprisal}|$, $|\Delta\mathrm{Entropy}|$).
#### Bootstrap\-significant gap on the quality metric\.
The bootstrap analysis confirms that the perplexity-shift gap is statistically robust. The LUNA confidence interval $[0.022, 0.073]$ does not overlap any baseline interval, and the next-lowest baseline lower bound is 0.158 (MorphMark). LUNA exhibits bootstrap-significantly lower $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ than every baseline at the 95% confidence level. Appendix [C](https://arxiv.org/html/2606.00613#A3) reports the full CI table.
### 6\.2Ablation Study
Table [5](https://arxiv.org/html/2606.00613#S6.T5) compares LUNA with three targeted references that isolate the main design decisions behind the method. STELA is the closest linguistic baseline: it uses a corpus-estimated part-of-speech signal, yet injects that signal through a distortionary green-list bias. SynthID-Text is the closest tournament baseline: it uses a non-distortionary binary tournament backbone, yet allocates watermark capacity without a linguistic signal. SynthID-Text-Entropy is a controlled baseline introduced in this paper. It replaces the corpus-estimated linguistic signal of LUNA with language-model entropy, thereby testing whether model-side uncertainty can substitute for the proposed POS-context signal. Appendix [D](https://arxiv.org/html/2606.00613#A4) gives the full construction.
#### Linguistic signal without non\-distortion: STELA\.
STELA and LUNA both use corpus-estimated POS-context uncertainty. The difference lies in the sampling backbone: STELA injects the signal through green-list logit bias, whereas LUNA uses it to modulate the depth of a non-distortionary tournament sampler. This comparison shows the value of replacing a distortionary linguistic watermark with a non-distortionary tournament mechanism. At comparable detection (AUROC and TPR@5% within 0.0023 and 0.0085, respectively), LUNA reduces the five quality shifts by $3.96\times$ to $26.41\times$.
#### Tournament sampling without linguistic scheduling: SynthID\-Text\.
SynthID-Text and LUNA share the binary tournament backbone. The difference is the source of the schedule: SynthID-Text uses prefix-hash randomness, while LUNA uses $\lambda(c_t)$ to place more capacity in high-uncertainty POS contexts. This comparison isolates the effect of linguistic scheduling within the same tournament family. LUNA reduces all five quality shifts by $2.15\times$ to $10.35\times$ while retaining nearly the same AUROC and TPR@5%.
#### Model entropy instead of linguistic entropy: SynthID\-Text\-Entropy\.
SynthID-Text-Entropy is a new controlled baseline designed for this study. It asks whether model-derived entropy can replace the external linguistic signal used by LUNA. The variant keeps the SynthID-Text tournament family and budget matching, yet uses language-model entropy as the adaptive signal rather than the corpus-estimated POS-context entropy used by LUNA. This gives a strong model-aware comparison point: detection is nearly identical to LUNA, with gaps of only $-0.0001$ AUROC and $-0.0007$ TPR@5%. However, the SynthID-Text-Entropy detector requires language-model forward passes at verification time, which sacrifices model-free detection, and LUNA still improves four of five quality metrics by $1.59\times$ to $1.76\times$ on average.
## 7Conclusion
LUNA combines part-of-speech context entropy with a non-distortionary tournament sampler to jointly satisfy single-token non-distortion, model-free detection, and linguistic adaptivity. Across six typologically diverse languages and two domains, it records the lowest mean shift on five quality metrics and is the only method reaching AUROC $> 0.99$ and $|\Delta\mathrm{PPL}_{\mathrm{med}}| < 0.1$ in a majority of settings.
## Limitations
LUNA uses part-of-speech context entropy as a linguistic proxy for watermark capacity. This proxy captures syntactic uncertainty rather than every form of linguistic choice. It does not directly model semantic alternatives, discourse structure, pragmatic constraints, or register. The empirical results suggest that syntactic uncertainty provides a useful control signal, while richer linguistic schedules could combine POS context with morphology, dependency structure, discourse state, or semantic classes. LUNA also discretizes $\lambda(c_t)$ into three depth tiers; finer-grained tiers or a continuous mapping $m_t = f(\lambda(c_t))$ are natural extensions that we leave to future work. Such extensions would test how much of the watermark capacity arises from syntax alone and how much comes from broader linguistic organization.
The method also depends on language\-specific analyzers and entropy tables\. We use deterministic POS pipelines and keep the same tagger and tagset across calibration, generation, and detection\. This design makes the schedule auditable, yet it transfers responsibility to the linguistic preprocessing layer\. Languages with limited taggers, unstable segmentation, code switching, or domain\-specific orthography may require additional calibration\. Future work can study tagger uncertainty, multilingual tagset normalization, and analyzer ensembles that preserve model\-free detection while reducing dependence on a single preprocessing pipeline\.
The theoretical guarantee has a precise scope\. LUNA preserves single\-token marginals under the standard random\-key model for the non\-distortionary tournament sampler\. This statement does not imply equality of the full joint sequence distribution for a fixed key, and it does not provide an inherent guarantee against paraphrase, translation, editing, or adversarial attacks\. These transformations can change the observed POS sequence, the reconstructed schedule, or the keyed evidence\. Our evaluation therefore treats robustness as an empirical question rather than as a theorem\-level property\.
Finally, model\-free detection does not mean infrastructure\-free detection\. A verifier still needs the tokenizer, the POS analyzer, the entropy table, and the secret key\. This requirement is substantially weaker than access to target\-model logits or surrogate forward passes, and it supports public\-verification scenarios more naturally than model\-dependent adaptive schemes\. Nevertheless, deployment would need key management, versioning of entropy tables, and documented analyzer configurations\. These operational requirements define a concrete path for extending LUNA from a research watermark to an auditable multilingual provenance system\.
## References
- Watermarking GPT outputs\.Note:Technical report / blog postExternal Links:[Link](https://www.scottaaronson.com/blog/?p=6823)Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.00613#S2.SS2.p1.2),[Table 1](https://arxiv.org/html/2606.00613#S2.T1.2.10.6.1),[Theorem 1](https://arxiv.org/html/2606.00613#Thmtheorem1.p1.6.6)\.
- M\. Al Ghanim, J\. Xue, R\. P\. Hastuti, M\. Zheng, Y\. Solihin, and Q\. Lou \(2025\)Evaluating the robustness and accuracy of text watermarking under real\-world cross\-lingual manipulations\.InFindings of the Association for Computational Linguistics \(EMNLP\),Cited by:[§2\.3](https://arxiv.org/html/2606.00613#S2.SS3.p1.1)\.
- B\. Comrie \(1989\)Language universals and linguistic typology: syntax and morphology\.University of Chicago press\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p2.1)\.
- S\. Dathathri, A\. See, S\. Ghaisas, P\. Huang, R\. McAdam, J\. Welbl, V\. Bachani, A\. Kaskasoli, R\. Stanforth, T\. Matejovicova,et al\.\(2024\)Scalable watermarking for identifying large language model outputs\.Nature\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p1.1),[§1](https://arxiv.org/html/2606.00613#S1.p3.9),[§2\.2](https://arxiv.org/html/2606.00613#S2.SS2.p1.2),[Table 1](https://arxiv.org/html/2606.00613#S2.T1.2.11.7.1),[Theorem 1](https://arxiv.org/html/2606.00613#Thmtheorem1.p1.6.6)\.
- European Parliament and Council of the European Union \(2024\)Regulation \(EU\) 2024/1689 laying down harmonised rules on artificial intelligence\.Note:Official Journal of the European UnionExternal Links:[Link](https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng)Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p1.1)\.
- Wikimedia Foundation \(2023\)Wikimedia wikipedia dataset\.External Links:[Link](https://huggingface.co/datasets/wikimedia/wikipedia)Cited by:[§5\.2](https://arxiv.org/html/2606.00613#S5.SS2.p1.1)\.
- J\. Fu, X\. Zhao, R\. Yang, Y\. Zhang, J\. Chen, and Y\. Xiao \(2024\)GumbelSoft: diversified language model watermarking via the GumbelMax\-trick\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§2\.2](https://arxiv.org/html/2606.00613#S2.SS2.p1.2),[Table 1](https://arxiv.org/html/2606.00613#S2.T1.1.1.1)\.
- J\. H\. Greenberget al\.\(1963\)Some universals of grammar with particular reference to the order of meaningful elements\.Universals of language\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p2.1)\.
- H\. Haider \(2010\)The syntax of german\.Cambridge University Press\.Cited by:[Appendix H](https://arxiv.org/html/2606.00613#A8.SS0.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- T\. Hasan, A\. Bhattacharjee, Md\. S\. Islam, K\. Mubasshir, Y\. Li, Y\. Kang, M\. S\. Rahman, and R\. Shahriyar \(2021\)XL\-sum: large\-scale multilingual abstractive summarization for 44 languages\.InFindings of the Association for Computational Linguistics \(ACL\),Cited by:[§5\.2](https://arxiv.org/html/2606.00613#S5.SS2.p1.1)\.
- M\. Haspelmath \(2005\)The world atlas of language structures\.Oxford University Press\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p2.1)\.
- H\. He, Y\. Liu, Z\. Wang, Y\. Mao, and Y\. Bu \(2025\)Theoretically Grounded Framework for LLM Watermarking: A Distribution\-Adaptive Approach\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems \(NeurIPS\),Cited by:[§2\.2](https://arxiv.org/html/2606.00613#S2.SS2.p1.2)\.
- Z\. He, B\. Zhou, H\. Hao, A\. Liu, X\. Wang, Z\. Tu, Z\. Zhang, and R\. Wang \(2024\)Can watermarks survive translation? on the cross\-lingual consistency of text watermark for large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Cited by:[§2\.3](https://arxiv.org/html/2606.00613#S2.SS3.p1.1)\.
- J\. M\. Kim, Y\. Lee, Y\. Han, H\. Choi, and S\. Jung \(2024\)Does incomplete syntax influence korean language model? focusing on word order and case markers\.InFirst Conference on Language Modeling \(COLM\),Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- J\. Kirchenbauer, J\. Geiping, Y\. Wen, J\. Katz, I\. Miers, and T\. Goldstein \(2023\)A watermark for large language models\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.00613#S2.SS1.p1.1),[Table 1](https://arxiv.org/html/2606.00613#S2.T1.2.5.1.1)\.
- R\. Kuditipudi, J\. Thickstun, T\. Hashimoto, and P\. Liang \(2024\)Robust distortion\-free watermarks for language models\.Transactions on Machine Learning Research\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.00613#S2.SS2.p1.2),[Theorem 1](https://arxiv.org/html/2606.00613#Thmtheorem1.p1.6.6)\.
- S\. Kuno \(1973\)The structure of japanese\.Cambridge: MIT Press\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- H\. N\. Lalai, A\. Anantha Ramakrishnan, R\. S\. Shah, and D\. Lee \(2025\)From intentions to techniques: a comprehensive taxonomy and challenges in text watermarking for large language models\.InFindings of the Association for Computational Linguistics \(NAACL\),Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p1.1)\.
- T\. Lee, S\. Hong, J\. Ahn, I\. Hong, H\. Lee, S\. Yun, J\. Shin, and G\. Kim \(2024\)Who wrote this code? watermarking for code generation\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§2\.1](https://arxiv.org/html/2606.00613#S2.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.00613#S2.T1.2.7.3.1)\.
- C\. N\. Li and S\. A\. Thompson \(1981\)Mandarin chinese: a functional reference grammar\.Univ of California Press\.Cited by:[Appendix H](https://arxiv.org/html/2606.00613#A8.SS0.SSS0.Px2.p5.1),[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- J\. Liang, Z\. Wang, S\. Hong, S\. Ji, and T\. Wang \(2025\)Watermark under fire: a robustness evaluation of LLM watermarking\.InFindings of the Association for Computational Linguistics \(EMNLP\),Cited by:[§2\.3](https://arxiv.org/html/2606.00613#S2.SS3.p1.1)\.
- A\. Liu, L\. Pan, Y\. Lu, J\. Li, X\. Hu, X\. Zhang, L\. Wen, I\. King, H\. Xiong, and P\. Yu \(2024\)A survey of text watermarking in the era of large language models\.ACM Computing Surveys\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p1.1)\.
- Y\. Lu, A\. Liu, D\. Yu, J\. Li, and I\. King \(2024\)An entropy\-based text watermarking detection method\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.00613#S2.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.00613#S2.T1.2.6.2.1)\.
- M\. Marcus, B\. Santorini, and M\. A\. Marcinkiewicz \(1993\)Building a large annotated corpus of english: the penn treebank\.Computational linguistics\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- J\. J\. McCarthy \(1981\)A prosodic theory of nonconcatenative morphology\.Linguistic inquiry\.Cited by:[Appendix H](https://arxiv.org/html/2606.00613#A8.SS0.SSS0.Px2.p6.2),[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- A\. Mohamed and M\. Gubri \(2025\)Is Multilingual LLM Watermarking Truly Multilingual? Scaling Robustness to 100\+ Languages via Back\-Translation\.arXiv preprint arXiv:2510\.18019\.Cited by:[§2\.3](https://arxiv.org/html/2606.00613#S2.SS3.p1.1)\.
- T\. Nguyen, C\. V\. Nguyen, V\. D\. Lai, H\. Man, N\. T\. Ngo, F\. Dernoncourt, R\. A\. Rossi, and T\. H\. Nguyen \(2024\)CulturaX: a cleaned, enormous, and multilingual dataset for large language models in 167 languages\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),Cited by:[§4\.1](https://arxiv.org/html/2606.00613#S4.SS1.p1.8),[§5\.2](https://arxiv.org/html/2606.00613#S5.SS2.p1.1)\.
- S\. Park, H\. Park, H\. An, and Y\. Han \(2026\)A Linguistics\-Aware LLM Watermarking via Syntactic Predictability\.InProceedings of the Annual Meeting of the Association for Computational Linguistics \(ACL\),Note:To appearCited by:[§1](https://arxiv.org/html/2606.00613#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.00613#S2.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.00613#S2.T1.2.9.5.1)\.
- R\. Quirk, S\. Greenbaum, G\. Leech, and J\. Svartvik \(1985\)A comprehensive grammar of the english language\.Longman,London\.Cited by:[Appendix H](https://arxiv.org/html/2606.00613#A8.SS0.SSS0.Px2.p3.1),[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- S\. Rastogi and D\. Pruthi \(2024\)Revisiting the robustness of watermarking to paraphrasing attacks\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§2\.3](https://arxiv.org/html/2606.00613#S2.SS3.p1.1)\.
- K\. C\. Ryding \(2005\)A reference grammar of modern standard arabic\.Cambridge university press\.Cited by:[Appendix H](https://arxiv.org/html/2606.00613#A8.SS0.SSS0.Px2.p6.2),[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- T\. Scialom, P\. Dray, S\. Lamprier, B\. Piwowarski, and J\. Staiano \(2020\)MLSUM: the multilingual summarization corpus\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§5\.2](https://arxiv.org/html/2606.00613#S5.SS2.p1.1)\.
- H\. Sohn \(2001\)The korean language\.Cambridge University Press\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- K\. Takaoka, S\. Hisamoto, N\. Kawahara, M\. Sakamoto, Y\. Uchida, and Y\. Matsumoto \(2018\)Sudachi: a Japanese Tokenizer for Business\.InProceedings of the Eleventh International Conference on Language Resources and Evaluation \(LREC\),Cited by:[Appendix H](https://arxiv.org/html/2606.00613#A8.SS0.SSS0.Px2.p4.1)\.
- N\. Tsujimura \(2013\)An introduction to japanese linguistics\.John Wiley & Sons\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- S\. Tu, Y\. Sun, Y\. Bai, J\. Yu, L\. Hou, and J\. Li \(2024\)WaterBench: towards holistic evaluation of watermarks for large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§2\.3](https://arxiv.org/html/2606.00613#S2.SS3.p1.1)\.
- S\. Vikner \(1995\)Verb movement and expletive subjects in the germanic languages\.Oxford University Press\.Cited by:[Appendix H](https://arxiv.org/html/2606.00613#A8.SS0.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- Z\. Wang, T\. Gu, B\. Wu, and Y\. Yang \(2025\)MorphMark: flexible adaptive watermarking for large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.00613#S2.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.00613#S2.T1.2.8.4.1)\.
- J\. C\. Watson \(2002\)The phonology and morphology of arabic\.OUP Oxford\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- N\. Xue, F\. Xia, F\. Chiou, and M\. Palmer \(2005\)The penn chinese treebank: phrase structure annotation of a large corpus\.Natural language engineering\.Cited by:[§1](https://arxiv.org/html/2606.00613#S1.p3.9)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§5\.1](https://arxiv.org/html/2606.00613#S5.SS1.p1.1)\.
## Appendix AMethod Details
This appendix provides the implementation details omitted from the main text for space. Algorithm [1](https://arxiv.org/html/2606.00613#alg1) gives the deterministic entropy lookup with order backoff. Algorithm [2](https://arxiv.org/html/2606.00613#alg2) gives per-position generation, and Algorithm [3](https://arxiv.org/html/2606.00613#alg3) gives model-free detection.
Algorithm 1 Order-backoff lookup of normalized next-tag entropy.
1: Input: POS context $c_t$; language $L$; lookup tables and thresholds.
2: Output: Normalized next-tag entropy $\lambda \in [0,1]$.
3: for $r = k_{\mathrm{primary}}, k_{\mathrm{primary}}-1, \ldots, 2$ do
4:   $c^{(r)} \leftarrow$ truncate $c_t$ to the last $r-1$ tags
5:   if $c^{(r)} \in \mathcal{T}_L^{(r)}$ and $N_L(c^{(r)}) \geq \nu$ then
6:     return $\lambda_L(c^{(r)})$
7:   end if
8: end for
9: return $\lambda_{\mathrm{def}}$
### A\.1Entropy Lookup with Order Backoff
We summarize the additional notation used by Algorithm [1](https://arxiv.org/html/2606.00613#alg1). For language $L$ and order $r \in \{2, \ldots, k_{\mathrm{primary}}\}$, let $\mathcal{T}_L^{(r)}$ denote the set of length-$(r-1)$ POS contexts observed in the calibration corpus, $N_L(c)$ denote the empirical occurrence count of context $c$ in that corpus, and $\nu$ denote a fixed minimum-count threshold that controls when a stored value is reused. The threshold $\nu$ is shared across orders and languages, and is chosen on the calibration corpus so that stored $\lambda$ values rely only on contexts with stable empirical estimates.
The lookup starts at the primary order $k_{\mathrm{primary}}$ and backs off through lower orders down to order 2. It returns a stored value only when the context exists in $\mathcal{T}_L^{(r)}$ and its empirical frequency reaches the threshold $\nu$. If no supported context appears, it returns $\lambda_{\mathrm{def}} = 0.5$.
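The backoff rule can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released implementation: the dictionary layout (one table and one count map per order, keyed by tag tuples), the function name `lookup_lambda`, and the toy tag contexts and values are all hypothetical.

```python
from typing import Dict, Sequence, Tuple

Context = Tuple[str, ...]
LAMBDA_DEFAULT = 0.5  # fallback value stated in the text


def lookup_lambda(
    tags: Sequence[str],
    tables: Dict[int, Dict[Context, float]],
    counts: Dict[int, Dict[Context, int]],
    k_primary: int,
    min_count: int,
) -> float:
    """Return the stored entropy of the longest supported context, else the default."""
    for order in range(k_primary, 1, -1):  # k_primary, ..., 2
        width = order - 1  # an order-r context holds the last r-1 tags
        if len(tags) < width:
            continue  # prefix too short for this order
        context = tuple(tags[-width:])
        table = tables.get(order, {})
        if context in table and counts.get(order, {}).get(context, 0) >= min_count:
            return table[context]
    return LAMBDA_DEFAULT


# Toy tables (hypothetical values, for illustration only).
tables = {3: {("DET", "NOUN"): 0.8}, 2: {("NOUN",): 0.6, ("VERB",): 0.3}}
counts = {3: {("DET", "NOUN"): 2}, 2: {("NOUN",): 50, ("VERB",): 40}}

# The order-3 context exists but is too rare (2 < 5), so the lookup backs off to order 2.
print(lookup_lambda(["ADJ", "DET", "NOUN"], tables, counts, k_primary=3, min_count=5))
# An unseen context falls through to the default.
print(lookup_lambda(["INTJ"], tables, counts, k_primary=3, min_count=5))
```

Because the function depends only on the tag prefix and fixed tables, generator and detector obtain the same value from the same text.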
Algorithm 2 LUNA generation at position $t$.
1: Input: Prefix $x_{<t}$; distribution $p_t$; keys; $\Phi$; schedule; tagger $\mathcal{A}$; history $\mathcal{H}$.
2: Output: Next token $x_t$.
3: $c_t \leftarrow \mathrm{POSContext}(\mathcal{A}, x_{<t})$
4: $\lambda \leftarrow \mathrm{Lookup}(c_t)$
5: $m_t \leftarrow \mathrm{MapToDepth}(\lambda)$ using Equation [4](https://arxiv.org/html/2606.00613#S4.E4)
6: $h \leftarrow \mathrm{HashContext}(x_{<t})$
7: $r \leftarrow \mathbf{1}\{h \in \mathcal{H}\}$
8: $\mathcal{H} \leftarrow \mathrm{UpdateHistory}(\mathcal{H}, h)$
9: if $r = 1$ then
10:   return $x_t \sim p_t$
11: end if
12: $q^{(0)} \leftarrow p_t$
13: for $\ell = 1$ to $m_t$ do
14:   $G_v^{(\ell)} \leftarrow \Phi(k_\ell, h, v)$ for each $v \in \mathcal{V}$
15:   $\mu^{(\ell)} \leftarrow \sum_{u \in \mathcal{V}} q^{(\ell-1)}(u)\, G_u^{(\ell)}$
16:   $q^{(\ell)}(v) \leftarrow q^{(\ell-1)}(v)\,(1 + G_v^{(\ell)} - \mu^{(\ell)})$
17: end for
18: return $x_t \sim q^{(m_t)}$
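The layered update in Algorithm 2 can be sketched as below. Several pieces are assumptions made only for illustration: `phi` uses HMAC-SHA256 as a stand-in for the keyed function $\Phi$, the history $\mathcal{H}$ is a plain Python set, and the depth $m_t$ is passed in implicitly as the number of layer keys, since the depth mapping (Equation 4) is outside this excerpt.

```python
import hashlib
import hmac
import random


def phi(key: bytes, context_hash: bytes, token_id: int) -> int:
    """Hypothetical keyed bit function standing in for Phi (assumption: HMAC-SHA256)."""
    msg = context_hash + token_id.to_bytes(4, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[0] & 1


def tournament_tilt(p, layer_keys, context_hash):
    """Apply q <- q * (1 + G - mu) once per layer key (lines 12-17 of Algorithm 2)."""
    q = list(p)
    for key in layer_keys:
        g = [phi(key, context_hash, v) for v in range(len(q))]
        mu = sum(qu * gu for qu, gu in zip(q, g))
        q = [qv * (1 + gv - mu) for qv, gv in zip(q, g)]
    return q


def generate_token(p, layer_keys, context_hash, seen_hashes, rng):
    """Sample from p on a repeated context, otherwise from the tilted distribution."""
    repeated = context_hash in seen_hashes
    seen_hashes.add(context_hash)
    q = list(p) if repeated else tournament_tilt(p, layer_keys, context_hash)
    return rng.choices(range(len(q)), weights=q, k=1)[0]


p = [0.5, 0.3, 0.2]
keys = [b"layer-1", b"layer-2", b"layer-3"]  # depth m_t = 3 in this toy call
q = tournament_tilt(p, keys, b"prefix-hash")
print(q, sum(q))  # q stays a probability vector
```

Each layer preserves total mass, since $\sum_v q(v)(1 + G_v - \mu) = 1 + \mu - \mu = 1$, and every factor is non-negative because $\mu \in [0,1]$.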
### A.2 Generation and Detection Algorithms
Algorithm 3 LUNA model-free detection.
1: Input: Text $x = (x_1, \ldots, x_T)$; tokenizer; tagger $\mathcal{A}$; $\lambda$ table; keys; threshold $\gamma$.
2: Output: Decision: watermarked or not.
3: Run $\mathcal{A}$ on the decoded full text and align tag spans to token positions
4: $\mathcal{I} \leftarrow \emptyset$; $\mathcal{H} \leftarrow \emptyset$
5: for $t = 1$ to $T$ do
6:   $h \leftarrow \mathrm{HashContext}(x_{<t})$
7:   $r \leftarrow \mathbf{1}\{h \in \mathcal{H}\}$
8:   $\mathcal{H} \leftarrow \mathrm{UpdateHistory}(\mathcal{H}, h)$
9:   if $x_t$ is EOS or $r = 1$ then
10:     continue
11:   end if
12:   Recover $c_t$ before position $t$
13:   $\lambda \leftarrow \mathrm{Lookup}(c_t)$
14:   $m_t \leftarrow \mathrm{MapToDepth}(\lambda)$
15:   $\omega_t \leftarrow \lambda$
16:   $S_t \leftarrow \sum_{\ell=1}^{m_t} \left(G_{t,x_t}^{(\ell)} - 1/2\right)$
17:   $\mathcal{I} \leftarrow \mathcal{I} \cup \{t\}$
18: end for
19: Compute $Z$ with Equation [9](https://arxiv.org/html/2606.00613#S4.E9)
20: return $\mathbf{1}\{Z > \gamma\}$
### A.3 Proof of Theorem [1](https://arxiv.org/html/2606.00613#Thmtheorem1)
Condition on the prefix $x_{<t}$. The POS reconstruction returns $c_t$, the lookup returns $\lambda(c_t)$, and the schedule fixes $m_t$ before token $x_t$ is sampled. The current step therefore reduces to fixed-depth binary tournament sampling with depth $m_t$. Let $q^{(0)} = p_t$. For layer $\ell$, condition on previous layers, so $q^{(\ell-1)}$ is fixed. Under the random-key model, the binary values $G_v^{(\ell)}$ are independent of $q^{(\ell-1)}$ and satisfy $\mathbb{E}[G_v^{(\ell)}] = 1/2$. By linearity, $\mathbb{E}[\mu^{(\ell)} \mid q^{(\ell-1)}] = \sum_{u \in \mathcal{V}} q^{(\ell-1)}(u) \cdot \tfrac{1}{2} = \tfrac{1}{2}$, so

$$\mathbb{E}_{G^{(\ell)}}\left[q^{(\ell)}(v) \mid q^{(\ell-1)}\right] = q^{(\ell-1)}(v)\left(1 + \frac{1}{2} - \frac{1}{2}\right) = q^{(\ell-1)}(v).$$

Iterating across the active layers and applying the tower property of conditional expectation yields $\mathbb{E}_{G}[q^{(m_t)}(v)] = p_t(v)$. Since $x_t$ is drawn from $q^{(m_t)}$ conditional on $G$, $\Pr_{\mathrm{LUNA}}(x_t = v \mid x_{<t}, G) = q^{(m_t)}(v)$, and taking expectation over $G$ gives $\mathbb{E}_{G}[\Pr_{\mathrm{LUNA}}(x_t = v \mid x_{<t}, G)] = p_t(v)$. Prefix measurability ensures that $m_t$ does not depend on the current sampled token, so it remains fixed throughout this argument.
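The expectation argument can be checked exactly on a toy vocabulary by enumerating every bit pattern at every layer with rational arithmetic. This is a sanity check on a small example under the uniform random-key assumption, not a substitute for the proof; the helper names `tilt` and `expected_after_layers` are illustrative.

```python
from fractions import Fraction
from itertools import product


def tilt(q, g):
    """One tournament layer: q(v) <- q(v) * (1 + G_v - mu)."""
    mu = sum(qu * gu for qu, gu in zip(q, g))
    return [qv * (1 + gv - mu) for qv, gv in zip(q, g)]


def expected_after_layers(p, depth):
    """Average the final distribution over all equally likely bit patterns per layer."""
    patterns = list(product((0, 1), repeat=len(p)))
    weight = Fraction(1, len(patterns)) ** depth
    total = [Fraction(0)] * len(p)
    for layers in product(patterns, repeat=depth):
        q = list(p)
        for g in layers:
            q = tilt(q, g)
        total = [t + weight * qv for t, qv in zip(total, q)]
    return total


p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]

# Averaged over random keys, the single-token marginal equals p exactly.
assert expected_after_layers(p, depth=2) == p

# For one fixed bit pattern the distribution is tilted: here [3/4, 3/20, 1/10].
print(tilt(p, (1, 0, 0)))
```

The second call illustrates the scope of the guarantee: the marginal matches $p_t$ only in expectation over keys, while any single fixed key yields a tilted distribution.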
Table 6: POS backends and tagsets used by LUNA. These choices match the tagsets used to build the corresponding $\lambda$ tables.
## Appendix BExperimental Setting Details
### B\.1Language Typology, Models, and POS Pipelines
Table [6](https://arxiv.org/html/2606.00613#A1.T6) lists the POS backend and tagset used at entropy estimation, generation, and detection time. For every language, the same tagger and tagset are used across these three stages.
Table [7](https://arxiv.org/html/2606.00613#A2.T7) lists the full generation-model identifiers used in the experiments.
Table 7:Generation\-model identifiers\.
### B.2 Budget Matching for the SynthID-Text Baseline
The SynthID-Text baseline uses the same binary tournament update as LUNA and matches the expected tournament budget induced by the LUNA depth ladder. This budget matching removes the linguistic signal from the comparison: the depth is derived from a prefix hash and a salt rather than from $\lambda(c_t)$, and detection uses uniform weights $\omega_t = 1$. The schedule chooses between adjacent depths $m_{\mathrm{floor}} = \lfloor \log_2 B \rfloor$ and $m_{\mathrm{ceil}} = \lceil \log_2 B \rceil$ so that

$$\mathbb{E}[2^{m_t}] = (1 - p_{\mathrm{ceil}})\, 2^{m_{\mathrm{floor}}} + p_{\mathrm{ceil}}\, 2^{m_{\mathrm{ceil}}} = B.$$

If $m_{\mathrm{floor}} = m_{\mathrm{ceil}}$, the schedule uses that depth deterministically. Otherwise,

$$p_{\mathrm{ceil}} = \frac{B - 2^{m_{\mathrm{floor}}}}{2^{m_{\mathrm{ceil}}} - 2^{m_{\mathrm{floor}}}}. \tag{10}$$

When calibration supplies language-specific tier proportions $(p_{\mathrm{low}}, p_{\mathrm{mid}}, p_{\mathrm{high}})$, the matched budget is

$$B = p_{\mathrm{low}}\, 2^{m_{\min}} + p_{\mathrm{mid}}\, 2^{m_{\mathrm{mid}}} + p_{\mathrm{high}}\, 2^{m_{\max}}. \tag{11}$$

At nominal proportions $(0.25, 0.5, 0.25)$ and ladder $(5, 15, 30)$, this formula gives $B_0 = 268{,}451{,}848$.
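The arithmetic behind Equations (10) and (11) can be reproduced directly. The sketch below recomputes the nominal budget and the resulting mixing probability; the function names are illustrative and not taken from the released code.

```python
import math


def matched_budget(proportions, ladder):
    """Equation (11): expected 2^m under the tier proportions and depth ladder."""
    return sum(p * 2 ** m for p, m in zip(proportions, ladder))


def ceil_probability(budget):
    """Equation (10): probability of the deeper of the two adjacent depths."""
    m_floor = math.floor(math.log2(budget))
    m_ceil = math.ceil(math.log2(budget))
    if m_floor == m_ceil:
        return m_floor, m_ceil, 0.0  # budget is a power of two: single deterministic depth
    p_ceil = (budget - 2 ** m_floor) / (2 ** m_ceil - 2 ** m_floor)
    return m_floor, m_ceil, p_ceil


B0 = matched_budget((0.25, 0.5, 0.25), (5, 15, 30))
m_floor, m_ceil, p_ceil = ceil_probability(B0)
print(B0)               # 268451848.0 = 8 + 16384 + 268435456
print(m_floor, m_ceil)  # 28 29
print(p_ceil)           # (B0 - 2**28) / 2**28, roughly 6.1e-05
```

At the nominal setting the budget sits just above $2^{28}$, so the matched schedule uses depth 28 at almost every position and depth 29 only rarely.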
Table 8:Watermark\-specific settings used in the primary comparison\.
### B\.3Watermark Baselines and Hyperparameters
Table [8](https://arxiv.org/html/2606.00613#A2.T8) summarizes the main watermark-specific settings. The KGW-family baselines follow the MarkLLM implementations used in the experiments. The SynthID-Text row uses the SynthID-Text binary tournament backbone and applies the budget-matching procedure in Appendix [B.2](https://arxiv.org/html/2606.00613#A2.SS2) for fair comparison with LUNA.
### B\.4Calibration Details
The context order $k$ is selected from $\{2, 3, 4\}$ separately for LUNA and STELA in each language-by-domain setting. For each algorithm, we choose the $k$ that minimizes $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ on the watermarked outputs. The selected $k$ values for LUNA and STELA are shown in Table [9](https://arxiv.org/html/2606.00613#A2.T9); the two methods agree on the selected order in 4 of the 12 settings.
Table 9: Per-algorithm selected POS context order $k$ for the linguistic methods. Selection minimizes $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ within each language-by-domain setting. LUNA prefers $k \in \{3, 4\}$ in 10 of 12 settings, while STELA prefers $k = 2$ in 7 of 12. The per-$k$ comparison appears in Appendix [G](https://arxiv.org/html/2606.00613#A7).
## Appendix CBootstrap Confidence Intervals for Quality Metrics
Section [6.1](https://arxiv.org/html/2606.00613#S6.SS1) reports the bootstrap 95% confidence interval for the quality metric $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ in Table [4](https://arxiv.org/html/2606.00613#S5.T4). This appendix lists the bootstrap intervals for the four remaining quality metrics under the same protocol: 1000 iterations, resampling the 12 language-by-domain settings with replacement, seed 42. For LUNA and STELA, the intervals use the per-algorithm per-setting best context order from Table [9](https://arxiv.org/html/2606.00613#A2.T9).
Tables [10](https://arxiv.org/html/2606.00613#A3.T10) and [11](https://arxiv.org/html/2606.00613#A3.T11) group the four remaining quality metrics by the aspect of distortion they capture: lexical structure and next-token-distribution statistics.
Table 10: Bootstrap 95% confidence intervals for the lexical-structure quality metrics. LUNA's upper bound lies strictly below the lower bound of six of eight baselines on each metric; MorphMark and SynthID-Text are the two methods whose intervals overlap LUNA's on both metrics.
Table 11: Bootstrap 95% confidence intervals for the next-token-distribution quality metrics. LUNA's upper bound lies strictly below the lower bound of every baseline on $|\Delta\mathrm{Surprisal}|$ and of seven of eight baselines on $|\Delta\mathrm{Entropy}|$, with MorphMark the only overlap on the latter.
## Appendix DDesign and Analysis of SynthID\-Text\-Entropy
This appendix defines the SynthID-Text-Entropy baseline used in Section [6.2](https://arxiv.org/html/2606.00613#S6.SS2). This variant is not a previously published watermark. It is a diagnostic baseline that asks whether a model-derived entropy signal can replace the external linguistic signal used by LUNA.
### D\.1Design Rationale
LUNA combines three ingredients: a non\-distortionary SynthID\-Text tournament backbone, a prefix\-measurable adaptive schedule derived from POS\-context entropy, and a detector that reconstructs the same linguistic schedule without language\-model forward passes\. STELA tests the value of replacing a distortionary linguistic watermark with a non\-distortionary tournament backbone\. SynthID\-Text tests the value of adding a linguistic schedule to a tournament sampler\. SynthID\-Text\-Entropy tests a third question: whether model\-side entropy can play the role that POS\-context entropy plays in LUNA\.
SynthID\-Text\-Entropy keeps the SynthID\-Text tournament family and the budget\-matching procedure used for the SynthID\-Text baseline\. It replaces the external linguistic signal with language\-model entropy in the adaptive detector\. This choice creates a strong model\-aware comparison point\. It also removes model\-free detection, since the verifier must run a language model to obtain per\-token entropy values\.
### D.2 Budget Matching with LUNA
We match the expected tournament budget of SynthID-Text-Entropy to LUNA using the same procedure as Appendix [B.2](https://arxiv.org/html/2606.00613#A2.SS2). Let $B=\mathbb{E}[2^{m_t}]$ denote the expected tournament budget induced by the LUNA depth ladder under the calibration proportions $(p_{\mathrm{low}},p_{\mathrm{mid}},p_{\mathrm{high}})$. The SynthID-Text-Entropy configuration uses the corresponding budget-matched SynthID-Text tournament schedule, so the comparison is not driven by a larger average tournament budget.
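To make the budget definition concrete, the following minimal sketch computes $B=\mathbb{E}[2^{m_t}]$ for a three-tier depth ladder. The depths and proportions are hypothetical placeholders chosen for illustration, not the values used in the paper, and `expected_budget` is an illustrative helper name.

```python
from typing import Mapping


def expected_budget(depths: Mapping[str, int], proportions: Mapping[str, float]) -> float:
    """Return B = E[2^{m_t}] for a tiered depth ladder.

    depths:      tournament depth m assigned to each tier
    proportions: probability that a position falls in each tier
    """
    total = sum(proportions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("tier proportions must sum to 1")
    return sum(proportions[tier] * 2 ** depths[tier] for tier in depths)


# Hypothetical ladder and calibration proportions (illustration only).
depths = {"low": 1, "mid": 2, "high": 3}
proportions = {"low": 0.25, "mid": 0.50, "high": 0.25}

B = expected_budget(depths, proportions)
print(B)  # 0.25*2 + 0.50*4 + 0.25*8 = 4.5
```

Under the usual reading that a depth-$m$ binary tournament draws $2^{m}$ candidates, two schedules with equal $B$ spend the same expected number of candidates per token. The actual matching procedure is the one in Appendix B.2 and is not reproduced here.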
### D.3 Comparison and Practical Implications
At the $12$-setting mean, SynthID-Text-Entropy attains AUROC $0.9960$ and $|\Delta\mathrm{PPL}_{\mathrm{med}}|=0.0787$, while LUNA attains AUROC $0.9959$ and $|\Delta\mathrm{PPL}_{\mathrm{med}}|=0.0447$. Detection is effectively indistinguishable at this aggregate level: the AUROC gap is $0.0001$ in favor of SynthID-Text-Entropy, and the TPR@5% gap is $0.0007$. On quality preservation, LUNA improves four of the five reported quality metrics by factors of $1.59\times$ to $1.76\times$, while SynthID-Text-Entropy is $1.09\times$ better on $|\Delta\mathrm{Distinct\text{-}1}|$.
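As a quick check on the arithmetic, the two perplexity-shift means and the two AUROC values reported above give the ratio and gap below. That this ratio corresponds to the upper end of the reported $1.59\times$ to $1.76\times$ range is an inference from the numbers, not a statement made in the text.

$$
\frac{|\Delta\mathrm{PPL}_{\mathrm{med}}|_{\text{SynthID-Text-Entropy}}}{|\Delta\mathrm{PPL}_{\mathrm{med}}|_{\text{LUNA}}}
=\frac{0.0787}{0.0447}\approx 1.76,
\qquad
0.9960-0.9959=0.0001 .
$$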
The comparison clarifies the deployment trade\-off\. Model entropy supplies a powerful adaptive signal, yet it requires language\-model forward passes at verification time\. This dependence creates serving cost, version coupling, and weaker third\-party verifiability when the generator or an appropriate surrogate is not available\. LUNA reaches the same detection regime without this dependence and preserves quality better on most reported metrics\.
## Appendix E Detection-Quality Trade-off
This appendix visualizes the per\-setting structure that underlies the aggregate detection and quality results in Section[6\.1](https://arxiv.org/html/2606.00613#S6.SS1)\. We characterize the trade\-off space through three complementary views: the Pareto frontier \([E\.1](https://arxiv.org/html/2606.00613#A5.SS1)\), the per\-setting sweet\-spot distribution \([E\.2](https://arxiv.org/html/2606.00613#A5.SS2)\), and the per\-baseline multi\-metric quality advantage \([E\.3](https://arxiv.org/html/2606.00613#A5.SS3)\)\.
### E.1 Pareto Frontier of the Detection–Quality Trade-off
Figure [3](https://arxiv.org/html/2606.00613#A5.F3) plots the $12$-setting mean of AUROC against $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ on a logarithmic horizontal axis. Four of the nine methods are Pareto-optimal: EWD, SWEET, SynthID-Text, and LUNA; the remaining five (KGW, STELA, MorphMark, EXP, GumbelSoft) are dominated by some method that achieves both better detection and lower distortion.
Figure 3: Pareto frontier of the detection-quality trade-off, with AUROC on the vertical axis and $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ on the horizontal axis (log scale), averaged over the $12$ language-by-domain settings. Four of nine methods are Pareto-optimal (filled markers, connected by the frontier); the other five are dominated (gray markers). The shaded sweet-spot region in the upper-left corner marks AUROC $>0.99$ and shift $<0.1$; LUNA is the only method that enters it.
LUNA occupies the left endpoint of the Pareto front. The nearest Pareto neighbor, SynthID-Text, sits at $|\Delta\mathrm{PPL}_{\mathrm{med}}|=0.463$ with AUROC $0.9972$; moving to LUNA reduces $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ by a factor of $10.4\times$ at an AUROC cost of $0.0013$. The shaded sweet-spot region marks the operating regime where AUROC $>0.99$ and $|\Delta\mathrm{PPL}_{\mathrm{med}}|<0.1$; LUNA is the only Pareto-optimal method inside this region, and the only method of the nine to enter it at the $12$-setting mean.
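The dominance relation behind the frontier can be sketched as follows. This is a minimal illustration using the standard weak-dominance definition (at least as good on both axes, strictly better on one), which may differ in tie handling from the rule used for the figure. The LUNA and SynthID-Text points are the $12$-setting means quoted above; the two `hypothetical_*` points are invented to exercise the function and do not correspond to any baseline.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (AUROC, |delta PPL_med|)


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one.

    Higher AUROC is better; lower perplexity shift is better.
    """
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])


def pareto_optimal(points: Dict[str, Point]) -> List[str]:
    """Return the names of all points that no other point dominates."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


points = {
    "LUNA": (0.9959, 0.0447),               # reported 12-setting mean
    "SynthID-Text": (0.9972, 0.463),        # reported 12-setting mean
    "hypothetical_dominated": (0.9950, 0.80),   # invented for illustration
    "hypothetical_high_auroc": (0.9990, 1.50),  # invented for illustration
}

print(pareto_optimal(points))
# ['LUNA', 'SynthID-Text', 'hypothetical_high_auroc']

# Distortion reduction when moving from SynthID-Text to LUNA.
print(f"{0.463 / 0.0447:.1f}x")  # 10.4x
```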
### E.2 Per-Setting Sweet-Spot Distribution
The aggregate sweet-spot finding holds at the per-setting level. Figure [4](https://arxiv.org/html/2606.00613#A5.F4) colors each (method, language-domain) cell by $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ on a logarithmic scale; green circles mark the cells that jointly satisfy AUROC $>0.99$ and $|\Delta\mathrm{PPL}_{\mathrm{med}}|<0.1$. LUNA reaches the sweet spot in $9$ of $12$ settings.
Figure 4: Per-setting $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ on a logarithmic scale for all nine methods across the $12$ language-by-domain settings. Rows are ordered by the number of sweet-spot cells (green circles, marking AUROC $>0.99$ and shift $<0.1$). LUNA enters the sweet spot in $9$ of $12$ settings; the next-best baseline (MorphMark) enters it in $2$ of $12$ settings.
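The sweet-spot criterion is a simple conjunction of two thresholds, and counting cells per method reduces to the sketch below. The two single-point checks use the $12$-setting means reported in Appendix E.1; the list `cells` holds hypothetical per-setting values, since the per-setting numbers are not reproduced in this appendix.

```python
from typing import Iterable, Tuple


def in_sweet_spot(auroc: float, ppl_shift: float,
                  min_auroc: float = 0.99, max_shift: float = 0.1) -> bool:
    """True when AUROC > min_auroc and |ppl_shift| < max_shift (both strict)."""
    return auroc > min_auroc and abs(ppl_shift) < max_shift


def count_sweet_spot(cells: Iterable[Tuple[float, float]]) -> int:
    """Count the (AUROC, shift) cells that satisfy both thresholds."""
    return sum(in_sweet_spot(auroc, shift) for auroc, shift in cells)


# 12-setting means quoted in the text.
print(in_sweet_spot(0.9959, 0.0447))  # LUNA -> True
print(in_sweet_spot(0.9972, 0.463))   # SynthID-Text -> False

# Hypothetical per-setting cells for one method (illustration only).
cells = [(0.998, 0.03), (0.995, 0.12), (0.989, 0.05), (0.999, -0.08)]
print(count_sweet_spot(cells))  # 2
```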
### E.3 Per-Baseline Multi-Metric Quality Advantage
Figure [5](https://arxiv.org/html/2606.00613#A5.F5) extends the comparison from the single $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ axis to all five quality metrics simultaneously. Each cell reports the ratio of the baseline’s mean distortion to LUNA’s; the rightmost column reports the geometric mean across the five metrics. LUNA’s geometric-mean advantage over the closest baseline (MorphMark) is $3.5\times$, and the advantage exceeds an order of magnitude against KGW, STELA, EWD, GumbelSoft, and EXP. The largest single-metric ratios are observed for $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ and $|\Delta\mathrm{Surprisal}|$, both of which are governed by the realized next-token probability under the reference model. The lexical-structure metrics ($|\Delta\textsc{Self-BLEU}|$, $|\Delta\mathrm{Distinct\text{-}1}|$) also show consistent positive gains at smaller magnitudes, suggesting that the quality preservation is robust across diverse forms of distortion rather than attributable to improvement on a single metric.
Figure 5: Per-baseline multi-metric quality advantage of LUNA, computed as the ratio between each baseline’s $12$-setting mean and LUNA’s on the five quality metrics. Cells with ratio $>1$ indicate that LUNA changes the metric by a smaller amount. The rightmost column reports the geometric mean across the five metrics. LUNA holds an advantage on every (baseline, metric) cell of the table across the eight main baselines. The SynthID-Text-Entropy ablation in Section [6.2](https://arxiv.org/html/2606.00613#S6.SS2) is not shown here and is the one comparison in which LUNA does not dominate on every metric.
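One row of such a ratio table can be computed as in the sketch below, which uses Python's standard `statistics.geometric_mean`. The metric keys and all numeric values are hypothetical placeholders picked so the ratios are easy to verify by hand; they are not results from the paper.

```python
from statistics import geometric_mean
from typing import Dict

# Illustrative keys for the five quality metrics.
METRICS = ("ppl_med", "self_bleu", "distinct_1", "surprisal", "entropy")


def advantage_row(baseline: Dict[str, float], luna: Dict[str, float]) -> Dict[str, float]:
    """Per-metric ratio baseline/LUNA, plus the geometric mean of the ratios.

    A ratio above 1 means LUNA shifts that metric by a smaller amount.
    """
    ratios = {m: baseline[m] / luna[m] for m in METRICS}
    ratios["geo_mean"] = geometric_mean(list(ratios.values()))
    return ratios


# Hypothetical mean absolute shifts (illustration only).
luna = {"ppl_med": 0.05, "self_bleu": 0.01, "distinct_1": 0.02,
        "surprisal": 0.04, "entropy": 0.02}
baseline = {"ppl_med": 0.40, "self_bleu": 0.02, "distinct_1": 0.04,
            "surprisal": 0.16, "entropy": 0.04}

row = advantage_row(baseline, luna)
# Ratios are 8, 2, 2, 4, 2, so the geometric mean is 256 ** (1 / 5).
print(f"{row['geo_mean']:.2f}x")  # 3.03x
```

The geometric mean is the natural aggregate for ratios because it treats a $2\times$ gain and a $2\times$ loss symmetrically, whereas an arithmetic mean would be dominated by the largest single-metric ratio.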
## Appendix F Behavior Across Experimental Axes
This appendix examines whether the aggregate behavior in Section[6](https://arxiv.org/html/2606.00613#S6)is uniform across the experimental axes\. We report per\-language ranks \([F\.1](https://arxiv.org/html/2606.00613#A6.SS1)\) and per\-domain ranks \([F\.2](https://arxiv.org/html/2606.00613#A6.SS2)\)\.
### F.1 Per-Language Behavior
Table [12](https://arxiv.org/html/2606.00613#A6.T12) reports LUNA’s rank among the nine methods on each of the seven metrics, separately for each language. Ranks are computed on the per-language mean over Wikipedia and news. LUNA holds rank $1$ on $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ and $|\Delta\mathrm{Surprisal}|$ in all six languages, on $|\Delta\textsc{Self-BLEU}|$ and $|\Delta\mathrm{Entropy}|$ in five of six, and on $|\Delta\mathrm{Distinct\text{-}1}|$ in three of six. The quality advantage is consistently preserved across typologically diverse languages, including analytic English, isolating Chinese, agglutinative Korean and Japanese, fusional German, and Semitic-templatic Arabic. Detection ranks range from $3$ to $8$, while remaining entirely within the saturated AUROC regime identified in Section [6.1](https://arxiv.org/html/2606.00613#S6.SS1).
Table 12: LUNA’s rank out of $9$ methods on each metric, per language, computed on the per-language mean over Wikipedia and news. Bold entries indicate rank $1$.
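The rank computation described above (average the two domains, then rank the methods) can be sketched as follows. The method names `baseline_a` and `baseline_b` and all values are hypothetical, and ties are broken by input order here, which is an assumption since the tie rule is not stated.

```python
from statistics import mean
from typing import Dict


def rank_of(method: str, per_domain: Dict[str, Dict[str, float]],
            higher_is_better: bool = False) -> int:
    """1-based rank of `method` after averaging each method's scores over domains.

    Shift metrics rank ascending (smaller is better); AUROC ranks descending.
    """
    means = {name: mean(scores.values()) for name, scores in per_domain.items()}
    ordered = sorted(means, key=means.get, reverse=higher_is_better)
    return ordered.index(method) + 1


# Hypothetical per-domain values for one language (illustration only).
ppl_shift = {
    "LUNA": {"wikipedia": 0.04, "news": 0.06},
    "baseline_a": {"wikipedia": 0.30, "news": 0.50},
    "baseline_b": {"wikipedia": 0.10, "news": 0.08},
}
auroc = {
    "LUNA": {"wikipedia": 0.995, "news": 0.996},
    "baseline_a": {"wikipedia": 0.999, "news": 0.998},
    "baseline_b": {"wikipedia": 0.990, "news": 0.991},
}

print(rank_of("LUNA", ppl_shift))                    # 1
print(rank_of("LUNA", auroc, higher_is_better=True)) # 2
```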
### F.2 Per-Domain Behavior
Appendix [F.1](https://arxiv.org/html/2606.00613#A6.SS1) reports per-language ranks. This subsection reports the same rank summary for the two text domains (Table [13](https://arxiv.org/html/2606.00613#A6.T13)). LUNA is rank $1$ on $|\Delta\mathrm{PPL}_{\mathrm{med}}|$, $|\Delta\mathrm{Surprisal}|$, and $|\Delta\mathrm{Entropy}|$ in both Wikipedia and news; on the two lexical-structure metrics it alternates between rank $1$ and rank $2$ across domains. Detection rank is $6$ of $9$ in both domains, inside the saturated AUROC band. The trade-off profile is symmetric across the two domains.
Table 13: LUNA’s rank out of $9$ methods on each metric, per domain, computed on the $6$-language mean within each domain. Bold entries indicate rank $1$.
## Appendix G Context-Order Analysis
This appendix examines the POS context-order hyperparameter $k$ for the two linguistic methods, LUNA and STELA. Both methods consume the same corpus-estimated POS-context signal, yet they use it in different sampling mechanisms. LUNA turns $\lambda(c_t)$ into tournament depth, whereas STELA turns the same signal into green-list bias and detector weights. We therefore restrict this analysis to LUNA and STELA, since the goal is to understand how linguistic context length interacts with the two linguistic-signal mechanisms.
### G.1 $k$-Stratified Comparison
Tables [14](https://arxiv.org/html/2606.00613#A7.T14)–[16](https://arxiv.org/html/2606.00613#A7.T16) report the fixed-$k$ means for LUNA and STELA at $k\in\{2,3,4\}$. The comparison shows that LUNA preserves the quality advantage across context orders, while the optimal order varies by method and setting. This pattern motivates the setting-level selection rule in Table [9](https://arxiv.org/html/2606.00613#A2.T9).
Table 14: Fixed-order comparison between LUNA and STELA at $k=2$, averaged over the $12$ language-by-domain settings.
Table 15: Fixed-order comparison between LUNA and STELA at $k=3$, averaged over the $12$ language-by-domain settings.
Table 16: Fixed-order comparison between LUNA and STELA at $k=4$, averaged over the $12$ language-by-domain settings.
### G.2 Context-Order Selection Patterns
The setting-level selections in Table [9](https://arxiv.org/html/2606.00613#A2.T9) show that LUNA and STELA prefer different context lengths. LUNA selects $k\in\{3,4\}$ in $10$ of $12$ settings, whereas STELA selects $k=2$ in $7$ of $12$ settings. The two methods agree on the selected $k$ in only $4$ of $12$ settings. This difference suggests that the same linguistic signal interacts differently with the sampling mechanism. LUNA can exploit longer POS contexts through depth modulation, while STELA often prefers shorter contexts when the signal drives a distortionary green-list bias.
## Appendix H Linguistic Behavior of $\lambda$ Across Languages
This appendix expands the linguistic intuition behind the normalized next-tag entropy $\lambda(c)$. We describe the kind of POS context that LUNA tends to mark as low or high $\lambda$ in each language, and we report the spread $\tau_2-\tau_1$ measured on the calibration corpus at the selected primary order from Table [9](https://arxiv.org/html/2606.00613#A2.T9). The spread is the gap between the frequency-weighted 25th and 75th percentiles of $\lambda$ in that language; it summarizes how widely $\lambda$ varies across positions, and therefore how often LUNA chooses the deepest tier rather than the shallowest.
#### Why the spread of $\lambda$ matters.
LUNA applies the deep tournament tier only at positions whose $\lambda$ value exceeds $\tau_2$. A wider spread therefore means that the deep tier is reserved for positions that are genuinely more uncertain than typical positions in the same language, rather than being applied uniformly. A narrow spread means that most positions sit close to a common $\lambda$ value and the three-tier schedule collapses toward a near-uniform depth assignment. The spread is a property of the language and its tagger, not of the watermark; the watermark only consumes this signal.
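The quantities discussed here can be sketched in a few lines. The sketch makes three assumptions that the appendix does not spell out: $\lambda$ is normalized by the log of the tagset size, the weighted percentile is the smallest value whose cumulative frequency reaches the target fraction, and positions with $\lambda\le\tau_1$ fall in the shallow tier. All function names and the toy inputs are illustrative.

```python
import math
from collections import Counter
from typing import Sequence


def normalized_entropy(next_tag_counts: Counter, tagset_size: int) -> float:
    """Shannon entropy of the next-tag distribution, divided by log(tagset_size)."""
    total = sum(next_tag_counts.values())
    probs = [c / total for c in next_tag_counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(tagset_size)


def weighted_quantile(values: Sequence[float], weights: Sequence[float], q: float) -> float:
    """Smallest value whose cumulative weight reaches fraction q of the total."""
    pairs = sorted(zip(values, weights))
    threshold = q * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return pairs[-1][0]


def tier(lam: float, tau1: float, tau2: float) -> str:
    """Map a lambda value to a depth tier; the deep tier requires lam > tau2."""
    if lam > tau2:
        return "high"
    if lam <= tau1:
        return "low"
    return "mid"


# A context followed equally often by two tags has maximal entropy for a
# two-tag tagset.
print(normalized_entropy(Counter({"NOUN": 1, "VERB": 1}), tagset_size=2))  # 1.0

# Toy lambda values per POS context, with hypothetical context frequencies.
lambdas = [0.1, 0.5, 0.9]
freqs = [1, 2, 1]
tau1 = weighted_quantile(lambdas, freqs, 0.25)
tau2 = weighted_quantile(lambdas, freqs, 0.75)
spread = tau2 - tau1
print([tier(x, tau1, tau2) for x in lambdas])  # ['low', 'mid', 'high']
```

Nothing in this computation touches a language model, which is why a detector holding only the text, a tagger, and the calibration statistics can recompute the same tier at every position.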
#### Per\-language behavior\.
Korean. Korean is agglutinative with overt particles and verbal endings, and the Sejong tagset distinguishes nominal markers, case markers, and verb-ending morphemes. A POS context that ends with a topic marker can be followed by many tag types depending on whether the sentence continues with a verb phrase, a coordinated clause, or an embedded clause. A POS context that ends with a clausal final ending is far more constrained. The two regimes are reflected in a wide $\lambda$ spread.
German. German shows fusional case-and-number agreement and verb-second main-clause syntax (Haider, [2010](https://arxiv.org/html/2606.00613#bib.bib113); Vikner, [1995](https://arxiv.org/html/2606.00613#bib.bib112)). The position immediately after a fronted constituent in a main clause is fixed to a finite verb. The position after a finite verb is much more open, since it can host a subject, an adverb, or a nominal complement depending on the construction. This contrast between syntactically constrained verb-second positions and freer post-verb positions yields a wide $\lambda$ spread, close to Korean.
English. English has light inflection and rigid SVO word order (Quirk et al., [1985](https://arxiv.org/html/2606.00613#bib.bib104)). Determiner-adjective contexts almost always continue with a noun, while preposition-noun contexts can be followed by several functional categories. The result is moderate spread.
Japanese. Japanese is agglutinative like Korean, yet writing mixes hiragana, katakana, and kanji, and SudachiPy splits compound nouns into morphemes (Takaoka et al., [2018](https://arxiv.org/html/2606.00613#bib.bib9)). This segmentation flattens distinctions among many nominal contexts, so $\lambda$ varies less across positions than in Korean despite a comparable underlying morphology.
Chinese. Mandarin Chinese is isolating and uses few overt grammatical markers (Li and Thompson, [1981](https://arxiv.org/html/2606.00613#bib.bib79)). Most POS contexts allow a similar set of continuations, dominated by nouns and verbs, so $\lambda$ remains close to its language-level mean.
Arabic. Arabic combines templatic root-and-pattern morphology with rich agreement and an abjad script (McCarthy, [1981](https://arxiv.org/html/2606.00613#bib.bib117); Ryding, [2005](https://arxiv.org/html/2606.00613#bib.bib119)). The CAMeL Tools tagger emits fine-grained tags that already encode much of this morphological information, so consecutive tags carry a high mutual constraint. Combined with the small selected order $k=2$, the resulting $\lambda$ distribution is comparatively flat.
#### Measured spread\.
Table [17](https://arxiv.org/html/2606.00613#A8.T17) reports $\tau_1$, $\tau_2$, and the spread $\tau_2-\tau_1$ averaged over the Wikipedia and news calibration corpora at the selected primary order. The order from widest to narrowest spread is Korean, German, English, Japanese, Chinese, Arabic. This ordering matches the per-language narrative above and supports the interpretation of $\lambda$ as a linguistic capacity signal.
Table 17: Frequency-weighted 25th and 75th percentile thresholds of $\lambda$ for LUNA at the selected primary order from Table [9](https://arxiv.org/html/2606.00613#A2.T9), averaged over Wikipedia and news. The spread $\tau_2-\tau_1$ summarizes how widely the linguistic capacity signal varies across positions in each language.