Linguistics-Aware Non-Distortionary LLM Watermarking

arXiv cs.CL Papers

Summary

Introduces LUNA, a linguistics-aware LLM watermarking method that achieves non-distortionary embedding and model-free detection across multiple languages, significantly improving AUROC and perplexity preservation.

arXiv:2606.00613v1 Announce Type: new Abstract: Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model-free detection with single-token non-distortion under the standard random-key model. LUNA estimates normalized next-tag entropy from part-of-speech contexts in an external corpus and uses it to set the depth of a non-distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains an AUROC of 0.9959 and the lowest mean absolute median perplexity shift of 0.045 across the twelve settings; its 95% bootstrap interval [0.022, 0.073] lies below all baseline intervals. LUNA also records the lowest mean Self-BLEU, Distinct-1, surprisal, and entropy shifts. It is the only method that simultaneously achieves AUROC > 0.99 and an absolute median perplexity shift below 0.1 in a majority of settings, reaching this regime in 9 of the 12 settings while no baseline reaches it in more than 2. Our code is available at: https://github.com/Shinwoo-Park/luna_watermark

Cached at: 06/02/26, 03:38 PM

# Linguistics-Aware Non-Distortionary LLM Watermarking
Source: [https://arxiv.org/html/2606.00613](https://arxiv.org/html/2606.00613)
Shinwoo Park¹, Hyejin Park², Hyeseon An¹, Yo-Sub Han¹†

¹Yonsei University, Seoul, Republic of Korea, {pshkhh, hsan, emmous}@yonsei.ac.kr

²Rensselaer Polytechnic Institute, Troy, NY, USA, parkh12@rpi.edu

###### Abstract

Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model-free detection with single-token non-distortion under the standard random-key model. LUNA estimates normalized next-tag entropy from part-of-speech contexts in an external corpus and uses it to set the depth of a non-distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains AUROC $0.9959$ and the lowest mean absolute median perplexity shift, $0.045$, across the twelve settings; its $95\%$ bootstrap interval $[0.022, 0.073]$ lies below all baseline intervals. LUNA also records the lowest mean Self-BLEU, Distinct-1, surprisal, and entropy shifts; it is the only method that simultaneously achieves AUROC $> 0.99$ and $|\Delta\mathrm{PPL}_{\mathrm{med}}| < 0.1$ in a majority of settings, reaching this regime in $9$ of the $12$ settings while no baseline reaches it in more than $2$. Our code is available at [https://github.com/Shinwoo-Park/luna_watermark](https://github.com/Shinwoo-Park/luna_watermark).


† Corresponding author.

## 1 Introduction

Large language models now generate fluent text at scale, creating practical needs for provenance, attribution, and disinformation control (Liu et al., [2024](https://arxiv.org/html/2606.00613#bib.bib30); Lalai et al., [2025](https://arxiv.org/html/2606.00613#bib.bib31); European Parliament and Council of the European Union, [2024](https://arxiv.org/html/2606.00613#bib.bib29)). Decoding-time watermarking addresses these needs by embedding a statistical signal during generation and testing for it after deployment (Kirchenbauer et al., [2023](https://arxiv.org/html/2606.00613#bib.bib34); Dathathri et al., [2024](https://arxiv.org/html/2606.00613#bib.bib40)). A deployment-ready watermark should satisfy three properties together: *single-token non-distortion*, where the next-token distribution equals the base distribution after marginalizing over watermark randomness (Aaronson and Kirchner, [2022](https://arxiv.org/html/2606.00613#bib.bib14); Kuditipudi et al., [2024](https://arxiv.org/html/2606.00613#bib.bib39); Dathathri et al., [2024](https://arxiv.org/html/2606.00613#bib.bib40)); *model-free detection*, so platforms and third-party auditors can verify provenance without querying the original model or a surrogate (Kirchenbauer et al., [2023](https://arxiv.org/html/2606.00613#bib.bib34); Park et al., [2026](https://arxiv.org/html/2606.00613#bib.bib11)); and *adaptivity*, since different contexts provide different amounts of reliable capacity (Lu et al., [2024](https://arxiv.org/html/2606.00613#bib.bib36); Wang et al., [2025](https://arxiv.org/html/2606.00613#bib.bib37); Park et al., [2026](https://arxiv.org/html/2606.00613#bib.bib11)). Prior work has not, to our knowledge, combined all three; recent adaptive non-distortionary designs draw adaptivity from model-side uncertainty, which ties detection to logits or surrogate forward passes and weakens public verifiability.

The central observation behind LUNA is linguistic. Languages differ systematically in how much grammatical choice a position permits. For example, after the part-of-speech context `DET ADJ` in English (e.g., “a quiet …”), the next tag is almost always `NOUN`, carrying little grammatical choice; after the Korean morpheme sequence `NNG JKO` (object marker), the next slot can be a verb, adverbial, or adnominal modifier, spreading probability over several tags. The first context yields a low normalized next-tag entropy, the second a high one. Such variation reflects the language and its analysis pipeline rather than any particular language model, so a part-of-speech tagged corpus can estimate a reusable signal of local syntactic uncertainty (Comrie, [1989](https://arxiv.org/html/2606.00613#bib.bib99); Greenberg and others, [1963](https://arxiv.org/html/2606.00613#bib.bib100); Haspelmath, [2005](https://arxiv.org/html/2606.00613#bib.bib101)). Paired with a prefix-measurable non-distortionary sampler, this signal guides watermark capacity toward positions with greater grammatical choice while preserving the one-step marginal distribution, and it enables detection from the tokenizer, a tagger, and the secret key without model logits.

We introduce LUNA (Linguistics-Aware Non-Distortionary LLM Watermarking). LUNA estimates normalized next-tag entropy for part-of-speech contexts from an external corpus, reconstructs the current context $c_t$ from the prefix, retrieves $\lambda(c_t) \in [0,1]$, and maps it to a depth $m_t$ for a binary tournament sampler (Dathathri et al., [2024](https://arxiv.org/html/2606.00613#bib.bib40)). The schedule is prefix-measurable because $m_t$ is fixed before sampling $x_t$, which preserves single-token marginals under the random-key model and allows the detector to reconstruct the same depth sequence from text alone. We evaluate LUNA on a compact, typology-aware six-language grid spanning analytic English (Quirk et al., [1985](https://arxiv.org/html/2606.00613#bib.bib104); Marcus et al., [1993](https://arxiv.org/html/2606.00613#bib.bib103)), isolating Chinese (Li and Thompson, [1981](https://arxiv.org/html/2606.00613#bib.bib79); Xue et al., [2005](https://arxiv.org/html/2606.00613#bib.bib106)), agglutinative Korean (Sohn, [2001](https://arxiv.org/html/2606.00613#bib.bib80); Kim et al., [2024](https://arxiv.org/html/2606.00613#bib.bib71)) and Japanese (Tsujimura, [2013](https://arxiv.org/html/2606.00613#bib.bib107); Kuno, [1973](https://arxiv.org/html/2606.00613#bib.bib109)), fusional German (Haider, [2010](https://arxiv.org/html/2606.00613#bib.bib113); Vikner, [1995](https://arxiv.org/html/2606.00613#bib.bib112)), and templatic Semitic Arabic (McCarthy, [1981](https://arxiv.org/html/2606.00613#bib.bib117); Watson, [2002](https://arxiv.org/html/2606.00613#bib.bib118); Ryding, [2005](https://arxiv.org/html/2606.00613#bib.bib119)). Empirically, LUNA reaches AUROC $0.9959$ and TPR at $5\%$ FPR of $0.9868$, within $0.011$ of the strongest baseline, and records the lowest mean shift on each of the five quality metrics across the twelve settings.

## 2 Related Work

Table 1: Operational taxonomy of the primary baselines and LUNA. Column definitions appear in Section [2.4](https://arxiv.org/html/2606.00613#S2.SS4). The dagger ($\dagger$) marks the diversified GumbelSoft variant, which softens the deterministic Gumbel-max decoding and therefore does not inherit the exact single-token distribution-preservation guarantee of EXP or SynthID-Text.

### 2.1 Distribution-Shifting and Adaptive Watermarks

A first family of language-model watermarks embeds detectable evidence by modifying the next-token distribution during decoding. KGW (Kirchenbauer et al., [2023](https://arxiv.org/html/2606.00613#bib.bib34)) partitions the vocabulary into keyed green and red lists, biases green-list logits before sampling, and detects the watermark through a one-proportion test on the observed green-token count. This design enables efficient model-free detection because the detector needs the text, key, and tokenizer rather than target-model logits. The same mechanism makes KGW single-token distortionary, since the sampler explicitly changes probability mass assigned to green-list tokens.
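As a rough illustration of the one-proportion test mentioned above, the sketch below computes a z-statistic from a green-token count. The function name `green_list_z` and the parameter `gamma` (the assumed fraction of the vocabulary placed on the green list) are illustrative choices, not identifiers taken from KGW or from this paper.

```python
import math


def green_list_z(green_count: int, total: int, gamma: float) -> float:
    """One-proportion z-statistic for an observed green-token count.

    green_count : number of scored tokens that fall on the green list
    total       : number of scored tokens
    gamma       : assumed green-list fraction under the null (hypothetical name)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    expected = gamma * total
    std = math.sqrt(total * gamma * (1.0 - gamma))
    return (green_count - expected) / std


# Made-up counts for illustration only.
z = green_list_z(green_count=140, total=200, gamma=0.5)
```

Under the null hypothesis of unwatermarked text, the green count is approximately binomial with success probability `gamma`, so a large positive `z` is evidence of a watermark.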

Adaptive variants change insertion or detection across positions. SWEET (Lee et al., [2024](https://arxiv.org/html/2606.00613#bib.bib35)) targets code generation and applies KGW-style bias only at positions whose model entropy exceeds a threshold; its detector reuses the same threshold. EWD (Lu et al., [2024](https://arxiv.org/html/2606.00613#bib.bib36)) leaves KGW-style generation unchanged and instead weights detected tokens by model-side entropy. MorphMark (Wang et al., [2025](https://arxiv.org/html/2606.00613#bib.bib37)) adapts insertion strength according to the cumulative probability mass of green-list tokens and keeps KGW-style detection. STELA (Park et al., [2026](https://arxiv.org/html/2606.00613#bib.bib11)) estimates part-of-speech context indeterminacy from a corpus and uses that signal to modulate both green-list bias and detection weighting. These methods show that context-dependent allocation can improve watermarking, while their operational requirements differ: SWEET and EWD require model-side entropy at detection time, MorphMark preserves KGW-style model-free detection, and STELA obtains model-free linguistic adaptivity through a tagger rather than logits.

### 2.2 Distribution-Preserving and Gumbel-Based Watermarks

A second family seeks watermark evidence while preserving the base decoding distribution under explicit randomness assumptions. Aaronson-style exponential-minimum sampling (Aaronson and Kirchner, [2022](https://arxiv.org/html/2606.00613#bib.bib14)) and the framework of Kuditipudi et al. ([2024](https://arxiv.org/html/2606.00613#bib.bib39)) instantiate this principle through keyed sampling schemes such as inverse-transform and exponential-minimum sampling. SynthID-Text (Dathathri et al., [2024](https://arxiv.org/html/2606.00613#bib.bib40)) introduces tournament sampling and supports a single-token non-distortionary configuration with binary tournaments; its detector computes keyed scores without using the language model at detection time. Although DAWA (He et al., [2025](https://arxiv.org/html/2606.00613#bib.bib82)) jointly optimizes generation and detection under explicit distortion constraints, its adaptive mechanism is derived from the model distribution and a surrogate model rather than from external linguistic signals. GumbelSoft (Fu et al., [2024](https://arxiv.org/html/2606.00613#bib.bib12)) addresses generation diversity in Gumbel-keyed watermarking. It replaces deterministic decoding with a softmax variant of Logits-Addition, sampling from $\mathrm{softmax}((\ell_t + \xi_t)/\tau)$, and detects by aggregating keyed scores $\xi_t[x_t]$ for observed tokens. This makes GumbelSoft a strong model-free baseline, although the paper does not establish the exact one-step distribution-preservation guarantee that we assign to EXP (Aaronson and Kirchner, [2022](https://arxiv.org/html/2606.00613#bib.bib14)) and the non-distortionary SynthID-Text configuration in Table [1](https://arxiv.org/html/2606.00613#S2.T1).

### 2.3 Multilingual and Cross-Lingual Watermarks

Multilingual and cross-lingual settings expose difficulties that English-only evaluations can hide: translation, segmentation, morphology, and script can alter the evidence available to a detector. Prior work examines watermark survival under translation, cross-lingual manipulation, and back-translation robustness (He et al., [2024](https://arxiv.org/html/2606.00613#bib.bib58); Al Ghanim et al., [2025](https://arxiv.org/html/2606.00613#bib.bib97); Mohamed and Gubri, [2025](https://arxiv.org/html/2606.00613#bib.bib98)), and robustness benchmarks show that paraphrasing, editing, and other transformations can substantially change watermark evidence (Rastogi and Pruthi, [2024](https://arxiv.org/html/2606.00613#bib.bib92); Tu et al., [2024](https://arxiv.org/html/2606.00613#bib.bib93); Liang et al., [2025](https://arxiv.org/html/2606.00613#bib.bib96)).

This line of work primarily asks whether watermark evidence remains detectable after text has been transformed across languages, domains, or surface forms\. LUNA addresses a complementary question at generation time: where should watermark capacity enter the text when languages differ in morphology, segmentation, word order, and script? Its schedule conditions tournament depth on language\-specific part\-of\-speech context entropy, making the source of watermark evidence measurable before any downstream transformation occurs\.

### 2.4 Operational Taxonomy

Table [1](https://arxiv.org/html/2606.00613#S2.T1) summarizes the primary baselines and LUNA. *Single-token Non-distortion* denotes one-step marginal preservation under the stated sampling assumptions; *Adaptive Insertion* and *Adaptive Detection* denote context-dependent signal allocation during generation and detection; *Model-free Detection* denotes detection without target or surrogate language-model forward passes; and *Linguistic Signal* denotes whether the adaptive signal is derived from corpus-estimated linguistic structure rather than model logits.

Green\-list methods obtain evidence through logit bias and sacrifice single\-token non\-distortion\. Distribution\-preserving methods preserve one\-step marginals under their sampling assumptions, yet they do not use an interpretable linguistic signal\. Adaptive methods split across insertion and detection, with some relying on model\-side entropy\. LUNA occupies the missing operational point: it inherits a non\-distortionary tournament backbone, replaces fixed schedules with part\-of\-speech context uncertainty, adapts both insertion and detection through the same signal, and supports detection without target or surrogate model access\.

## 3 Background

### 3.1 Typological Stress Test

LUNA assumes that watermark capacity should track how much grammatical choice a position affords; this depends on the morphological and syntactic profile of the language. The evaluation uses six languages that stress distinct interactions among morphology, word order, spacing, and script: analytic English and isolating Chinese (low-inflection SVO with different writing systems), agglutinative Korean and Japanese (particles and endings creating fine-grained POS sequences), fusional German (verb-second syntax with case and agreement), and templatic Arabic (Semitic root-and-pattern morphology with an abjad script). Table [2](https://arxiv.org/html/2606.00613#S3.T2) summarizes the stress points.

Table 2: Typological stress test used by the evaluation.

### 3.2 Tournament Sampling and Detection

SynthID-Text is a generative watermarking scheme built from three components: a random seed generator, a sampling algorithm, and a scoring function. Let $\mathcal{V}$ denote the vocabulary, $x_{<t}$ the prefix before position $t$, and

$$p_t(v) = \Pr_{\mathrm{base}}(x_t = v \mid x_{<t})$$

the next-token distribution passed to the sampling layer. Given a seed $r_t$ derived from the recent context and a watermarking key, SynthID-Text defines layer-wise keyed functions $g_1, \ldots, g_m$. For the binary configuration used in the non-distortionary setting, each $g_\ell(v, r_t)$ assigns a value in $\{0, 1\}$ to candidate token $v$.

At a fixed depth $m$, tournament sampling first draws $2^m$ candidate tokens from $p_t$, with repetitions allowed. It then runs an $m$-layer knockout tournament: layer $\ell$ compares paired candidates with $g_\ell(\cdot, r_t)$, breaks ties randomly, and passes winners to the next layer until one token remains. SynthID-Text also admits a distortionary configuration with more than two competitors per match, which strengthens the watermark at the cost of token-level distortion. This subsection uses only the fixed-depth binary configuration; Section [4](https://arxiv.org/html/2606.00613#S4) introduces the adaptive depth schedule used by LUNA.
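The knockout procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SynthID-Text implementation: `toy_g` is a hypothetical hash-based stand-in for the keyed functions $g_\ell(v, r_t)$, the seed $r_t$ is folded into the `key` string, and the helper names are invented for this example.

```python
import hashlib
import random


def toy_g(key, layer, token):
    """Hypothetical keyed binary value; stands in for g_l(v, r_t)."""
    digest = hashlib.sha256(f"{key}|{layer}|{token}".encode("utf-8")).digest()
    return digest[0] & 1


def tournament_sample(p, m, g, rng):
    """Draw one token with an m-layer binary knockout tournament.

    p   : dict mapping token -> probability (base distribution p_t)
    m   : tournament depth
    g   : callable g(layer, token) -> 0 or 1
    rng : random.Random used for candidate draws and tie-breaking
    """
    tokens = list(p)
    weights = [p[v] for v in tokens]
    # Draw 2^m candidates from p_t, repetitions allowed.
    candidates = rng.choices(tokens, weights=weights, k=2 ** m)
    for layer in range(1, m + 1):
        winners = []
        # Pair up neighbours; the candidate with the larger layer value advances.
        for a, b in zip(candidates[0::2], candidates[1::2]):
            ga, gb = g(layer, a), g(layer, b)
            if ga > gb:
                winners.append(a)
            elif gb > ga:
                winners.append(b)
            else:
                winners.append(rng.choice((a, b)))  # random tie-break
        candidates = winners
    return candidates[0]


rng = random.Random(0)
p_t = {"the": 0.5, "a": 0.3, "this": 0.2}  # made-up distribution
token = tournament_sample(p_t, m=3, g=lambda layer, v: toy_g("secret", layer, v), rng=rng)
```

Because the candidate pool has $2^m$ entries, this literal form is only practical for small depths; it is meant to make the mechanism concrete rather than to be efficient.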

For detection, SynthID-Text recomputes the same keyed scores on an observed sequence and aggregates them into a text-level statistic. For fixed depth $m$, a simplified score over valid positions $\mathcal{I}$ is

$$\operatorname{Score}_m(x) = \frac{1}{m|\mathcal{I}|} \sum_{t \in \mathcal{I}} \sum_{\ell=1}^{m} g_\ell(x_t, r_t). \tag{1}$$

Watermarked text tends to receive higher keyed scores because tournament sampling favors candidates with larger layer values. This score depends on the observed text, the key, and the seed generator; it does not require a forward pass through the language model at detection time.

![Refer to caption](https://arxiv.org/html/2606.00613v1/x1.png)

Figure 1: Illustrative cross-language LUNA depth schedules for translations of the same semantic sentence. Each colored cell shows the tournament-depth tier selected from normalized next-tag entropy $\lambda(c_t)$: shallow uses $m_t = 5$, mid uses $m_t = 15$, and deep uses $m_t = 30$.

## 4 Method

### 4.1 Linguistic Depth Scheduling

LUNA modulates the fixed-depth SynthID-Text backbone by choosing the tournament depth from a linguistic signal. For language $L$, let $Q'$ denote the next fine-grained part-of-speech tag after context $c$, $\mathcal{S}_{L,c}$ the observed support of next tags in an external calibration corpus, and $K_{L,c} = |\mathcal{S}_{L,c}|$. With empirical probabilities $\hat{P}_L(q' \mid c)$, define

$$H_L(c) = -\sum_{q' \in \mathcal{S}_{L,c}} \hat{P}_L(q' \mid c) \log_2 \hat{P}_L(q' \mid c), \tag{2}$$

$$\lambda_L(c) = \begin{cases} 0, & K_{L,c} \leq 1, \\ \dfrac{H_L(c)}{\log_2 K_{L,c}}, & K_{L,c} > 1. \end{cases} \tag{3}$$

Thus $\lambda_L(c) \in [0, 1]$ measures how diffuse the observed next-tag distribution is relative to its support. LUNA estimates these tables on CulturaX (Nguyen et al., [2024](https://arxiv.org/html/2606.00613#bib.bib5)), separate from evaluation data. At generation and detection time, lookup backs off from the primary order to lower-order contexts and returns $\lambda_{\mathrm{def}} = 0.5$ when no supported context is available.
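Equations 2 and 3 reduce to a short computation over next-tag counts. The sketch below is illustrative only: the counts and the tag label `ADNOM` are made up, and the back-off rule (dropping the oldest tag until a stored context is found) is one plausible reading of "backs off to lower-order contexts", not the procedure confirmed by the paper.

```python
import math


def normalized_next_tag_entropy(next_tag_counts):
    """Return lambda in [0, 1] from counts of next tags observed after one context."""
    support = {tag: n for tag, n in next_tag_counts.items() if n > 0}
    k = len(support)
    if k <= 1:
        return 0.0  # Eq. (3), first case
    total = sum(support.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in support.values())  # Eq. (2)
    return entropy / math.log2(k)  # Eq. (3), second case


def lookup_lambda(table, context, default=0.5):
    """Assumed back-off: try the full context, then ever shorter suffixes."""
    for start in range(len(context)):
        key = tuple(context[start:])
        if key in table:
            return table[key]
    return default  # lambda_def when no supported context exists


# Made-up counts mirroring the DET ADJ and NNG JKO examples from the introduction.
english_det_adj = {"NOUN": 97, "ADJ": 2, "PROPN": 1}
korean_nng_jko = {"VERB": 40, "ADV": 30, "ADNOM": 30}
low = normalized_next_tag_entropy(english_det_adj)
high = normalized_next_tag_entropy(korean_nng_jko)
```

Normalizing by $\log_2 K_{L,c}$ makes contexts with different support sizes comparable, which is why a peaked three-tag context scores near zero while a nearly uniform three-tag context scores near one.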

LUNA maps $\lambda(c_t)$ to a three-tier depth schedule,

$$m_t = \begin{cases} m_{\min}, & \lambda(c_t) < \tau_1, \\ m_{\mathrm{mid}}, & \tau_1 \leq \lambda(c_t) < \tau_2, \\ m_{\max}, & \lambda(c_t) \geq \tau_2, \end{cases} \tag{4}$$

with the default schedule $(m_{\min}, m_{\mathrm{mid}}, m_{\max}) = (5, 15, 30)$. Thresholds $\tau_1$ and $\tau_2$ are frequency-weighted 25th and 75th percentiles of $\lambda_L$ on the calibration table. We adopt a three-tier discretization as a simple and auditable instantiation of depth scheduling: the schedule has only two free thresholds that are calibrated from the same corpus used for $\lambda$, and tier identities are easy to inspect during error analysis. Finer discretizations or a continuous mapping $m_t = f(\lambda(c_t))$ are natural extensions. The schedule is prefix-measurable because $c_t$, $\lambda(c_t)$, and $m_t$ are all determined before sampling the current token.
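The tier mapping of Equation 4 and its threshold calibration can be sketched as follows. The percentile rule used here (smallest value whose cumulative frequency share reaches the target) is an assumption, since the text does not specify an interpolation convention, and the calibration values are invented for illustration.

```python
def weighted_percentile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1).

    This convention is an assumption; other weighted-percentile definitions exist.
    """
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= q * total:
            return value
    return pairs[-1][0]


def depth_for(lam, tau1, tau2, depths=(5, 15, 30)):
    """Map lambda(c_t) to a tournament depth m_t following Eq. (4)."""
    m_min, m_mid, m_max = depths
    if lam < tau1:
        return m_min
    if lam < tau2:
        return m_mid
    return m_max


# Made-up calibration table: lambda per context and context frequency.
lambdas = [0.1, 0.4, 0.6, 0.9]
freqs = [10, 30, 40, 20]
tau1 = weighted_percentile(lambdas, freqs, 0.25)  # 0.4 for this toy table
tau2 = weighted_percentile(lambdas, freqs, 0.75)  # 0.6 for this toy table
schedule = [depth_for(lam, tau1, tau2) for lam in lambdas]  # [5, 15, 30, 30]
```

Because `depth_for` reads only quantities that are fixed before the current token is drawn, the detector can call the same function on reconstructed contexts and obtain the same depths.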

Figure [1](https://arxiv.org/html/2606.00613#S3.F1) illustrates the typological motivation: the same semantic content induces different LUNA depth schedules across the six evaluation languages.

### 4.2 Variable-Depth Generation and Model-Free Detection

LUNA extends the fixed-depth binary tournament in Section [3.2](https://arxiv.org/html/2606.00613#S3.SS2) by replacing the constant depth $m$ with the prefix-measurable depth $m_t$. Conditioned on a prefix and its depth, the current sampling step applies the same binary tournament layers used by SynthID-Text. For notation and implementation, we write the binary tournament in its probability-rescaling form. Let $G_{t,v}^{(\ell)} \in \{0, 1\}$ denote the value assigned to candidate token $v$ at layer $\ell$ for position $t$. Starting from $q_t^{(0)} = p_t$, LUNA applies

$$\mu_t^{(\ell)} = \sum_{u \in \mathcal{V}} q_t^{(\ell-1)}(u)\, G_{t,u}^{(\ell)}, \tag{5}$$

$$q_t^{(\ell)}(v) = q_t^{(\ell-1)}(v) \bigl(1 + G_{t,v}^{(\ell)} - \mu_t^{(\ell)}\bigr) \tag{6}$$

for $\ell = 1, \ldots, m_t$, and then samples

$$x_t \sim q_t^{(m_t)}. \tag{7}$$

A repeated-context safeguard leaves the base distribution unchanged when the current hash context repeats in the recent history; the detector skips the same positions. Figure [2](https://arxiv.org/html/2606.00613#S4.F2) illustrates the generation-time operation of LUNA.
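The rescaling update of Equations 5 to 7 can be written directly with NumPy. This is a minimal sketch, assuming the layer values for the current position are already available as a 0/1 array `G` of shape `(m_t, V)`; the function names are invented here, and the repeated-context safeguard is omitted.

```python
import numpy as np


def luna_step_distribution(p, G):
    """Apply m_t binary tournament layers in probability-rescaling form.

    p : shape (V,) base distribution p_t
    G : shape (m_t, V) array of 0/1 layer values G[l, v]
    """
    q = np.asarray(p, dtype=float).copy()
    for g_layer in np.asarray(G, dtype=float):
        mu = float(q @ g_layer)        # Eq. (5): mass currently on tokens with value 1
        q = q * (1.0 + g_layer - mu)   # Eq. (6): boost value-1 tokens, shrink value-0 tokens
    return q


def sample_token(p, G, rng):
    """Draw x_t from the rescaled distribution, Eq. (7)."""
    q = luna_step_distribution(p, G)
    return int(rng.choice(q.size, p=q / q.sum()))  # renormalize against float drift


p_t = np.array([0.5, 0.3, 0.2])          # made-up base distribution
G_t = np.array([[1, 0, 0]])              # one layer favoring token 0
q_t = luna_step_distribution(p_t, G_t)   # [0.75, 0.15, 0.10]
x_t = sample_token(p_t, G_t, np.random.default_rng(0))
```

Each layer keeps the distribution normalized, since the factors $1 + G - \mu$ average to one under $q_t^{(\ell-1)}$, so the depth $m_t$ can vary per position without any extra normalization step.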

![Refer to caption](https://arxiv.org/html/2606.00613v1/x2.png)

Figure 2: Generation-time operation of LUNA. For each prefix $x_{<t}$, the base language model supplies the next-token distribution $p_t(v)$, while the linguistic branch reconstructs the POS context $c_t$, looks up the precomputed normalized next-tag entropy $\lambda(c_t)$, and maps it to a tournament depth $m_t$. LUNA then applies an $m_t$-layer binary tournament that reweights $p_t(v)$ before sampling $x_t$.

Detection uses the text, tokenizer, part-of-speech tagger, linguistic signal ($\lambda$) table, and secret key. It does not access logits or forward passes of the original generation model, nor does it run a surrogate model. The detector aligns tag spans to token positions, reconstructs $c_t$, $\lambda(c_t)$, and $m_t$ at every valid position, and computes

$$S_t = \sum_{\ell=1}^{m_t} \left(G_{t,x_t}^{(\ell)} - \frac{1}{2}\right), \tag{8}$$

$$Z = \frac{\sum_{t \in \mathcal{I}} \omega_t S_t}{\sqrt{\frac{1}{4} \sum_{t \in \mathcal{I}} m_t \omega_t^2}}, \tag{9}$$

where $\mathcal{I}$ is the set of valid positions and $\omega_t = \lambda(c_t)$. Under the random-key null, each centered value $G_{t,x_t}^{(\ell)} - 1/2$ has variance $1/4$, so the denominator standardizes the weighted sum and $Z$ is comparable to a standard normal score. Appendix [A](https://arxiv.org/html/2606.00613#A1) gives full pseudocode for lookup, generation, and detection.
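Equations 8 and 9 translate into a short scoring loop. The sketch below assumes the detector has already reconstructed, for every valid position, the observed layer values of the emitted token and the weight $\omega_t$; the input layout and the zero-variance fallback are choices made for this example rather than details from the paper.

```python
import math


def luna_z_score(positions):
    """Weighted, standardized detection score.

    positions : iterable of (g_values, weight) pairs, one per valid position, where
                g_values holds the m_t observed 0/1 layer values for token x_t and
                weight is omega_t = lambda(c_t).
    """
    numerator = 0.0
    variance = 0.0
    for g_values, weight in positions:
        m_t = len(g_values)
        s_t = sum(g - 0.5 for g in g_values)   # Eq. (8)
        numerator += weight * s_t
        variance += 0.25 * m_t * weight ** 2   # null variance of weight * S_t
    if variance == 0.0:
        return 0.0  # assumed fallback when no position carries weight
    return numerator / math.sqrt(variance)     # Eq. (9)


# Made-up observations: one strongly marked position and one balanced position.
z = luna_z_score([([1, 1, 1, 1], 1.0), ([1, 0], 0.5)])
```

Positions with $\lambda(c_t) = 0$ contribute nothing to either the numerator or the denominator, so grammatically forced slots do not dilute the statistic.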

### 4.3 Single-Token Marginal Preservation

###### Theorem 1 (Single-token marginal preservation).

Fix a prefix $x_{<t}$ and let $p_t$ be the base distribution passed to the sampler. Assume that $m_t = m(x_{<t})$ is prefix-measurable and independent of the layer-wise watermark randomness at position $t$. Under the standard random-key model (Aaronson and Kirchner, [2022](https://arxiv.org/html/2606.00613#bib.bib14); Kuditipudi et al., [2024](https://arxiv.org/html/2606.00613#bib.bib39); Dathathri et al., [2024](https://arxiv.org/html/2606.00613#bib.bib40)), in which $G_{t,v}^{(\ell)} \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(1/2)$ across the index tuples $(t, \ell, v)$, the tournament update of Equations [5](https://arxiv.org/html/2606.00613#S4.E5) and [6](https://arxiv.org/html/2606.00613#S4.E6) satisfies

$$\mathbb{E}_G\left[\Pr_{\mathrm{LUNA}}(x_t = v \mid x_{<t}, G)\right] = p_t(v)$$

for every $v \in \mathcal{V}$.

Theorem [1](https://arxiv.org/html/2606.00613#Thmtheorem1) establishes a one-step marginal result under the random-key model. It does not claim equality of the realized fixed-key distribution at a single step, nor equality of the full joint distribution over sequences. The proof follows by conditioning on the prefix so that $m_t$ is fixed, applying the fixed-depth tournament expectation layer by layer, and using $\mathbb{E}[G_{t,v}^{(\ell)}] = 1/2$. Appendix [A.3](https://arxiv.org/html/2606.00613#A1.SS3) provides the full proof and implementation-level details.
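For a small vocabulary, the statement of Theorem 1 can be checked numerically by enumerating every possible assignment of layer values, which is exactly the uniform average implied by the iid $\mathrm{Bernoulli}(1/2)$ model. This is a sanity check under toy assumptions (three tokens, depth three, a made-up $p_t$), not a substitute for the proof.

```python
import itertools

import numpy as np


def rescale(p, G):
    """Probability-rescaling tournament update, Eqs. (5) and (6)."""
    q = np.asarray(p, dtype=float).copy()
    for g_layer in G:
        q = q * (1.0 + g_layer - q @ g_layer)
    return q


p = np.array([0.6, 0.3, 0.1])  # made-up base distribution
depth = 3

total = np.zeros_like(p)
count = 0
# Enumerate all 2^(depth * V) binary tables G; each is equally likely under the null key model.
for bits in itertools.product((0, 1), repeat=depth * p.size):
    G = np.array(bits, dtype=float).reshape(depth, p.size)
    total += rescale(p, G)
    count += 1

mean_q = total / count                      # exact expectation over G
max_gap = float(np.abs(mean_q - p).max())   # zero up to floating-point error
```

Individual tables `G` can move the distribution far from `p`; only the average over keys returns to it, which matches the distinction the text draws between the random-key marginal and the realized fixed-key distribution.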

Table 3:Evaluation languages, generation models, and part\-of\-speech pipelines\.

## 5 Experimental Settings

### 5.1 Languages and Models

The evaluation covers six languages and two domains (Wikipedia, news), yielding 12 language-by-domain settings. Each language uses an instruction-tuned generation model that natively supports it, alongside a language-specific part-of-speech (POS) pipeline. Table [3](https://arxiv.org/html/2606.00613#S4.T3) summarizes the main experimental setup. Appendix [B](https://arxiv.org/html/2606.00613#A2) gives full model identifiers, part-of-speech backends, tagsets, and selected context orders. For perplexity-based quality evaluation, we use Qwen2.5-1.5B (Yang et al., [2024](https://arxiv.org/html/2606.00613#bib.bib59)) as a shared reference model across languages.

### 5.2 Datasets

LUNA estimates $\lambda$ tables from CulturaX (Nguyen et al., [2024](https://arxiv.org/html/2606.00613#bib.bib5)), using 20,000 held-out records per language with a length filter of 300 to 4000 characters. The same held-out corpus supplies calibration for STELA. No evaluation prompt or generated output enters the calibration corpus. We use two dataset families: Wikipedia continuations for all six languages (Foundation, [2023](https://arxiv.org/html/2606.00613#bib.bib6)), and news continuations from XL-Sum (Hasan et al., [2021](https://arxiv.org/html/2606.00613#bib.bib7)) for English, Chinese, Korean, Japanese, and Arabic, plus MLSum (Scialom et al., [2020](https://arxiv.org/html/2606.00613#bib.bib8)) for German. Each language-by-domain setting contains 500 records, so each algorithm runs on 6,000 evaluation records.

### 5.3 Baselines and Generation Protocol

We compare LUNA with eight baselines: KGW, EWD, SWEET, MorphMark, STELA, GumbelSoft, EXP, and SynthID-Text. SynthID-Text is configured to match the expected tournament budget $B = \mathbb{E}[2^{m_t}]$ induced by the LUNA depth ladder, equalizing the average per-token distortion budget across the two methods; the matching formula appears in Appendix [B.2](https://arxiv.org/html/2606.00613#A2.SS2). All methods sample with temperature $0.7$, nucleus probability $0.95$, no top-$k$ cap, and $200$–$256$ new tokens; Qwen2.5-0.5B uses repetition penalty $1.1$, others $1.0$. Watermarked, unwatermarked, and human-reference texts are truncated to at most $256$ generation-tokenizer tokens before detection so that model-aware detectors fit within GPU memory at equal evidence length. Detailed seeds, context orders, and method-specific hyperparameters appear in Appendix [B.3](https://arxiv.org/html/2606.00613#A2.SS3); experiments run on a single NVIDIA RTX 3090 GPU with $24$ GB of memory.

Table 4: Main detection and quality preservation results, $12$-setting mean. SBleu, $\mathrm{Dist}_1$, Surp, and Entr abbreviate Self-BLEU, Distinct-1, surprisal, and entropy.

### 5.4 Evaluation Metrics

#### Detection metrics\.

We use AUROC and TPR at $5\%$ FPR. Both compare watermarked outputs with unwatermarked outputs generated by the same base model from the same prompts. AUROC summarizes the full ROC curve; TPR at $5\%$ FPR fixes a deployment-relevant operating point.
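Both detection metrics can be computed from two lists of detector scores. The sketch below is one common way to do so and makes two assumptions that the text does not spell out: ties count as one half in AUROC, and the $5\%$ FPR threshold is taken as the linearly interpolated 95th percentile of the unwatermarked scores.

```python
import numpy as np


def auroc(pos_scores, neg_scores):
    """Probability that a watermarked score exceeds an unwatermarked one (ties count 1/2)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())


def tpr_at_fpr(pos_scores, neg_scores, fpr=0.05):
    """Share of watermarked texts above the (1 - fpr) quantile of unwatermarked scores."""
    threshold = np.quantile(np.asarray(neg_scores, dtype=float), 1.0 - fpr)
    return float((np.asarray(pos_scores, dtype=float) > threshold).mean())


# Made-up scores for illustration only.
watermarked = [3.0, 4.0, 5.0, 6.0]
unwatermarked = [0.0, 1.0, 2.0, 3.0]
area = auroc(watermarked, unwatermarked)        # 15.5 of 16 pairs -> 0.96875
rate = tpr_at_fpr(watermarked, unwatermarked)   # threshold 2.85 -> 1.0
```

The pairwise form of AUROC is quadratic in the number of texts, which is fine for a few hundred records per setting but would call for a rank-based formula on larger samples.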

#### Quality metrics\.

For each text-level quality statistic $Q$, we form the absolute setting-level shift $|\Delta Q| = |Q_{\mathrm{w}} - Q_{\mathrm{u}}|$, where $Q_{\mathrm{w}}$ is computed on the watermarked outputs of a setting and $Q_{\mathrm{u}}$ on the unwatermarked outputs of the same setting and prompts. We define five statistics covering complementary notions of distortion. $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ uses median perplexity under Qwen2.5-1.5B and captures the likelihood of the generated text under the reference model. $|\Delta\text{Self-BLEU}|$ uses corpus-level Self-BLEU for intra-output lexical repetition. $|\Delta\text{Distinct-1}|$ uses the Distinct-1 ratio for unigram diversity at the surface level. $|\Delta\mathrm{Surprisal}|$ and $|\Delta\mathrm{Entropy}|$ use the mean token-level surprisal and predictive entropy under the same reference model, capturing distortion at the next-token-distribution level.
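As a concrete instance of the shift $|\Delta Q|$, the sketch below computes the absolute median perplexity shift for one setting. It assumes that per-token natural-log probabilities from the reference model are already available for each text; the function names and the toy inputs are illustrative, and obtaining the log-probabilities from Qwen2.5-1.5B is outside the scope of this example.

```python
import math
import statistics


def perplexity(token_logprobs):
    """Perplexity of one text from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def abs_median_ppl_shift(watermarked_logprobs, unwatermarked_logprobs):
    """|Q_w - Q_u| with Q the median text-level perplexity within a setting."""
    ppl_w = statistics.median(perplexity(lp) for lp in watermarked_logprobs)
    ppl_u = statistics.median(perplexity(lp) for lp in unwatermarked_logprobs)
    return abs(ppl_w - ppl_u)


# Toy inputs: every token has probability 1/4 in one text and 1/2 in the other.
wm = [[math.log(0.25)] * 4]
un = [[math.log(0.5)] * 4]
shift = abs_median_ppl_shift(wm, un)  # perplexities 4 and 2, so the shift is 2
```

Using the median rather than the mean keeps a handful of degenerate generations with extreme perplexity from dominating the setting-level statistic.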

#### Aggregation and confidence intervals\.

All statistics are aggregated at the setting level: we first compute each statistic within each of the $12$ language-by-domain settings and then report the mean over settings. Bootstrap $95\%$ confidence intervals resample the $12$ settings with replacement over $1000$ iterations. Section [6.1](https://arxiv.org/html/2606.00613#S6.SS1) reports both the mean and the bootstrap interval for $|\Delta\mathrm{PPL}_{\mathrm{med}}|$; intervals for the other four quality metrics appear in Appendix [C](https://arxiv.org/html/2606.00613#A3).
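The setting-level bootstrap can be sketched as below. The percentile form of the interval and the fixed seed are assumptions of this example, since the text states only that the $12$ settings are resampled with replacement over $1000$ iterations; the input values are placeholders, not results from the paper.

```python
import numpy as np


def bootstrap_ci_over_settings(setting_values, n_boot=1000, level=0.95, seed=0):
    """Mean over settings plus a percentile bootstrap interval for that mean."""
    values = np.asarray(setting_values, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row resamples the settings with replacement.
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    boot_means = values[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    low, high = np.quantile(boot_means, [alpha, 1.0 - alpha])
    return float(values.mean()), float(low), float(high)


# Placeholder per-setting shifts for 12 settings (illustrative numbers only).
shifts = [0.01, 0.02, 0.03, 0.05, 0.04, 0.06, 0.02, 0.08, 0.10, 0.03, 0.05, 0.07]
mean, low, high = bootstrap_ci_over_settings(shifts)
```

Resampling settings rather than individual texts treats each language-by-domain pair as the unit of analysis, so the interval reflects variation across settings rather than within them.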

## 6 Experimental Results

### 6.1 Main Detection-Quality Results

Table [4](https://arxiv.org/html/2606.00613#S5.T4) reports the experimental results. For every method that exposes a part-of-speech context order as a hyperparameter, namely LUNA and STELA, results use the per-algorithm, per-setting best context order from Table [9](https://arxiv.org/html/2606.00613#A2.T9) (Appendix [B.4](https://arxiv.org/html/2606.00613#A2.SS4)). Bold values mark the best entry per column.

#### Detection saturation\.

Six methods achieve AUROC above $0.995$: EWD, SWEET, KGW, STELA, SynthID-Text, and LUNA. Within this regime, the AUROC gap between EWD and LUNA is only $0.0031$, while the TPR-at-$5\%$-FPR gap is $0.0104$. Both gaps are small in absolute terms and fall within the bootstrap variability reported in Section [6.1](https://arxiv.org/html/2606.00613#S6.SS1) and Appendix [C](https://arxiv.org/html/2606.00613#A3), so the detection ranking at this level no longer reflects a deployment-meaningful performance separation. Furthermore, EWD and SWEET require language-model forward passes at detection time, while KGW, STELA, SynthID-Text, and LUNA detect from text, tokenizer, tagger, and secret key alone; LUNA therefore matches the strongest model-based detector within these margins without requiring the language model at verification.

Table 5: Controlled comparisons against LUNA, averaged over the $12$ settings. Detection columns report LUNA minus the control; quality columns report the control divided by LUNA, so factors above $1$ indicate that LUNA changes the metric less.

#### Dominant multi-metric quality preservation.

LUNA records the lowest mean shift on every one of the five quality metrics. Relative to the closest baseline (MorphMark across all five metrics), LUNA achieves a $9.5\times$ reduction on $|\Delta\mathrm{PPL}_{\mathrm{med}}|$, $1.5\times$ on $|\Delta\,\text{Self-BLEU}|$, $1.8\times$ on $|\Delta\,\text{Distinct-1}|$, $8.1\times$ on $|\Delta\mathrm{Surprisal}|$, and $2.4\times$ on $|\Delta\mathrm{Entropy}|$. The dominance covers complementary aspects of distortion at once: the language-model probability of the generated text ($|\Delta\mathrm{PPL}_{\mathrm{med}}|$), its lexical structure ($|\Delta\,\text{Self-BLEU}|$, $|\Delta\,\text{Distinct-1}|$), and the realized next-token-distribution statistics ($|\Delta\mathrm{Surprisal}|$, $|\Delta\mathrm{Entropy}|$).

#### Bootstrap\-significant gap on the quality metric\.

The bootstrap analysis confirms that the perplexity-shift gap is statistically robust. The LUNA confidence interval $[0.022, 0.073]$ does not overlap any baseline interval, and the next-lowest baseline lower bound is $0.158$ (MorphMark). LUNA exhibits bootstrap-significantly lower $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ than every baseline at the $95\%$ confidence level. Appendix [C](https://arxiv.org/html/2606.00613#A3) reports the full CI table.

### 6.2 Ablation Study

Table [5](https://arxiv.org/html/2606.00613#S6.T5) compares LUNA with three targeted references that isolate the main design decisions behind the method. STELA is the closest linguistic baseline: it uses a corpus-estimated part-of-speech signal, yet injects that signal through a distortionary green-list bias. SynthID-Text is the closest tournament baseline: it uses a non-distortionary binary tournament backbone, yet allocates watermark capacity without a linguistic signal. SynthID-Text-Entropy is a controlled baseline introduced in this paper. It replaces the corpus-estimated linguistic signal of LUNA with language-model entropy, thereby testing whether model-side uncertainty can substitute for the proposed POS-context signal. Appendix [D](https://arxiv.org/html/2606.00613#A4) gives the full construction.

#### Linguistic signal without non\-distortion: STELA\.

STELA and LUNA both use corpus-estimated POS-context uncertainty. The difference lies in the sampling backbone: STELA injects the signal through green-list logit bias, whereas LUNA uses it to modulate the depth of a non-distortionary tournament sampler. This comparison shows the value of replacing a distortionary linguistic watermark with a non-distortionary tournament mechanism. At comparable detection (AUROC and TPR@5% within $0.0023$ and $0.0085$, respectively), LUNA reduces the five quality shifts by $3.96\times$ to $26.41\times$.

#### Tournament sampling without linguistic scheduling: SynthID\-Text\.

SynthID-Text and LUNA share the binary tournament backbone. The difference is the source of the schedule: SynthID-Text uses prefix-hash randomness, while LUNA uses $\lambda(c_t)$ to place more capacity in high-uncertainty POS contexts. This comparison isolates the effect of linguistic scheduling within the same tournament family. LUNA reduces all five quality shifts by $2.15\times$ to $10.35\times$ while retaining nearly the same AUROC and TPR@5%.

#### Model entropy instead of linguistic entropy: SynthID\-Text\-Entropy\.

SynthID-Text-Entropy is a new controlled baseline designed for this study. It asks whether model-derived entropy can replace the external linguistic signal used by LUNA. The variant keeps the SynthID-Text tournament family and budget matching, yet uses language-model entropy as the adaptive signal rather than the corpus-estimated POS-context entropy used by LUNA. This gives a strong model-aware comparison point: detection is nearly identical to LUNA, with gaps of only $-0.0001$ AUROC and $-0.0007$ TPR@5%. However, its detector requires language-model forward passes at verification time, which sacrifices model-free detection, and LUNA still improves four of five quality metrics by $1.59\times$ to $1.76\times$ on average.

## 7 Conclusion

LUNA combines part-of-speech context entropy with a non-distortionary tournament sampler to jointly satisfy single-token non-distortion, model-free detection, and linguistic adaptivity. Across six typologically diverse languages and two domains, it records the lowest mean shift on five quality metrics and is the only method reaching AUROC $> 0.99$ and $|\Delta\mathrm{PPL}_{\mathrm{med}}| < 0.1$ in a majority of settings.

## Limitations

LUNA uses part-of-speech context entropy as a linguistic proxy for watermark capacity. This proxy captures syntactic uncertainty rather than every form of linguistic choice. It does not directly model semantic alternatives, discourse structure, pragmatic constraints, or register. The empirical results suggest that syntactic uncertainty provides a useful control signal, while richer linguistic schedules could combine POS context with morphology, dependency structure, discourse state, or semantic classes. LUNA also discretizes $\lambda(c_t)$ into three depth tiers; finer-grained tiers or a continuous mapping $m_t = f(\lambda(c_t))$ are natural extensions that we leave to future work. Such extensions would test how much of the watermark capacity arises from syntax alone and how much comes from broader linguistic organization.

The method also depends on language\-specific analyzers and entropy tables\. We use deterministic POS pipelines and keep the same tagger and tagset across calibration, generation, and detection\. This design makes the schedule auditable, yet it transfers responsibility to the linguistic preprocessing layer\. Languages with limited taggers, unstable segmentation, code switching, or domain\-specific orthography may require additional calibration\. Future work can study tagger uncertainty, multilingual tagset normalization, and analyzer ensembles that preserve model\-free detection while reducing dependence on a single preprocessing pipeline\.

The theoretical guarantee has a precise scope\. LUNA preserves single\-token marginals under the standard random\-key model for the non\-distortionary tournament sampler\. This statement does not imply equality of the full joint sequence distribution for a fixed key, and it does not provide an inherent guarantee against paraphrase, translation, editing, or adversarial attacks\. These transformations can change the observed POS sequence, the reconstructed schedule, or the keyed evidence\. Our evaluation therefore treats robustness as an empirical question rather than as a theorem\-level property\.

Finally, model\-free detection does not mean infrastructure\-free detection\. A verifier still needs the tokenizer, the POS analyzer, the entropy table, and the secret key\. This requirement is substantially weaker than access to target\-model logits or surrogate forward passes, and it supports public\-verification scenarios more naturally than model\-dependent adaptive schemes\. Nevertheless, deployment would need key management, versioning of entropy tables, and documented analyzer configurations\. These operational requirements define a concrete path for extending LUNA from a research watermark to an auditable multilingual provenance system\.

## Appendix A Method Details

This appendix provides the implementation details omitted from the main text for space. Algorithm [1](https://arxiv.org/html/2606.00613#alg1) gives the deterministic entropy lookup with order backoff. Algorithm [2](https://arxiv.org/html/2606.00613#alg2) gives per-position generation, and Algorithm [3](https://arxiv.org/html/2606.00613#alg3) gives model-free detection.

Algorithm 1: Order-backoff lookup of normalized next-tag entropy.

1. Input: POS context $c_t$; language $L$; lookup tables and thresholds.
2. Output: normalized next-tag entropy $\lambda \in [0,1]$.
3. for $r = k_{\mathrm{primary}}, k_{\mathrm{primary}}-1, \ldots, 2$ do
4. $c^{(r)} \leftarrow$ truncate $c_t$ to the last $r-1$ tags
5. if $c^{(r)} \in \mathcal{T}_L^{(r)}$ and $N_L(c^{(r)}) \geq \nu$ then
6. return $\lambda_L(c^{(r)})$
7. end if
8. end for
9. return $\lambda_{\mathrm{def}}$

### A.1 Entropy Lookup with Order Backoff

We summarize the additional notation used by Algorithm [1](https://arxiv.org/html/2606.00613#alg1). For language $L$ and order $r \in \{2, \ldots, k_{\mathrm{primary}}\}$, let $\mathcal{T}_L^{(r)}$ denote the set of length-$(r-1)$ POS contexts observed in the calibration corpus, $N_L(c)$ denote the empirical occurrence count of context $c$ in that corpus, and $\nu$ denote a fixed minimum-count threshold that controls when a stored value is reused. The threshold $\nu$ is shared across orders and languages, and is chosen on the calibration corpus so that stored $\lambda$ values rely only on contexts with stable empirical estimates.

The lookup starts at the primary order $k_{\mathrm{primary}}$ and backs off through lower orders down to order $2$. It returns a stored value only when the context exists in $\mathcal{T}_L^{(r)}$ and its empirical frequency reaches the threshold $\nu$. If no supported context appears, it returns $\lambda_{\mathrm{def}} = 0.5$.
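The backoff rule can be sketched in a few lines. The data layout here is an assumption made for illustration: `tables[r]` and `counts[r]` are hypothetical dictionaries keyed by tuples of $r-1$ POS tags, and the tag names and numbers in the usage example are invented.

```python
def lookup_lambda(context, tables, counts, k_primary, nu, lam_default=0.5):
    """Order-backoff lookup of normalized next-tag entropy.

    context   : sequence of POS tags preceding the current position
    tables[r] : dict mapping a tuple of r-1 tags to a stored lambda value
    counts[r] : dict mapping the same tuples to calibration-corpus counts
    """
    for r in range(k_primary, 1, -1):           # r = k_primary, ..., 2
        c = tuple(context[-(r - 1):])           # keep the last r-1 tags
        supported = counts.get(r, {}).get(c, 0) >= nu
        if c in tables.get(r, {}) and supported:
            return tables[r][c]
    return lam_default                          # no supported context found


if __name__ == "__main__":
    tables = {3: {("DET", "NOUN"): 0.4}, 2: {("NOUN",): 0.7}}
    counts = {3: {("DET", "NOUN"): 3}, 2: {("NOUN",): 50}}
    ctx = ["ADJ", "DET", "NOUN"]
    # The order-3 context is too rare for nu=5, so the lookup backs off.
    print(lookup_lambda(ctx, tables, counts, k_primary=3, nu=5))
```

Because the lookup reads only stored tables and the tag prefix, generation and detection obtain the same $\lambda$ whenever they recover the same POS context.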

Algorithm 2: LUNA generation at position $t$.

1. Input: prefix $x_{<t}$; distribution $p_t$; keys; $\Phi$; schedule; tagger $\mathcal{A}$; history $\mathcal{H}$.
2. Output: next token $x_t$.
3. $c_t \leftarrow \mathrm{POSContext}(\mathcal{A}, x_{<t})$
4. $\lambda \leftarrow \mathrm{Lookup}(c_t)$
5. $m_t \leftarrow \mathrm{MapToDepth}(\lambda)$ using Equation [4](https://arxiv.org/html/2606.00613#S4.E4)
6. $h \leftarrow \mathrm{HashContext}(x_{<t})$
7. $r \leftarrow \mathbf{1}\{h \in \mathcal{H}\}$
8. $\mathcal{H} \leftarrow \mathrm{UpdateHistory}(\mathcal{H}, h)$
9. if $r = 1$ then
10. return $x_t \sim p_t$
11. end if
12. $q^{(0)} \leftarrow p_t$
13. for $\ell = 1$ to $m_t$ do
14. $G_v^{(\ell)} \leftarrow \Phi(k_\ell, h, v)$ for each $v \in \mathcal{V}$
15. $\mu^{(\ell)} \leftarrow \sum_{u \in \mathcal{V}} q^{(\ell-1)}(u)\, G_u^{(\ell)}$
16. $q^{(\ell)}(v) \leftarrow q^{(\ell-1)}(v)\,(1 + G_v^{(\ell)} - \mu^{(\ell)})$
17. end for
18. return $x_t \sim q^{(m_t)}$
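The tournament update in steps 12 to 18 of Algorithm 2 can be sketched as below. This is a hedged illustration, not the released implementation: `g_bit` is a hypothetical SHA-256 stand-in for the keyed function $\Phi$, the number of keys plays the role of the depth $m_t$, and the repeated-context check of steps 6 to 11 is omitted.

```python
import hashlib
import random


def g_bit(key, context_hash, token):
    """Hypothetical keyed bit standing in for Phi(k, h, v)."""
    digest = hashlib.sha256(f"{key}|{context_hash}|{token}".encode()).digest()
    return digest[0] & 1


def tournament_distribution(p, keys, context_hash):
    """Apply one binary tournament layer per key to the distribution p."""
    q = dict(p)
    for key in keys:                               # layers 1..m_t
        g = {v: g_bit(key, context_hash, v) for v in q}
        mu = sum(q[v] * g[v] for v in q)           # mass currently on G = 1
        q = {v: q[v] * (1 + g[v] - mu) for v in q}
    return q


def sample(q, rng):
    """Draw one token from the reweighted distribution."""
    tokens = list(q)
    return rng.choices(tokens, weights=[q[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    p_t = {"a": 0.5, "b": 0.3, "c": 0.2}           # toy next-token distribution
    q = tournament_distribution(p_t, keys=[1, 2, 3], context_hash="h")
    print(q, sample(q, random.Random(0)))
```

Each layer multiplies $q(v)$ by $1 + G_v - \mu$, which is non-negative because $\mu \le 1$, and the reweighted values still sum to one because $\sum_v q(v)(G_v - \mu) = 0$.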

### A.2 Generation and Detection Algorithms

### A.3 Proof of Theorem [1](https://arxiv.org/html/2606.00613#Thmtheorem1)

Algorithm 3: LUNA model-free detection.

1. Input: text $x = (x_1, \ldots, x_T)$; tokenizer; tagger $\mathcal{A}$; $\lambda$ table; keys; threshold $\gamma$.
2. Output: decision, watermarked or not.
3. Run $\mathcal{A}$ on the decoded full text and align tag spans to token positions
4. $\mathcal{I} \leftarrow \emptyset$; $\mathcal{H} \leftarrow \emptyset$
5. for $t = 1$ to $T$ do
6. $h \leftarrow \mathrm{HashContext}(x_{<t})$
7. $r \leftarrow \mathbf{1}\{h \in \mathcal{H}\}$
8. $\mathcal{H} \leftarrow \mathrm{UpdateHistory}(\mathcal{H}, h)$
9. if $x_t$ is EOS or $r = 1$ then
10. continue
11. end if
12. Recover $c_t$ before position $t$
13. $\lambda \leftarrow \mathrm{Lookup}(c_t)$
14. $m_t \leftarrow \mathrm{MapToDepth}(\lambda)$
15. $\omega_t \leftarrow \lambda$
16. $S_t \leftarrow \sum_{\ell=1}^{m_t} \left(G_{t,x_t}^{(\ell)} - 1/2\right)$
17. $\mathcal{I} \leftarrow \mathcal{I} \cup \{t\}$
18. end for
19. Compute $Z$ with Equation [9](https://arxiv.org/html/2606.00613#S4.E9)
20. return $\mathbf{1}\{Z > \gamma\}$
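The detection loop of Algorithm 3 can be sketched as follows, under several loudly stated assumptions. Equation 9 is not reproduced in this appendix, so the normalization used for $Z$ here (a weighted sum divided by its standard deviation under independent fair bits) is one standard choice and not necessarily the authors' statistic. `g_bit` is a hypothetical stand-in for $\Phi$, the sliding-window context hash is invented for illustration, and the per-position `lambdas` and `depths` are assumed to come from the tagger, the lookup, and the depth mapping.

```python
import hashlib
import math


def g_bit(key, context_hash, token):
    """Hypothetical keyed bit standing in for Phi(k, h, v)."""
    digest = hashlib.sha256(f"{key}|{context_hash}|{token}".encode()).digest()
    return digest[0] & 1


def detect(tokens, lambdas, depths, keys, gamma, window=4, eos="<eos>"):
    """Model-free detection sketch: no language-model forward pass is used."""
    seen, num, var = set(), 0.0, 0.0
    for t, tok in enumerate(tokens):
        # Assumed HashContext: the previous `window` tokens joined as a string.
        h = "|".join(str(x) for x in tokens[max(0, t - window):t])
        repeated = h in seen
        seen.add(h)
        if tok == eos or repeated:
            continue                      # skip EOS and repeated contexts
        m, w = depths[t], lambdas[t]      # w plays the role of omega_t = lambda
        s = sum(g_bit(keys[layer], h, tok) - 0.5 for layer in range(m))
        num += w * s
        var += (w ** 2) * m * 0.25        # variance of m fair bits, scaled by w^2
    z = num / math.sqrt(var) if var > 0 else 0.0   # assumed form of Z
    return z, z > gamma


if __name__ == "__main__":
    toks = ["the", "cat", "sat", "<eos>"]
    print(detect(toks, [0.6, 0.8, 0.4, 0.5], [2, 3, 1, 1],
                 keys=[11, 12, 13], gamma=2.0))
```

The verifier needs only the token sequence, the reconstructed schedule, and the keys, which is the sense in which detection is model-free.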

Condition on the prefix $x_{<t}$. The POS reconstruction returns $c_t$, the lookup returns $\lambda(c_t)$, and the schedule fixes $m_t$ before token $x_t$ is sampled. The current step therefore reduces to fixed-depth binary tournament sampling with depth $m_t$. Let $q^{(0)} = p_t$. For layer $\ell$, condition on previous layers, so $q^{(\ell-1)}$ is fixed. Under the random-key model, the binary values $G_v^{(\ell)}$ are independent of $q^{(\ell-1)}$ and satisfy $\mathbb{E}[G_v^{(\ell)}] = 1/2$. Because $q^{(\ell-1)}$ sums to one, $\mathbb{E}[\mu^{(\ell)}] = \sum_{u \in \mathcal{V}} q^{(\ell-1)}(u)\,\mathbb{E}[G_u^{(\ell)}] = 1/2$ as well, so

$$
\mathbb{E}_{G^{(\ell)}}\left[q^{(\ell)}(v) \mid q^{(\ell-1)}\right] = q^{(\ell-1)}(v)\left(1 + \frac{1}{2} - \frac{1}{2}\right) = q^{(\ell-1)}(v).
$$

Iterating across the active layers and applying the tower property of conditional expectation yields $\mathbb{E}_G[q^{(m_t)}(v)] = p_t(v)$. Since $x_t$ is drawn from $q^{(m_t)}$ conditional on $G$, $\Pr_{\mathrm{LUNA}}(x_t = v \mid x_{<t}, G) = q^{(m_t)}(v)$, and taking expectation over $G$ gives $\mathbb{E}_G[\Pr_{\mathrm{LUNA}}(x_t = v \mid x_{<t}, G)] = p_t(v)$. Prefix measurability ensures that $m_t$ does not depend on the current sampled token, so it remains fixed throughout this argument.
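The single-token claim can be checked numerically on a toy vocabulary by enumerating every possible assignment of the bits $G$ and averaging the resulting distributions. The sketch below uses exact rational arithmetic; the three-token distribution is an arbitrary example, and the uniform average over all bit patterns models the random-key assumption of independent fair bits.

```python
from fractions import Fraction
from itertools import product


def layer(q, g):
    """One tournament layer: q(v) <- q(v) * (1 + g_v - mu)."""
    mu = sum(qv * gv for qv, gv in zip(q, g))
    return [qv * (1 + gv - mu) for qv, gv in zip(q, g)]


def expected_after(p, depth):
    """Average q^(depth) over all equally likely bit assignments."""
    n = len(p)
    total = [Fraction(0)] * n
    count = 0
    # One bit vector of length n per layer, enumerated jointly over layers.
    for bits in product(product((0, 1), repeat=n), repeat=depth):
        q = list(p)
        for g in bits:
            q = layer(q, g)
        total = [a + b for a, b in zip(total, q)]
        count += 1
    return [t / count for t in total]


if __name__ == "__main__":
    p_t = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
    print(expected_after(p_t, 2) == p_t)
```

The check covers the marginal over keys only; for any single fixed bit pattern the reweighted distribution generally differs from $p_t$, consistent with the scope of the guarantee stated in the Limitations section.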

Table 6: POS backends and tagsets used by LUNA. These choices match the tagsets used to build the corresponding $\lambda$ tables.

## Appendix B Experimental Setting Details

### B.1 Language Typology, Models, and POS Pipelines

Table [6](https://arxiv.org/html/2606.00613#A1.T6) lists the POS backend and tagset used at entropy estimation, generation, and detection time. For every language, the same tagger and tagset are used across these three stages.

Table [7](https://arxiv.org/html/2606.00613#A2.T7) lists the full generation-model identifiers used in the experiments.

Table 7: Generation-model identifiers.

### B.2 Budget Matching for SynthID-Text

The SynthID-Text baseline uses the same binary tournament update as LUNA and matches the expected tournament budget induced by the LUNA depth ladder. This budget matching removes the linguistic signal from the comparison: the depth is derived from a prefix hash and a salt rather than from $\lambda(c_t)$, and detection uses uniform weights $\omega_t = 1$. The schedule chooses between adjacent depths $m_{\mathrm{floor}} = \lfloor \log_2 B \rfloor$ and $m_{\mathrm{ceil}} = \lceil \log_2 B \rceil$ so that

$$
\mathbb{E}[2^{m_t}] = (1 - p_{\mathrm{ceil}})\,2^{m_{\mathrm{floor}}} + p_{\mathrm{ceil}}\,2^{m_{\mathrm{ceil}}} = B.
$$

If $m_{\mathrm{floor}} = m_{\mathrm{ceil}}$, the schedule uses that depth deterministically. Otherwise,

$$
p_{\mathrm{ceil}} = \frac{B - 2^{m_{\mathrm{floor}}}}{2^{m_{\mathrm{ceil}}} - 2^{m_{\mathrm{floor}}}}. \tag{10}
$$

When calibration supplies language-specific tier proportions $(p_{\mathrm{low}}, p_{\mathrm{mid}}, p_{\mathrm{high}})$, the matched budget is

$$
B = p_{\mathrm{low}}\,2^{m_{\min}} + p_{\mathrm{mid}}\,2^{m_{\mathrm{mid}}} + p_{\mathrm{high}}\,2^{m_{\max}}. \tag{11}
$$

At nominal proportions $(0.25, 0.5, 0.25)$ and ladder $(5, 15, 30)$, this formula gives $B_0 = 268{,}451{,}848$.
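The budget-matching arithmetic can be reproduced with exact fractions. The function names below are hypothetical, and the floor depth is found by an integer loop rather than a floating-point logarithm to avoid rounding near powers of two; the sketch assumes $B \ge 1$.

```python
from fractions import Fraction


def matched_budget(proportions, ladder):
    """Equation (11): B = sum over tiers of p_tier * 2^(m_tier)."""
    return sum(Fraction(p) * 2 ** m for p, m in zip(proportions, ladder))


def adjacent_depth_schedule(budget):
    """Return (m_floor, m_ceil, p_ceil) so that E[2^m_t] equals the budget."""
    m_floor = 0
    while 2 ** (m_floor + 1) <= budget:      # largest m with 2^m <= B
        m_floor += 1
    if 2 ** m_floor == budget:               # B is an exact power of two
        return m_floor, m_floor, Fraction(0)
    m_ceil = m_floor + 1
    # Equation (10)
    p_ceil = (budget - 2 ** m_floor) / (2 ** m_ceil - 2 ** m_floor)
    return m_floor, m_ceil, p_ceil


if __name__ == "__main__":
    nominal = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
    b0 = matched_budget(nominal, [5, 15, 30])
    print(b0)                                # 268451848
    print(adjacent_depth_schedule(b0))
```

For the nominal configuration the budget is dominated by the deepest tier, so the matched schedule sits at depth $28$ almost always and uses depth $29$ with probability $16392 / 2^{28}$.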

Table 8: Watermark-specific settings used in the primary comparison.

### B.3 Watermark Baselines and Hyperparameters

Table [8](https://arxiv.org/html/2606.00613#A2.T8) summarizes the main watermark-specific settings. The KGW-family baselines follow the MarkLLM implementations used in the experiments. The SynthID-Text row uses the SynthID-Text binary tournament backbone and applies the budget-matching procedure in Appendix [B.2](https://arxiv.org/html/2606.00613#A2.SS2) for fair comparison with LUNA.

### B.4 Calibration Details

The context order $k$ is selected from $\{2, 3, 4\}$ separately for LUNA and STELA in each language-by-domain setting. For each algorithm, we choose the $k$ that minimizes $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ on the watermarked outputs. The selected $k$ values for LUNA and STELA are shown in Table [9](https://arxiv.org/html/2606.00613#A2.T9); the two methods agree on the selected order in $4$ of the $12$ settings.

Table 9: Per-algorithm selected POS context order $k$ for the linguistic methods. Selection minimizes $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ within each language-by-domain setting. LUNA prefers $k \in \{3, 4\}$ in 10 of 12 settings, while STELA prefers $k = 2$ in 7 of 12. The per-$k$ comparison appears in Appendix [G](https://arxiv.org/html/2606.00613#A7).

## Appendix C Bootstrap Confidence Intervals for Quality Metrics

Section [6.1](https://arxiv.org/html/2606.00613#S6.SS1) reports the bootstrap $95\%$ confidence interval for the quality metric $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ in Table [4](https://arxiv.org/html/2606.00613#S5.T4). This appendix lists the bootstrap intervals for the four remaining quality metrics under the same protocol: $1000$ iterations, resampling the $12$ language-by-domain settings with replacement, seed $42$. For LUNA and STELA, the intervals use the per-algorithm, per-setting best context order from Table [9](https://arxiv.org/html/2606.00613#A2.T9).

Tables 10 and [11](https://arxiv.org/html/2606.00613#A3.T11) group the four remaining quality metrics by the aspect of distortion they capture: lexical structure (Table 10) and next-token-distribution statistics (Table 11).

Table 10: Bootstrap $95\%$ confidence intervals for the lexical-structure quality metrics. LUNA's upper bound lies strictly below the lower bound of six of eight baselines on each metric; MorphMark and SynthID-Text are the two methods whose intervals overlap LUNA's on both metrics.

Table 11: Bootstrap $95\%$ confidence intervals for the next-token-distribution quality metrics. LUNA's upper bound lies strictly below the lower bound of every baseline on $|\Delta\mathrm{Surprisal}|$ and of seven of eight baselines on $|\Delta\mathrm{Entropy}|$, with MorphMark the only overlap on the latter.

## Appendix D Design and Analysis of SynthID-Text-Entropy

This appendix defines the SynthID-Text-Entropy variant used in Section [6.2](https://arxiv.org/html/2606.00613#S6.SS2). This variant is not a previously published watermark. It is a diagnostic baseline that asks whether a model-derived entropy signal can replace the external linguistic signal used by LUNA.

### D.1 Design Rationale

LUNA combines three ingredients: a non\-distortionary SynthID\-Text tournament backbone, a prefix\-measurable adaptive schedule derived from POS\-context entropy, and a detector that reconstructs the same linguistic schedule without language\-model forward passes\. STELA tests the value of replacing a distortionary linguistic watermark with a non\-distortionary tournament backbone\. SynthID\-Text tests the value of adding a linguistic schedule to a tournament sampler\. SynthID\-Text\-Entropy tests a third question: whether model\-side entropy can play the role that POS\-context entropy plays in LUNA\.

SynthID\-Text\-Entropy keeps the SynthID\-Text tournament family and the budget\-matching procedure used for the SynthID\-Text baseline\. It replaces the external linguistic signal with language\-model entropy in the adaptive detector\. This choice creates a strong model\-aware comparison point\. It also removes model\-free detection, since the verifier must run a language model to obtain per\-token entropy values\.

### D.2 Budget Matching with LUNA

We match the expected tournament budget of SynthID-Text-Entropy to LUNA using the same procedure as Appendix [B.2](https://arxiv.org/html/2606.00613#A2.SS2). Let $B=\mathbb{E}[2^{m_{t}}]$ denote the expected tournament budget induced by the LUNA depth ladder under the calibration proportions $(p_{\mathrm{low}},p_{\mathrm{mid}},p_{\mathrm{high}})$. The SynthID-Text-Entropy configuration uses the corresponding budget-matched SynthID-Text tournament schedule, so the comparison is not driven by a larger average tournament budget.
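For a three-tier ladder, the budget expands to $B = p_{\mathrm{low}}2^{m_{\mathrm{low}}} + p_{\mathrm{mid}}2^{m_{\mathrm{mid}}} + p_{\mathrm{high}}2^{m_{\mathrm{high}}}$. The sketch below evaluates this sum; the tier depths and proportions are hypothetical placeholders, since this appendix does not list the ladder values, and the function name is illustrative.

```python
from typing import Sequence


def expected_budget(proportions: Sequence[float], depths: Sequence[int]) -> float:
    """Compute B = E[2^{m_t}] for a discrete depth ladder."""
    if len(proportions) != len(depths):
        raise ValueError("need one depth per tier")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("tier proportions must sum to 1")
    return sum(p * 2 ** m for p, m in zip(proportions, depths))


# Hypothetical ladder: both the proportions and the depths are placeholders.
p_low, p_mid, p_high = 0.25, 0.50, 0.25
m_low, m_mid, m_high = 1, 2, 3

B = expected_budget((p_low, p_mid, p_high), (m_low, m_mid, m_high))
mean_depth = p_low * m_low + p_mid * m_mid + p_high * m_high
print(B, 2 ** mean_depth)  # 4.5 4.0
```

The two printed numbers differ because $\mathbb{E}[2^{m_t}] \ge 2^{\mathbb{E}[m_t]}$ by Jensen's inequality, so matching the expected budget is not the same as matching the mean depth.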

### D.3 Comparison and Practical Implications

At the 12-setting mean, SynthID-Text-Entropy attains AUROC 0.9960 and $|\Delta\mathrm{PPL}_{\mathrm{med}}|=0.0787$, while LUNA attains AUROC 0.9959 and $|\Delta\mathrm{PPL}_{\mathrm{med}}|=0.0447$. Detection is effectively indistinguishable at this aggregate level: the AUROC gap is 0.0001 in favor of SynthID-Text-Entropy, and the TPR@5% gap is 0.0007. On quality preservation, LUNA improves four of the five reported quality metrics by factors of $1.59\times$ to $1.76\times$, while SynthID-Text-Entropy is $1.09\times$ better on $|\Delta\mathrm{Distinct\text{-}1}|$.

The comparison clarifies the deployment trade\-off\. Model entropy supplies a powerful adaptive signal, yet it requires language\-model forward passes at verification time\. This dependence creates serving cost, version coupling, and weaker third\-party verifiability when the generator or an appropriate surrogate is not available\. LUNA reaches the same detection regime without this dependence and preserves quality better on most reported metrics\.

## Appendix E Detection-Quality Trade-off

This appendix visualizes the per\-setting structure that underlies the aggregate detection and quality results in Section[6\.1](https://arxiv.org/html/2606.00613#S6.SS1)\. We characterize the trade\-off space through three complementary views: the Pareto frontier \([E\.1](https://arxiv.org/html/2606.00613#A5.SS1)\), the per\-setting sweet\-spot distribution \([E\.2](https://arxiv.org/html/2606.00613#A5.SS2)\), and the per\-baseline multi\-metric quality advantage \([E\.3](https://arxiv.org/html/2606.00613#A5.SS3)\)\.

### E.1 Pareto Frontier of the Detection–Quality Trade-off

Figure [3](https://arxiv.org/html/2606.00613#A5.F3) plots the 12-setting mean of AUROC against $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ on a logarithmic horizontal axis. Four of the nine methods are Pareto-optimal: EWD, SWEET, SynthID-Text, and LUNA; the remaining five (KGW, STELA, MorphMark, EXP, GumbelSoft) are dominated by some method that achieves both better detection and lower distortion.

![Refer to caption](https://arxiv.org/html/2606.00613v1/x3.png)

Figure 3: Pareto frontier of the detection-quality trade-off, with AUROC on the vertical axis and $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ on the horizontal axis (log scale), averaged over the 12 language-by-domain settings. Four of nine methods are Pareto-optimal (filled markers, connected by the frontier); the other five are dominated (gray markers). The shaded sweet-spot region in the upper-left corner marks AUROC $>0.99$ and shift $<0.1$; LUNA is the only method that enters it.

LUNA occupies the left endpoint of the Pareto front. The nearest Pareto neighbor, SynthID-Text, sits at $|\Delta\mathrm{PPL}_{\mathrm{med}}|=0.463$ with AUROC 0.9972; moving to LUNA reduces $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ by a factor of $10.4\times$ at an AUROC cost of 0.0013. The shaded sweet-spot region marks the operating regime where AUROC $>0.99$ and $|\Delta\mathrm{PPL}_{\mathrm{med}}|<0.1$; LUNA is the only Pareto-optimal method inside this region, and the only method of the nine to enter it at the 12-setting mean.
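The dominance test behind such a frontier can be sketched as below. The sketch uses the standard weak-dominance definition (no worse on both axes, strictly better on one), which is an assumption about how the frontier was computed. Only the LUNA and SynthID-Text coordinates come from the text; the third point is a hypothetical placeholder added to show a dominated method.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (AUROC, absolute median perplexity shift)


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(points: Dict[str, Point]) -> List[str]:
    """Return the names of all methods that no other method dominates."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


points = {
    "LUNA": (0.9959, 0.0447),          # 12-setting means quoted in the text
    "SynthID-Text": (0.9972, 0.463),   # 12-setting means quoted in the text
    "toy-dominated": (0.9950, 0.60),   # hypothetical point, not from the paper
}
print(pareto_front(points))  # ['LUNA', 'SynthID-Text']

# Arithmetic behind the quoted trade-off between the two frontier neighbors.
print(round(0.463 / 0.0447, 1), round(0.9972 - 0.9959, 4))  # 10.4 0.0013
```

Neither LUNA nor SynthID-Text dominates the other, since each is better on exactly one axis, which is why both sit on the frontier.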

### E.2 Per-Setting Sweet-Spot Distribution

The aggregate sweet-spot finding holds at the per-setting level. Figure [4](https://arxiv.org/html/2606.00613#A5.F4) colors each (method, language-domain) cell by $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ on a logarithmic scale; green circles mark the cells that jointly satisfy AUROC $>0.99$ and $|\Delta\mathrm{PPL}_{\mathrm{med}}|<0.1$. LUNA reaches the sweet-spot in 9 of 12 settings.
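The sweet-spot criterion is a conjunction of two strict inequalities, and the per-method count is the number of settings that satisfy both. A small sketch follows; the two thresholds are the ones stated above, while the function names and the per-setting cells are placeholders for illustration.

```python
from typing import Iterable, Tuple

AUROC_MIN = 0.99  # detection threshold stated in the text
SHIFT_MAX = 0.1   # distortion threshold stated in the text


def in_sweet_spot(auroc: float, abs_ppl_shift: float) -> bool:
    """Both conditions are strict, matching AUROC > 0.99 and shift < 0.1."""
    return auroc > AUROC_MIN and abs_ppl_shift < SHIFT_MAX


def count_sweet_spot(cells: Iterable[Tuple[float, float]]) -> int:
    """Count the (AUROC, shift) cells of one method that satisfy both conditions."""
    return sum(in_sweet_spot(a, s) for a, s in cells)


# Placeholder per-setting cells for one method; not results from the paper.
toy_cells = [(0.999, 0.03), (0.995, 0.25), (0.985, 0.02), (0.992, 0.09)]
print(count_sweet_spot(toy_cells))  # 2
```

A cell fails as soon as either condition fails, so a method with saturated AUROC but a large perplexity shift contributes no sweet-spot cells.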

![Refer to caption](https://arxiv.org/html/2606.00613v1/x4.png)

Figure 4: Per-setting $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ on a logarithmic scale for all nine methods across the 12 language-by-domain settings. Rows are ordered by the number of sweet-spot cells (green circles, marking AUROC $>0.99$ and shift $<0.1$). LUNA enters the sweet-spot in 9 of 12 settings; the next-best baseline (MorphMark) enters it in 2 of 12 settings.

### E.3 Per-Baseline Multi-Metric Quality Advantage

Figure [5](https://arxiv.org/html/2606.00613#A5.F5) extends the comparison from the single $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ axis to all five quality metrics simultaneously. Each cell reports the ratio of the baseline's mean distortion to LUNA's; the rightmost column reports the geometric mean across the five metrics. LUNA's geometric-mean advantage over the closest baseline (MorphMark) is $3.5\times$, and the advantage exceeds an order of magnitude against KGW, STELA, EWD, GumbelSoft, and EXP. The largest single-metric ratios are observed for $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ and $|\Delta\mathrm{Surprisal}|$, both of which are governed by the realized next-token probability under the reference model. The lexical-structure metrics ($|\Delta\text{Self-BLEU}|$, $|\Delta\mathrm{Distinct\text{-}1}|$) also show consistent positive gains at smaller magnitudes, suggesting that the quality preservation remains robust across diverse forms of distortion rather than being attributable to improvements in a single metric alone.
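The per-cell ratio and its geometric-mean summary can be sketched as follows. The metric keys and all numeric values are placeholders chosen so the arithmetic is easy to follow; they are not values from the figure.

```python
from statistics import geometric_mean
from typing import Dict


def advantage_ratios(baseline: Dict[str, float], luna: Dict[str, float]) -> Dict[str, float]:
    """Ratio of baseline distortion to LUNA distortion, per metric (>1 favors LUNA)."""
    # Assumes every LUNA value is nonzero; a zero shift would need special handling.
    return {metric: baseline[metric] / luna[metric] for metric in luna}


# Placeholder 12-setting means for one baseline and for LUNA; not from the paper.
toy_baseline = {"ppl_med": 0.40, "surprisal": 0.20, "entropy": 0.10,
                "self_bleu": 0.02, "distinct_1": 0.01}
toy_luna = {"ppl_med": 0.05, "surprisal": 0.05, "entropy": 0.05,
            "self_bleu": 0.02, "distinct_1": 0.02}

ratios = advantage_ratios(toy_baseline, toy_luna)
summary = geometric_mean(ratios.values())
print(ratios)
print(summary)  # approximately 2.0
```

The geometric mean is the natural summary for ratios because it treats a $2\times$ gain and a $2\times$ loss symmetrically; in the toy data one ratio falls below 1 while the summary still exceeds 1.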

![Refer to caption](https://arxiv.org/html/2606.00613v1/x5.png)

Figure 5: Per-baseline multi-metric quality advantage of LUNA, computed as the ratio between each baseline's 12-setting mean and LUNA's on the five quality metrics. Cells with ratio $>1$ indicate LUNA changes the metric by a smaller amount. The rightmost column reports the geometric mean across the five metrics. LUNA holds a uniform advantage on every (baseline, metric) cell of the table across the eight main baselines. The SynthID-Text-Entropy ablation in Section [6.2](https://arxiv.org/html/2606.00613#S6.SS2) is not shown here and is the one comparison in which LUNA does not dominate on every metric.

## Appendix F Behavior Across Experimental Axes

This appendix examines whether the aggregate behavior in Section[6](https://arxiv.org/html/2606.00613#S6)is uniform across the experimental axes\. We report per\-language ranks \([F\.1](https://arxiv.org/html/2606.00613#A6.SS1)\) and per\-domain ranks \([F\.2](https://arxiv.org/html/2606.00613#A6.SS2)\)\.

### F.1 Per-Language Behavior

Table [12](https://arxiv.org/html/2606.00613#A6.T12) reports LUNA's rank among the nine methods on each of the seven metrics, separately for each language. Ranks are computed on the per-language mean over Wikipedia and news. LUNA holds rank 1 on $|\Delta\mathrm{PPL}_{\mathrm{med}}|$ and $|\Delta\mathrm{Surprisal}|$ in all six languages, on $|\Delta\text{Self-BLEU}|$ and $|\Delta\mathrm{Entropy}|$ in five of six, and on $|\Delta\mathrm{Distinct\text{-}1}|$ in three of six. The quality advantage is consistently preserved across typologically diverse languages, including analytic English, isolating Chinese, agglutinative Korean and Japanese, fusional German, and Semitic-templatic Arabic. Detection ranks range from 3 to 8, while remaining entirely within the saturated AUROC regime identified in Section [6.1](https://arxiv.org/html/2606.00613#S6.SS1).

Table 12: LUNA's rank out of 9 methods on each metric, per language, computed on the per-language mean over Wikipedia and news. Bold entries indicate rank 1.

### F.2 Per-Domain Behavior

Appendix [F.1](https://arxiv.org/html/2606.00613#A6.SS1) reports per-language ranks. This subsection reports the same rank summary for the two text domains (Table [13](https://arxiv.org/html/2606.00613#A6.T13)). LUNA is rank 1 on $|\Delta\mathrm{PPL}_{\mathrm{med}}|$, $|\Delta\mathrm{Surprisal}|$, and $|\Delta\mathrm{Entropy}|$ in both Wikipedia and news; on the two lexical-structure metrics it alternates between rank 1 and rank 2 across domains. Detection rank is 6 of 9 in both domains, inside the saturated AUROC band. The trade-off profile is symmetric across the two domains.

Table 13: LUNA's rank out of 9 methods on each metric, per domain, computed on the 6-language mean within each domain. Bold entries indicate rank 1.

## Appendix G Context-Order Analysis

This appendix examines the POS context-order hyperparameter $k$ for the two linguistic methods, LUNA and STELA. Both methods consume the same corpus-estimated POS-context signal, yet they use it in different sampling mechanisms. LUNA turns $\lambda(c_{t})$ into tournament depth, whereas STELA turns the same signal into green-list bias and detector weights. We therefore restrict this analysis to LUNA and STELA, since the goal is to understand how linguistic context length interacts with the two linguistic-signal mechanisms.

### G.1 $k$-Stratified Comparison

Tables [14](https://arxiv.org/html/2606.00613#A7.T14)–[16](https://arxiv.org/html/2606.00613#A7.T16) report the fixed-$k$ means for LUNA and STELA at $k\in\{2,3,4\}$. The comparison shows that LUNA preserves the quality advantage across context orders, while the optimal order varies by method and setting. This pattern motivates the setting-level selection rule in Table [9](https://arxiv.org/html/2606.00613#A2.T9).

Table 14: Fixed-order comparison between LUNA and STELA at $k=2$, averaged over the 12 language-by-domain settings.

Table 15: Fixed-order comparison between LUNA and STELA at $k=3$, averaged over the 12 language-by-domain settings.

Table 16: Fixed-order comparison between LUNA and STELA at $k=4$, averaged over the 12 language-by-domain settings.

### G.2 Context-Order Selection Patterns

The setting-level selections in Table [9](https://arxiv.org/html/2606.00613#A2.T9) show that LUNA and STELA prefer different context lengths. LUNA selects $k\in\{3,4\}$ in 10 of 12 settings, whereas STELA selects $k=2$ in 7 of 12 settings. The two methods agree on the selected $k$ in only 4 of 12 settings. This difference suggests that the same linguistic signal interacts differently with the sampling mechanism. LUNA can exploit longer POS contexts through depth modulation, while STELA often prefers shorter contexts when the signal drives a distortionary green-list bias.

## Appendix H Linguistic Behavior of $\lambda$ Across Languages

This appendix expands the linguistic intuition behind the normalized next-tag entropy $\lambda(c)$. We describe the kind of POS context that LUNA tends to mark as low or high $\lambda$ in each language, and we report the spread $\tau_{2}-\tau_{1}$ measured on the calibration corpus at the selected primary order from Table [9](https://arxiv.org/html/2606.00613#A2.T9). The spread is the gap between the frequency-weighted 25th and 75th percentiles of $\lambda$ in that language; it summarizes how widely $\lambda$ varies across positions, and therefore how often LUNA chooses the deepest tier rather than the shallowest.

#### Why the spread of $\lambda$ matters.

LUNA applies the deep tournament tier only at positions whose $\lambda$ value exceeds $\tau_{2}$. A wider spread therefore means that the deep tier is reserved for positions that are genuinely more uncertain than typical positions in the same language, rather than being applied uniformly. A narrow spread means that most positions sit close to a common $\lambda$ value and the three-tier schedule collapses toward a near-uniform depth assignment. The spread is a property of the language and its tagger, not of the watermark; the watermark only consumes this signal.
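To make the quantities concrete, the sketch below estimates $\lambda$ from tagged sentences, derives $\tau_{1}$ and $\tau_{2}$ as frequency-weighted percentiles, and assigns tiers. Several details are assumptions, since this appendix does not restate them: the context is taken to be the $k$ preceding tags, entropy is normalized by the log of the tagset size, the weighted percentile uses a cumulative-weight rule, and the low tier is taken to be $\lambda \le \tau_{1}$. The tag sequences are toy data.

```python
import math
from collections import Counter, defaultdict


def next_tag_entropy(tag_sequences, k, tagset_size):
    """Map each POS context (k preceding tags) to (lambda, occurrence count)."""
    successors = defaultdict(Counter)
    for tags in tag_sequences:
        for i in range(k, len(tags)):
            successors[tuple(tags[i - k:i])][tags[i]] += 1
    table = {}
    for context, counts in successors.items():
        total = sum(counts.values())
        entropy = sum(-(c / total) * math.log(c / total) for c in counts.values())
        # Assumed normalization: divide by the maximum entropy log(tagset_size).
        table[context] = (entropy / math.log(tagset_size), total)
    return table


def weighted_quantile(value_weight_pairs, q):
    """Smallest value whose cumulative weight reaches a fraction q of the total."""
    pairs = sorted(value_weight_pairs)
    total = sum(weight for _, weight in pairs)
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= q * total:
            return value
    return pairs[-1][0]


def tier(lam, tau1, tau2):
    """Deep tier only above tau2 (as in the text); low-tier boundary is assumed."""
    if lam > tau2:
        return "high"
    return "low" if lam <= tau1 else "mid"


# Toy tagged sentences; not drawn from the calibration corpora.
toy = [["DET", "ADJ", "NOUN"], ["DET", "ADJ", "NOUN"],
       ["PREP", "NOUN", "VERB"], ["PREP", "NOUN", "DET"]]
table = next_tag_entropy(toy, k=2, tagset_size=4)
tau1 = weighted_quantile(table.values(), 0.25)
tau2 = weighted_quantile(table.values(), 0.75)
print(table, tau1, tau2, tau2 - tau1)
```

In the toy data the determiner-adjective context always continues with a noun and receives $\lambda = 0$, while the preposition-noun context splits evenly between two tags and receives a higher value, which mirrors the English example in the per-language discussion.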

#### Per\-language behavior\.

Korean. Korean is agglutinative with overt particles and verbal endings, and the Sejong tagset distinguishes nominal markers, case markers, and verb-ending morphemes. A POS context that ends with a topic marker can be followed by many tag types depending on whether the sentence continues with a verb phrase, a coordinated clause, or an embedded clause. A POS context that ends with a clausal final ending is far more constrained. The two regimes are reflected in a wide $\lambda$ spread.

German. German shows fusional case-and-number agreement and verb-second main-clause syntax (Haider, [2010](https://arxiv.org/html/2606.00613#bib.bib113); Vikner, [1995](https://arxiv.org/html/2606.00613#bib.bib112)). The position immediately after a fronted constituent in a main clause is fixed to a finite verb. The position after a finite verb is much more open, since it can host a subject, an adverb, or a nominal complement depending on the construction. This contrast between syntactically constrained verb-second positions and freer post-verb positions yields a wide $\lambda$ spread, close to Korean.

English. English has light inflection and rigid SVO word order (Quirk et al., [1985](https://arxiv.org/html/2606.00613#bib.bib104)). Determiner-adjective contexts almost always continue with a noun, while preposition-noun contexts can be followed by several functional categories. The result is moderate spread.

Japanese. Japanese is agglutinative like Korean, yet writing mixes hiragana, katakana, and kanji, and SudachiPy splits compound nouns into morphemes (Takaoka et al., [2018](https://arxiv.org/html/2606.00613#bib.bib9)). This segmentation flattens distinctions among many nominal contexts, so $\lambda$ varies less across positions than in Korean despite a comparable underlying morphology.

Chinese. Mandarin Chinese is isolating and uses few overt grammatical markers (Li and Thompson, [1981](https://arxiv.org/html/2606.00613#bib.bib79)). Most POS contexts allow a similar set of continuations, dominated by nouns and verbs, so $\lambda$ remains close to its language-level mean.

Arabic. Arabic combines templatic root-and-pattern morphology with rich agreement and an abjad script (McCarthy, [1981](https://arxiv.org/html/2606.00613#bib.bib117); Ryding, [2005](https://arxiv.org/html/2606.00613#bib.bib119)). The CAMeL Tools tagger emits fine-grained tags that already encode much of this morphological information, so consecutive tags carry a high mutual constraint. Combined with the small selected order $k=2$, the resulting $\lambda$ distribution is comparatively flat.

#### Measured spread\.

Table [17](https://arxiv.org/html/2606.00613#A8.T17) reports $\tau_{1}$, $\tau_{2}$, and the spread $\tau_{2}-\tau_{1}$ averaged over the Wikipedia and news calibration corpora at the selected primary order. The order from widest to narrowest spread is Korean, German, English, Japanese, Chinese, Arabic. This ordering matches the per-language narrative above and supports the interpretation of $\lambda$ as a linguistic capacity signal.

Table 17: Frequency-weighted 25th and 75th percentile thresholds of $\lambda$ for LUNA at the selected primary order from Table [9](https://arxiv.org/html/2606.00613#A2.T9), averaged over Wikipedia and news. The spread $\tau_{2}-\tau_{1}$ summarizes how widely the linguistic capacity signal varies across positions in each language.
