How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework

arXiv cs.CL 05/25/26, 04:00 AM Papers
large-language-models evaluation linguistic register-aware human-likeness corpus-linguistics open-source
Summary
This paper introduces a register-aware linguistic evaluation framework to assess how human-like large language models (LLMs) are by comparing the distribution of 67 lexico-grammatical features between human and LLM-generated texts using Maximum Mean Discrepancy. Experiments across seven instruction-tuned open-source models and five registers show that no model perfectly matches human baselines, and closeness to human language varies by register rather than model size.
arXiv:2605.23651v1 Announce Type: new Abstract: While factual correctness and task-performance have been in focus of Large Language Model (LLM) research for a long time, the fundamental question of how human-like generated texts are on a linguistic level has been underexplored. From a corpus-linguistic perspective, language production is inherently context-dependent, with distinct communicative contexts giving rise to differences in frequencies and co-occurrence patterns of linguistic features. A text failing to adhere to these patterns can be content-wise correct, but still be unfavorable to human readers. In this work, we propose a context-aware evaluation framework in which human-likeness is assessed using a two-sample problem between the linguistic feature distribution of a human reference corpus for a given register and a corresponding LLM-generated corpus. We implement this framework using the Maximum Mean Discrepancy (MMD) and the 67 lexico-grammatical features introduced by Biber, which are commonly applied in corpus linguistics. In our experiments, we compare seven instruction-tuned, open-source models across five English-language datasets spanning distinct registers against a human baseline. While across all tested setups, LLMs deviate from the human baseline, which models are closest to human language depends on the register and is not dictated by model size.
Cached at: 05/25/26, 09:02 AM
# How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework
Source: [https://arxiv.org/html/2605.23651](https://arxiv.org/html/2605.23651)
Björn Nieth¹,⁴, Marianna Gracheva², Michaela Mahlberg²,³, Bjoern Eskofier¹,³,⁵,⁶, Emmanuelle Salin¹

¹ Department Artificial Intelligence in Biomedical Engineering (AIBE), FAU Erlangen-Nürnberg, Germany; ² Department of Digital Humanities and Social Studies (DHSS), FAU Erlangen-Nürnberg, Germany; ³ University of Birmingham, United Kingdom; ⁴ Chair of AI-supported Therapy Decisions, LMU München, Munich, Germany; ⁵ Munich Center for Machine Learning (MCML), Munich, Germany; ⁶ Institute of AI for Health, Helmholtz Zentrum München, Neuherberg, Germany

{bjoern.nieth, marianna.gracheva, michaela.mahlberg, bjoern.eskofier, emmanuelle.salin}@fau.de

###### Abstract

While factual correctness and task\-performance have been in focus of Large Language Model research for a long time, the fundamental question of how human\-like generated texts are on a linguistic level has been underexplored\. From a corpus\-linguistic perspective, language production is inherently context\-dependent, with distinct communicative contexts giving rise to differences in frequencies and co\-occurrence patterns of linguistic features\. A text failing to adhere to these patterns can be content\-wise correct, but still be unfavorable to human readers\. In this work, we propose a context\-aware evaluation framework in which human\-likeness is assessed using a two\-sample problem between the linguistic feature distribution of a human reference corpus for a given register and a corresponding Large Language Model\-generated corpus\. We implement this framework using the Maximum Mean Discrepancy and the 67 lexico\-grammatical features introduced by Biber, which are commonly applied in corpus linguistics\. In our experiments, we compare seven instruction\-tuned, open\-source models across five English\-language datasets spanning distinct registers against a human baseline\. While across all tested setups, Large Language Models deviate from the human baseline, which models are closest to human language depends on the register and is not dictated by model size\.


## 1 Introduction

Research on Large Language Model (LLM) capabilities has largely focused on task- or domain-oriented benchmarks, specialized metrics such as faithfulness (e.g., grounding with respect to a given input or reference, Es et al. ([2024](https://arxiv.org/html/2605.23651#bib.bib10))), or downstream task performance. While such metrics are useful for evaluating specific aspects of Large Language Model abilities, they fail to address the fundamental issue of how closely Large Language Model-generated language resembles human language use. As Large Language Models in the form of personal chatbots are increasingly adopted by the public, the proportion of Large Language Model-generated text encountered in various media will increase. It therefore becomes increasingly important to evaluate how well these generated texts reproduce the specific linguistic patterns observed in human language.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x1.png)

Figure 1: Overview of the proposed evaluation framework. Given a specific register $r$, we collect a human and an LLM-generated corpus. Then we extract a set of 67 linguistic features from the texts. Finally, we calculate the distance between the human sample and the synthetic corpus and resample the distance between human samples as a baseline.

From a linguistic perspective, there exists no standard mode of human language, as human language production inherently depends on situational factors such as communicative purpose, audience, mode, and production circumstances, among others Biber and Conrad ([2019](https://arxiv.org/html/2605.23651#bib.bib7)). These situational factors form a *register*. A general analysis of Large Language Model language without attention to register is meaningless, as the differences between registers (e.g., research articles, news, conversations, etc.) are so great that no general conclusion can be drawn Biber ([2012](https://arxiv.org/html/2605.23651#bib.bib6)), Veirano Pinto ([2023](https://arxiv.org/html/2605.23651#bib.bib35)). Certain features are frequent in a register because they are demanded by a particular situation of use. As such, human language tends to produce common frequencies and co-occurrence patterns of linguistic features for a register. We define register alignment as the ability to produce language that linguistically fits a target register. This is an implicit part of human communication. A Large Language Model-generated text that fails to adhere to the characteristic linguistic features of a register will be perceived as unnatural. This could go as far as the text missing its communicative purpose and therefore being unfavorable to human readers Berber Sardinha ([2024](https://arxiv.org/html/2605.23651#bib.bib4)).

Therefore, a growing body of work has focused on analyzing Large Language Model\-generated texts by examining the linguistic features of these texts\. While these studies provide empirical evidence on the differences between AI\-generated and human\-authored texts, they have limitations, often failing to properly measure the register\-alignment of models\.

To overcome these limitations, we introduce a register-aware evaluation framework focusing on the distributional difference of linguistic features between corpora. An overview of our method is shown in Figure [1](https://arxiv.org/html/2605.23651#S1.F1). We publish our code, generation settings, and results along with this paper (code and results available at: [https://github.com/BjoernNieth/Register_Aware_LLMs](https://github.com/BjoernNieth/Register_Aware_LLMs)). Our main contributions are:

- A method for assessing the human-likeness of Large Language Model-generated texts based on linguistic features and the Maximum Mean Discrepancy (MMD).
- An open-source implementation and analysis of the framework, using the 67 linguistic features introduced by Biber ([1988](https://arxiv.org/html/2605.23651#bib.bib5)).
- A benchmark of five registers with a human baseline and the generation results of seven models in zero- and few-shot settings, together with a further ablation study on prompt stability for one register.

## 2 Related Work

Traditional benchmarks for Large Language Model evaluation focus on task performance (e.g., MMLU Hendrycks et al. ([2021](https://arxiv.org/html/2605.23651#bib.bib16))) or domain understanding, often through multiple-choice questions (ChemBench Mirza et al. ([2025](https://arxiv.org/html/2605.23651#bib.bib26)), LegalBENCH Guha et al. ([2023](https://arxiv.org/html/2605.23651#bib.bib15)), MedQA Jin et al. ([2021](https://arxiv.org/html/2605.23651#bib.bib17))), without assessing whether Large Language Model-generated texts are appropriate to the underlying register. The performance of a Large Language Model on a benchmark is even influenced by the presence and absence of registers in the pre-training data Myntti et al. ([2025](https://arxiv.org/html/2605.23651#bib.bib28)).

The task of document-level Large Language Model-generated text detection is closely linked to our setup, as the same features that allow classification can be used to study language at the corpus level. Two common approaches for Large Language Model-generated text detection are (1) using Large Language Models in a white- or black-box fashion and (2) approaches based on linguistic features Wu et al. ([2025](https://arxiv.org/html/2605.23651#bib.bib38)). Methods such as Ghostbuster Verma et al. ([2024](https://arxiv.org/html/2605.23651#bib.bib36)) and DetectGPT Mitchell et al. ([2023](https://arxiv.org/html/2605.23651#bib.bib27)) use the probabilities emitted by Large Language Models to detect Large Language Model-generated texts. Beyond document-level detection, MAUVE Pillutla et al. ([2021](https://arxiv.org/html/2605.23651#bib.bib31)) uses the quantized embeddings of a Large Language Model to assess human-likeness between two corpora by divergence frontiers Djolonga et al. ([2020](https://arxiv.org/html/2605.23651#bib.bib8)) with the KL-divergence. MAUVE yields a metric for the human-likeness of texts that closely correlates with human judgment. One major drawback of MAUVE is that it relies on model embeddings, which are not human-interpretable. On the other hand, models using linguistic features have been shown to be sufficient to detect Large Language Model-generated texts in a variety of settings Aityan et al. ([2025](https://arxiv.org/html/2605.23651#bib.bib1)); Yadagiri et al. ([2025](https://arxiv.org/html/2605.23651#bib.bib39)); Li and Zhang ([2025](https://arxiv.org/html/2605.23651#bib.bib19)); Zaitsu et al. ([2025](https://arxiv.org/html/2605.23651#bib.bib41)); Przystalski et al. ([2024](https://arxiv.org/html/2605.23651#bib.bib32)). Multiple analytical studies have examined the difference between human and Large Language Model-generated texts at the corpus level using similar linguistic features. They have done this either implicitly on a single register [Zamaraeva et al.](https://arxiv.org/html/2605.23651#bib.bib42); Bagdasarov and Alves ([2025](https://arxiv.org/html/2605.23651#bib.bib3)); [Georgiou](https://arxiv.org/html/2605.23651#bib.bib12) or over multiple registers Reinhart et al. ([2024](https://arxiv.org/html/2605.23651#bib.bib33)); Berber Sardinha ([2024](https://arxiv.org/html/2605.23651#bib.bib4)); Milička et al. ([2025a](https://arxiv.org/html/2605.23651#bib.bib24), [b](https://arxiv.org/html/2605.23651#bib.bib25)). While these studies all show systematic differences in language production between humans and Large Language Models, they come with several limitations.

First, prior studies often assess register-alignment under explicit-register conditioning, typically by asking models to continue an already register-specific human text Reinhart et al. ([2024](https://arxiv.org/html/2605.23651#bib.bib33)); Milička et al. ([2025a](https://arxiv.org/html/2605.23651#bib.bib24), [b](https://arxiv.org/html/2605.23651#bib.bib25)). In this setup, the linguistic patterns for the specific register are already present in the example, and register-alignment only requires the model to continue these patterns. This primarily assesses the model's pretraining rather than an internal notion of register.

Secondly, several analyses compare marginal sample distributions or co\-occurrence patterns of linguistic features, without quantifying the observed differences\. While these results give empirical insights, they lack a clear metric that can be optimized in future work\.

In contrast, we propose a framework measuring register-alignment under (i) implicit-register conditioning, meaning that the model has to infer the target-register style purely from the situational context provided in the prompt. Our framework (ii) employs controlled and transparent prompting, (iii) compares multiple open-source models across architectures and scales, and (iv) formulates human-likeness as a two-sample problem over multidimensional and human-interpretable linguistic features.

## 3 Methodology

An overview of our method is shown in Figure [1](https://arxiv.org/html/2605.23651#S1.F1). The first part consists of corpus generation. Given a target register $r$, we first collect a human reference corpus, subsample a representative subset, and use its metadata to construct natural-language prompts. These prompts instruct Large Language Models to generate texts using similar situational contexts. As the prompts do not contain parts of the original text, they correspond to our definition of the implicit-register conditioning setting. Using different Large Language Models in zero- and few-shot settings, we generate parallel corpora $\tilde{C}_{r,\theta}$ for our register $r$. For each text, we extract a set of linguistic features and encode each corpus into a multidimensional feature space. Finally, in this space, we frame the problem of human-likeness as a two-sample problem quantified by $\mathrm{MMD}(\tilde{C}_{r}^{1},\tilde{C}_{r,\theta})$. To assess whether the observed distance is meaningful, we compare it against $\mathrm{MMD}(\tilde{C}_{r}^{1},\tilde{C}_{r}^{2})$, where $\tilde{C}_{r}^{1}$ and $\tilde{C}_{r}^{2}$ are two disjoint subsets repeatedly subsampled from the full corpus $C_{r}$ to obtain a Confidence Interval for the expected Maximum Mean Discrepancy between human samples.

### 3.1 Data generation

Let $R=\{r_{1},\dots,r_{K}\}$ denote a set of target registers. For each register $r\in R$, we define a corpus $C_{r}=\{(t_{i},m_{i})\}_{i=1}^{N_{r}}$, where $t_{i}\in V^{*}$ is a human-authored text, $V^{*}$ is the set of all finite sequences over a vocabulary $V$, $m_{i}$ is metadata giving specific context about $t_{i}$, and $N_{r}$ is the size of the corpus. We express implicit-register conditioning by defining, for each register $r$, a natural-language prompt template $p_{r}(m_{i})$ that translates metadata into a task prompt describing a language generation task.

Given a model with parameters $\theta$, we generate model outputs $\tilde{t}_{i}$ by autoregressive sampling with prompt template $p_{r}(m_{i})$, resulting in a register-specific synthetic corpus $\tilde{C}_{r,\theta}=\{(\tilde{t}_{i},m_{i})\}_{i=1}^{N_{r}}$.

Taking a set of human demonstrations, we can define a few-shot prompt template $p_{r,s}(m_{i},\tilde{C}_{r})$, where $\tilde{C}_{r}$ is a corpus of human-authored texts of register $r$ with $(t,m)\in\tilde{C}_{r}\implies(t,m)\notin C_{r}$, and $s$ is the number of "shots", i.e., the number of texts taken from $\tilde{C}_{r}$ to construct the prompt. Using this prompt template, we can generate a few-shot synthetic corpus $\tilde{C}_{r,\theta,s}$ by sampling our model in the same fashion. The few-shot examples are selected with a seeded random function, such that every model uses the same prompt for each instance $\tilde{t}_{i}$.
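The zero-shot and few-shot prompt templates described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Record` type, the template string, the `Example k:` layout, and the way the seed is combined with the instance index are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text: str       # human-authored text t_i
    metadata: dict  # situational metadata m_i


def zero_shot_prompt(template: str, metadata: dict) -> str:
    """p_r(m_i): fill a register-specific template with metadata only."""
    return template.format(**metadata)


def few_shot_prompt(template: str, metadata: dict, demos: list,
                    shots: int, seed: int, index: int) -> str:
    """p_{r,s}(m_i, demos): prepend `shots` seeded human demonstrations."""
    # The generator depends only on (seed, index), never on the model,
    # so every model receives the same demonstrations for instance i.
    rng = random.Random(seed * 1_000_003 + index)
    chosen = rng.sample(demos, shots)
    examples = "\n\n".join(
        f"Example {k + 1}:\n{d.text}" for k, d in enumerate(chosen)
    )
    return f"{examples}\n\n{zero_shot_prompt(template, metadata)}"


# Hypothetical template and held-out demonstration corpus (disjoint from C_r).
template = "Write a news article that matches this summary: {summary}"
demos = [Record(f"demo text {k}", {}) for k in range(10)]
meta = {"summary": "A bridge reopens."}

prompt_a = few_shot_prompt(template, meta, demos, shots=2, seed=42, index=0)
prompt_b = few_shot_prompt(template, meta, demos, shots=2, seed=42, index=0)
```

Because the prompt is built from metadata and held-out demonstrations only, the original text $t_i$ never appears in it, which is what the implicit-register conditioning setting requires.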

### 3.2 Distributional framework

Given a corpus $C$ as described above, we define a set of vectors $X=\{x_{i}\}_{i=1}^{N_{r}}$, where $x_{i}\in\mathbb{R}^{d}$, as the result of applying a function $f:V^{*}\to\mathbb{R}^{d}$ to each text in $C$. Each dimension of $x_{i}$ describes the frequency of one specific linguistic feature in text $t_{i}$. In our work, the function $f$ is implemented by a program counting the occurrences of specific linguistic features in a text.

In this framework, we treat the observed sets $X_{r}$ and $\tilde{X}_{r,\theta}$ as samples from two underlying distributions $P_{r}$ and $P_{r,\theta}$. In this interpretation, the discrepancy between a model and the desired human distribution is quantified by $D(X_{r},\tilde{X}_{r,\theta})$, where $D(\cdot,\cdot)$ is some function calculating a distance between two samples of two distributions. In other words, we formulate the problem of human-likeness as a two-sample problem in a linguistic feature space.

### 3.3 Distance Metrics

We implement $D(X_{r},\tilde{X}_{r,\theta})$ using the Maximum Mean Discrepancy. Let $\mathcal{H}_{k}$ be a Reproducing Kernel Hilbert Space associated with a characteristic kernel $k$. The squared Maximum Mean Discrepancy between two distributions $P$ and $Q$ is defined as:

$$
\begin{aligned}
\mathrm{MMD}^{2}_{k}(P,Q) &= \sup_{\|f\|_{\mathcal{H}_{k}}\leq 1}\left(\mathbb{E}_{x\sim P}[f(x)]-\mathbb{E}_{y\sim Q}[f(y)]\right)^{2} \\
&= \left\|\mu_{P}-\mu_{Q}\right\|_{\mathcal{H}_{k}}^{2},
\end{aligned}
$$

where $\mu_{P}:=\mathbb{E}_{x\sim P}[k(x,\cdot)]$ and $\mu_{Q}:=\mathbb{E}_{y\sim Q}[k(y,\cdot)]$ denote the kernel mean embeddings. In other words, the Maximum Mean Discrepancy measures the distance between the mean embeddings of two distributions in the Reproducing Kernel Hilbert Space. For characteristic kernels, the mean embedding is injective, implying that $\mathrm{MMD}(P,Q)=0$ if and only if $P=Q$ Gretton et al. ([2012](https://arxiv.org/html/2605.23651#bib.bib14)).
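As a bridging step between the norm of the mean embeddings and a sample-based estimate, the squared norm can be expanded with the reproducing property $\langle k(x,\cdot),k(y,\cdot)\rangle_{\mathcal{H}_{k}}=k(x,y)$. The primed symbols $x'$ and $y'$ are introduced here only for illustration and denote independent copies of $x$ and $y$:

$$
\begin{aligned}
\left\|\mu_{P}-\mu_{Q}\right\|_{\mathcal{H}_{k}}^{2}
&= \langle\mu_{P},\mu_{P}\rangle_{\mathcal{H}_{k}} - 2\langle\mu_{P},\mu_{Q}\rangle_{\mathcal{H}_{k}} + \langle\mu_{Q},\mu_{Q}\rangle_{\mathcal{H}_{k}} \\
&= \mathbb{E}_{x,x'\sim P}[k(x,x')] - 2\,\mathbb{E}_{x\sim P,\,y\sim Q}[k(x,y)] + \mathbb{E}_{y,y'\sim Q}[k(y,y')].
\end{aligned}
$$

Each of the three expectations involves only kernel evaluations, so it can be approximated by an average over pairs of observed samples.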

Given two samples $X=\{x_{i}\}_{i=1}^{m}\sim P$ and $Y=\{y_{j}\}_{j=1}^{n}\sim Q$, a biased empirical estimate of $\mathrm{MMD}^{2}_{k}(P,Q)$ is given by

$$
\begin{aligned}
\mathrm{MMD}^{2}_{k}(X,Y) &= \frac{1}{m^{2}}\sum_{i=1}^{m}\sum_{j=1}^{m}k(x_{i},x_{j}) \\
&\quad -\frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}k(x_{i},y_{j}) \\
&\quad +\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}k(y_{i},y_{j}).
\end{aligned}
$$

We use an RBF kernel for $k$. The bandwidth parameter is estimated via the median heuristic, i.e., using the median pairwise distance among the pooled samples $X\cup Y$ Gretton et al. ([2012](https://arxiv.org/html/2605.23651#bib.bib14)). This is a standard approach commonly used in the evaluation of generative models Zhu et al. ([2021](https://arxiv.org/html/2605.23651#bib.bib43)); Liang et al. ([2018](https://arxiv.org/html/2605.23651#bib.bib20)); Long et al. ([2015](https://arxiv.org/html/2605.23651#bib.bib22)); Dziugaite et al. ([2015](https://arxiv.org/html/2605.23651#bib.bib9)).
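The biased estimate with an RBF kernel and median-heuristic bandwidth could be computed roughly as in the NumPy sketch below. It is not the authors' code. The kernel parametrization $k(x,y)=\exp(-\|x-y\|^{2}/(2\sigma^{2}))$ is an assumption, since the text does not state which form is used, and the function names are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Squared Euclidean distances between all rows of a and all rows of b."""
    sq = (a ** 2).sum(1)[:, None] - 2.0 * a @ b.T + (b ** 2).sum(1)[None, :]
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding


def median_bandwidth(x: np.ndarray, y: np.ndarray) -> float:
    """Median pairwise distance among the pooled sample (median heuristic)."""
    pooled = np.vstack([x, y])
    dists = np.sqrt(pairwise_sq_dists(pooled, pooled))
    upper = np.triu_indices_from(dists, k=1)  # each distinct pair once
    return float(np.median(dists[upper]))


def mmd2_biased(x: np.ndarray, y: np.ndarray, sigma: float) -> float:
    """Biased MMD^2 estimate: mean(Kxx) - 2 * mean(Kxy) + mean(Kyy)."""
    # Assumed RBF form: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    kxx = np.exp(-pairwise_sq_dists(x, x) / (2.0 * sigma ** 2))
    kxy = np.exp(-pairwise_sq_dists(x, y) / (2.0 * sigma ** 2))
    kyy = np.exp(-pairwise_sq_dists(y, y) / (2.0 * sigma ** 2))
    return float(kxx.mean() - 2.0 * kxy.mean() + kyy.mean())


# Synthetic stand-ins for feature matrices (rows: texts, columns: features).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y_same = rng.normal(size=(200, 5))            # same distribution as x
y_shift = rng.normal(loc=1.0, size=(200, 5))  # shifted distribution

sigma = median_bandwidth(x, y_shift)
d_same = mmd2_biased(x, y_same, sigma)
d_shift = mmd2_biased(x, y_shift, sigma)
```

Taking the mean of each Gram matrix reproduces the $1/m^{2}$, $1/(mn)$, and $1/n^{2}$ normalizations of the three double sums. Because the diagonal terms are included, the estimate is biased and stays slightly above zero even for two samples from the same distribution.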

While other distances could be used to implement $D(\cdot,\cdot)$, we selected the Maximum Mean Discrepancy as it only depends on a single kernel with a single bandwidth parameter. This is a clear advantage over metrics based on the Kullback-Leibler divergence, which require explicit estimation of probability density functions, as used in Bagdasarov and Alves ([2025](https://arxiv.org/html/2605.23651#bib.bib3)).

### 3.4 Sample size

In practice, available corpora are often too large for our method to be computationally feasible\. To strike a balance between distributional expressiveness and computational cost, we sample each corpus using the following method\.

Let $C_{r,\mathrm{full}}$ be a large corpus and $X_{r,\mathrm{full}}$ the corresponding set of linguistic feature vectors as defined above, where "full" means that we use the entire available dataset. We resample a Confidence Interval for a sample size $n\ll|C_{r,\mathrm{full}}|$ by $B$ times subsampling two disjoint sets $X_{1},X_{2}\subset X_{r,\mathrm{full}}$ of size $n$ and calculating $\mathrm{MMD}^{2}(X_{1},X_{2})$. From the resulting $B$ distances, we calculate an $m\%$ Confidence Interval by cutting off the lowest and the highest $(100-m)/2$ percent of the values. Repeating this for different $n$, we can find a dataset size that is a balanced tradeoff between sampling error and computational effort.

For the selected $n$, the upper limit of our $m\%$ Confidence Interval of the Maximum Mean Discrepancy gives a good empirical upper bound for the distances we would expect to see between two samples of the human distribution at our sample size. If the observed distance between a model and the human dataset is larger than the upper limit of our Confidence Interval, this is a strong indicator that, under the experimental setup used, the model sample is farther away than what we would typically expect between two samples of the human distribution.
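The resampling procedure for the human-human baseline might look like the following sketch. It reads the Confidence Interval as the central range of the resampled distances (for 95%, the 2.5th and 97.5th percentiles). The function names, the fixed `sigma`, and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def rbf_mmd2(x: np.ndarray, y: np.ndarray, sigma: float) -> float:
    """Biased MMD^2 with an RBF kernel (assumed form exp(-d^2 / (2 sigma^2)))."""
    def gram(a, b):
        sq = (a ** 2).sum(1)[:, None] - 2.0 * a @ b.T + (b ** 2).sum(1)[None, :]
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))
    return float(gram(x, x).mean() - 2.0 * gram(x, y).mean() + gram(y, y).mean())


def human_baseline_ci(features: np.ndarray, n: int, draws: int, sigma: float,
                      level: float = 95.0, seed: int = 0) -> tuple:
    """Resample `draws` pairs of disjoint size-n subsets and return the CI."""
    if 2 * n > len(features):
        raise ValueError("need at least 2n texts to draw two disjoint subsets")
    rng = np.random.default_rng(seed)
    dists = np.empty(draws)
    for b in range(draws):
        # One permutation yields two disjoint index sets of size n.
        idx = rng.permutation(len(features))[: 2 * n]
        dists[b] = rbf_mmd2(features[idx[:n]], features[idx[n:]], sigma)
    tail = (100.0 - level) / 2.0
    lower, upper = np.percentile(dists, [tail, 100.0 - tail])
    return float(lower), float(upper)


# Synthetic stand-in for the full human feature matrix of one register.
rng = np.random.default_rng(1)
human = rng.normal(size=(1000, 4))

ci_small = human_baseline_ci(human, n=25, draws=50, sigma=2.0)
ci_large = human_baseline_ci(human, n=200, draws=50, sigma=2.0)
```

Running this for several values of `n` and comparing the upper limits is one way to locate the point where further increases in sample size only bring small reductions in sampling error.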

While we only use a subsampled version of the full human corpora for our experiments, we use the full corpus of each register to standardize all experiments with the full human mean and standard deviation. Because Maximum Mean Discrepancy values are only comparable with respect to the same Reproducing Kernel Hilbert Space, we use a pooled version of our full human corpora to obtain the bandwidth of the kernel via the median heuristic for the respective register Gretton et al. ([2012](https://arxiv.org/html/2605.23651#bib.bib14)). This ensures that all distances calculated within one register are comparable.
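One possible reading of this standardization step is sketched below: feature statistics and the kernel bandwidth are computed once per register from the full human corpus and then reused for every human and model sample. The function names and the `max_points` cap (used only to bound memory) are assumptions, not details from the paper.

```python
import numpy as np


def fit_human_scaler(full_human: np.ndarray) -> tuple:
    """Per-feature mean and standard deviation of the full human corpus."""
    mean = full_human.mean(axis=0)
    std = full_human.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)  # guard against constant features
    return mean, std


def standardize(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Apply the human statistics to any corpus, human or synthetic."""
    return (x - mean) / std


def register_bandwidth(human_z: np.ndarray, max_points: int = 1000,
                       seed: int = 0) -> float:
    """Median pairwise distance on the standardized human corpus."""
    rng = np.random.default_rng(seed)
    if len(human_z) > max_points:  # subsampling cap is an assumption
        keep = rng.choice(len(human_z), size=max_points, replace=False)
        human_z = human_z[keep]
    gram = human_z @ human_z.T
    sq = np.diag(gram)[:, None] - 2.0 * gram + np.diag(gram)[None, :]
    dists = np.sqrt(np.maximum(sq, 0.0))
    return float(np.median(dists[np.triu_indices(len(human_z), k=1)]))


# Synthetic stand-ins for one register.
rng = np.random.default_rng(2)
human = rng.normal(loc=5.0, scale=2.0, size=(500, 4))
model = rng.normal(loc=6.0, scale=2.0, size=(500, 4))

mean, std = fit_human_scaler(human)
human_z = standardize(human, mean, std)
model_z = standardize(model, mean, std)  # human statistics, not model statistics
sigma = register_bandwidth(human_z)
```

Standardizing model outputs with the human statistics keeps any shift in the model's feature frequencies visible, and fixing `sigma` per register keeps all distances within that register on the same scale.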

### 3.5 Prompt stability

To test the stability of our generation procedure, we define a set of alternative versions of prompt templates $P_{r}^{\prime}=\{p_{r,1}^{\prime},\dots,p_{r,n}^{\prime}\}$, which differ from $p_{r}$ in ordering, tone, and formality. By constructing different versions of synthetic corpora $C_{r,i}$, it is possible to test whether the observed effects are stable over different prompting techniques.

## 4 Experimental Setup

### 4.1 Linguistic Framework

In this study, we use the set of linguistic features introduced in Biber ([1988](https://arxiv.org/html/2605.23651#bib.bib5)). The feature set consists of 67 lexico-grammatical and discourse features. In his work, [Biber](https://arxiv.org/html/2605.23651#bib.bib5) identified six latent dimensions of linguistic variation on a dataset of 23 registers. These dimensions are based on the co-occurrence of the 67 linguistic features in the studied corpus. The dimensions are interpreted by the functions that their underlying features commonly fulfill in texts, from which a descriptive dimension name is derived, such as Involved vs. Informational Production or Narrative vs. Non-Narrative Concerns. By linearly combining the features using the weights published in Biber ([1988](https://arxiv.org/html/2605.23651#bib.bib5)), the *Dimension Score* of a text can be calculated. In corpus-based text linguistics, comparing these dimension scores between different registers is a standard research method. For the full list of features and their weights on the six dimensions, please refer to the table of feature loadings in the Appendix.
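As a rough illustration of how a Dimension Score is formed, the sketch below standardizes feature frequencies and combines them linearly. The three features, their reference statistics, and the weights are invented placeholders, not values from Biber (1988). Standardizing before weighting is also an assumption about the exact procedure.

```python
import numpy as np


def dimension_score(features: np.ndarray, mean: np.ndarray,
                    std: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted sum of standardized feature frequencies, one score per text."""
    z = (features - mean) / std
    return z @ weights


# Placeholder data: two texts, three hypothetical features.
features = np.array([[2.0, 4.0, 6.0],
                     [1.0, 2.0, 3.0]])
mean = np.array([1.0, 2.0, 3.0])      # hypothetical reference means
std = np.array([1.0, 2.0, 3.0])       # hypothetical reference deviations
weights = np.array([1.0, -1.0, 0.5])  # hypothetical loadings on one dimension

scores = dimension_score(features, mean, std, weights)
```

A text whose frequencies sit exactly at the reference means receives a score of zero on every dimension. Positive and negative weights pull the score toward the two poles named in the dimension label.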

### 4.2 Data

We apply our framework to five different datasets spanning different communicative purposes and situational contexts:

- Spoken conversation (BNC2014Spoken): Transcribed recordings of naturally occurring conversations with rich metadata on speakers, topics, and situational context Love et al. ([2022](https://arxiv.org/html/2605.23651#bib.bib23)).
- Academic writing (S2ORC_ACL): Introductions of ACL main conference papers (2009–2018), parsed via the S2ORC API, with abstracts used as document-level metadata Lo et al. ([2020](https://arxiv.org/html/2605.23651#bib.bib21)).
- Instructive online text (wikiHow): Instructional articles describing step-by-step procedures, using article text as content and titles/headlines as metadata Koupaee and Wang ([2018](https://arxiv.org/html/2605.23651#bib.bib18)).
- Creative writing (WritingPrompts): User-written fictional stories generated in response to prompts posted by other users Fan et al. ([2018](https://arxiv.org/html/2605.23651#bib.bib11)).
- News reporting (XSum): BBC news articles paired with single-sentence summaries Narayan et al. ([2018](https://arxiv.org/html/2605.23651#bib.bib29)).

We apply standard and dataset-specific preprocessing methods to the data, which are described in more detail in Appendix [D](https://arxiv.org/html/2605.23651#A4). The selected multidimensional evaluation framework yields stable linguistic features for texts of at least 400 lexical tokens. For computational efficiency, we limit all texts to a length of 400 lexical tokens, with a soft limit that lets a text run to the end of the next sentence, up to 440 tokens. We apply this to both human and generated texts to avoid adding positional biases to our data. For more details, refer to Appendix [C](https://arxiv.org/html/2605.23651#A3).
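The length limit with its soft extension can be illustrated with the sketch below. Whitespace tokenization, the punctuation-based sentence detection, and the fallback to a cut at the hard limit when no sentence ends within the soft limit are all simplifying assumptions. The paper's own preprocessing is described in its appendix.

```python
def truncate_soft(text: str, hard: int = 400, soft: int = 440) -> str:
    """Cut at the first sentence end at or after `hard` tokens, up to `soft`."""
    tokens = text.split()  # assumption: whitespace tokens stand in for lexical tokens
    if len(tokens) <= hard:
        return " ".join(tokens)
    for end in range(hard, min(soft, len(tokens)) + 1):
        if tokens[end - 1].endswith((".", "!", "?")):
            return " ".join(tokens[:end])
    # Assumed fallback: no sentence boundary inside the soft window.
    return " ".join(tokens[:hard])


# A 500-token text whose only sentence end is at token 410.
words = ["w"] * 500
words[409] = "end."
cut = truncate_soft(" ".join(words))

# A 500-token text without any sentence end falls back to the hard limit.
cut_fallback = truncate_soft(" ".join(["w"] * 500))
```

Applying the same function to human and generated texts keeps both corpora comparable in length, which matters because feature frequencies are sensitive to where in a text the sample is taken.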

### 4.3 Generation Setup

We evaluate a set of seven open-source large language models spanning multiple model families, parameter scales, and training strategies. The evaluated models are Apertus 70B Apertus et al. ([2025](https://arxiv.org/html/2605.23651#bib.bib2)), Llama 3.3 70B and Llama 3.1 8B Grattafiori et al. ([2024](https://arxiv.org/html/2605.23651#bib.bib13)), Qwen 3 32B and Qwen 3 8B Yang et al. ([2025](https://arxiv.org/html/2605.23651#bib.bib40)), and Gemma 3 27B and Gemma 3 12B Team et al. ([2025](https://arxiv.org/html/2605.23651#bib.bib34)). We restrict our scope to instruction-tuned models, as models that are only pretrained tend not to follow instructions Wei et al. ([2022](https://arxiv.org/html/2605.23651#bib.bib37)); Ouyang et al. ([2022](https://arxiv.org/html/2605.23651#bib.bib30)), making them unsuitable for our setup.

The prompt templates are written following general best practice. One example prompt used in our experiments is shown in Table [1](https://arxiv.org/html/2605.23651#S4.T1); for further details, please see Appendix [B](https://arxiv.org/html/2605.23651#A2). To keep the computation within reasonable bounds, we only study prompt stability for the BNC2014Spoken dataset.

Table 1: Prompt specification for the creative-writing corpus (WritingPrompts). Curly-braced expressions denote placeholders replaced at inference time.

### 4.4 Implementation Details

Depending on the size of the model, we use 1 or 2 A100 GPUs with 80 GB of VRAM each. For inference, we use the vLLM library to load the open-source models from HuggingFace. All generations use temperature = 1 and top-p = 1, which is equivalent to sampling from the model's full distribution given the prompt. While sampling parameters influence the lexical diversity of generated texts, they tend not to have a systematic influence in terms of a Multi-Dimensional Analysis after Biber Milička et al. ([2025a](https://arxiv.org/html/2605.23651#bib.bib24)). To extract the linguistic features, we use pybiber (implementation available at: [https://github.com/browndw/pybiber](https://github.com/browndw/pybiber)).

## 5Results

### 5\.1MMD stability

Figure[2](https://arxiv.org/html/2605.23651#S5.F2)shows the human\-humanMMD2\\mathrm\{MMD\}^\{2\}for the XSum dataset with a 95% Confidence Interval\. Respective plots for the other datasets are in Appendix[C](https://arxiv.org/html/2605.23651#A3)\. For all datasets, we observe a sharp decrease in meanMMD2\\mathrm\{MMD\}^\{2\}distance and a tightening of the Confidence Interval when the sample size is increased fromnn=50 tonn=400\. Afterwards, we observe an asymptotic decrease and tightening when further increasing the sample size\. For this study, we fix our sample size to n=600 for all experiments\. We selected this sample size as a suitable tradeoff between sampling error and computation time\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x2.png)Figure 2:MMD2\\mathrm\{MMD\}^\{2\}with a resampled confidence interval for different sample sizes on the XSum dataset\.![Refer to caption](https://arxiv.org/html/2605.23651v1/x3.png)Figure 3:MMD2\\mathrm\{MMD\}^\{2\}for all datasets and models to the respective human corpus, where the points indicate the observedMMD2\\mathrm\{MMD\}^\{2\}and the whiskers show the 95% Confidence Interval resampled on coupled samples from the human and model corpus\. The orange line in each plot gives the respective Human\-HumanMMD2\\mathrm\{MMD\}^\{2\}for the respective datasets with the resampled Confidence Interval\. The models on the y\-axis are sorted by their observedMMD2\\mathrm\{MMD\}^\{2\}distance\. Because the distances between human samples are much smaller than the distances between models and human, the human baseline Confidence Interval is not visible\.
### 5\.2Model vs human

In Figure[3](https://arxiv.org/html/2605.23651#S5.F3), we show theMMD2\\mathrm\{MMD\}^\{2\}distance between the synthetic corpora and the human corpus for each register with an estimated 95% Confidence Interval\. The sampling error between human\-human discussed in Section[5\.1](https://arxiv.org/html/2605.23651#S5.SS1)is much smaller than the observed human\-Large Language Model distances, which further justifies the selected sample size for the experiments\.

For all datasets, all tested models under our experiment setup produce a distribution of linguistic features that is farther away than what we would expect to see between two human samples of the same size\. The distance ordering of the models changes by dataset, thus human\-likeness indeed needs to be evaluated by register\. The Qwen models produce the most human\-like distribution on the BNC2014Spoken, WritingPrompts and XSum datasets, even though they have a relatively few parameters compared to other models tested\. On the S2ORC\_ACL and wikiHow datasets, Apertus 70B is the most human\-like model tested\.

Relative to the human baseline, some registers are harder for the models to generate than others. The upper limit of the Confidence Interval on the $\mathrm{MMD}^2$ between human samples is about 0.0025 for all registers, yet for Spoken conversation the observed $\mathrm{MMD}^2$ values exceed this baseline by a much larger factor than for the wikiHow dataset.

### 5.3 Few-Shot generation

Figure [4](https://arxiv.org/html/2605.23651#S5.F4) shows the influence of few-shot examples on the $\mathrm{MMD}^2$. In some cases, providing the model with examples in the prompt helps to generate samples closer to the human distribution; the effect, however, is minor in most cases. One exception is Llama 8B: for the WritingPrompts and the XSum dataset, it generates samples in the Zero-Shot setting that are relatively far away from the human distribution, and demonstrations decrease this distance drastically. Interestingly, for the BNC2014Spoken dataset, the introduction of more than one example worsened the human-likeness of Llama 8B generations.

![Refer to caption](https://arxiv.org/html/2605.23651v1/figures/MMD_few_shot_AI_Human_ALL_DATASETS_edited.png)
Figure 4: Overview of the proposed evaluation framework. Human and LLM-generated texts are represented as distributions in a multidimensional register feature space. Distributional distances quantify similarity relative to human-human baselines, while classical register dimensions are used for interpretation rather than evaluation. An $\mathrm{MMD}^2$ closer to zero is better.

### 5.4 Prompt stability

Figure [5](https://arxiv.org/html/2605.23651#S5.F5) shows the results of the prompt stability experiments. While a variation in the prompt changes the absolute value of the observed distances, the overall ranking of the models is consistent over all prompt variations. Smaller models appear to be more susceptible to prompt variations.

In Figure [8](https://arxiv.org/html/2605.23651#A9.F8) in the Appendix, the correlation of the observed $\mathrm{MMD}^2$ between the different prompt variations is shown. All correlation values are $\geq 0.985$, which indicates that all prompt variations are highly correlated.
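The kind of check described here can be sketched as follows. The $\mathrm{MMD}^2$ values are hypothetical placeholders, and the use of the Pearson correlation is an assumption, since the correlation measure is not specified in this section.

```python
import numpy as np

# Hypothetical MMD^2 values: one row per prompt variant, one column per model.
mmd2_by_prompt = np.array([
    [0.021, 0.034, 0.048, 0.090],
    [0.024, 0.036, 0.051, 0.097],
    [0.019, 0.031, 0.047, 0.085],
])

# Pearson correlation between every pair of prompt variants (rows).
corr = np.corrcoef(mmd2_by_prompt)

# Rank of each model under each prompt variant (0 = closest to human).
ranks = np.argsort(np.argsort(mmd2_by_prompt, axis=1), axis=1)
ranking_is_stable = bool(np.all(ranks == ranks[0]))
```

A high correlation together with identical rank rows corresponds to the situation reported above: absolute distances shift with the prompt, while the model ordering stays the same.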

![Refer to caption](https://arxiv.org/html/2605.23651v1/x4.png)
Figure 5: $\mathrm{MMD}^2$ for the prompt stability experiments to the human reference sample of the BNC2014Spoken. Dots indicate the mean value over all prompts, while the band shows the minimum and maximum observed distance for the respective model under all prompt variations.

### 5.5 Biber dimensions

While the Maximum Mean Discrepancy provides a metric for the overall distance between human and Large Language Model-generated texts, it offers no human-interpretable explanation of the observed distances. For that purpose, we analyze the six dimensions introduced in Biber ([1988](https://arxiv.org/html/2605.23651#bib.bib5)). An example for the first dimension on the BNC2014Spoken dataset is shown in Figure [6](https://arxiv.org/html/2605.23651#S5.F6). This dimension can be interpreted as how involved or informational a text is. For Spoken conversation, all tested models produce texts that are less involved than the human baseline.

The resulting plots for all datasets and dimensions can be found in Appendix Figures [9](https://arxiv.org/html/2605.23651#A9.F9), [10](https://arxiv.org/html/2605.23651#A9.F10), [11](https://arxiv.org/html/2605.23651#A9.F11), [12](https://arxiv.org/html/2605.23651#A9.F12) and [13](https://arxiv.org/html/2605.23651#A9.F13). Some notable observations can be drawn about the behavior of the models. On dimension 2, which concerns narrative vs non-narrative discourse, all models except Apertus tend to generate academic texts that are more narrative than the human baseline. For Creative Writing, some models are less narrative than the human baseline, while others are comparable or more narrative. On dimension 4, Overt Expression of Persuasion, human texts tend to be more persuasive than the texts of the tested models for the wikiHow dataset and Creative Writing.
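For readers unfamiliar with dimension scores, the sketch below follows the general procedure of Biber (1988): feature frequencies are standardized, and the z-scores of positively loaded features are summed while those of negatively loaded features are subtracted. The feature subset, the choice of the human corpus as standardization reference, and the synthetic frequencies are assumptions for illustration only.

```python
import numpy as np

# Illustrative subset of Dimension 1 features (not the full feature list).
POSITIVE = ["private_verbs", "contractions", "second_person_pronouns"]
NEGATIVE = ["nouns", "prepositions", "attributive_adjectives"]
FEATURES = POSITIVE + NEGATIVE


def dimension_scores(counts, ref_mean, ref_std):
    """Per-text score: sum of z-scores of positively loaded features
    minus sum of z-scores of negatively loaded features."""
    z = (counts - ref_mean) / ref_std
    pos = [FEATURES.index(f) for f in POSITIVE]
    neg = [FEATURES.index(f) for f in NEGATIVE]
    return z[:, pos].sum(axis=1) - z[:, neg].sum(axis=1)


# Synthetic normalized frequencies (texts x features), hypothetical values.
rng = np.random.default_rng(0)
human = rng.normal([20, 30, 25, 180, 100, 40], 5.0, size=(300, 6))
model = rng.normal([15, 20, 20, 200, 110, 50], 5.0, size=(300, 6))

# Assumption: standardize against the human reference corpus.
ref_mean, ref_std = human.mean(axis=0), human.std(axis=0, ddof=1)
human_scores = dimension_scores(human, ref_mean, ref_std)
model_scores = dimension_scores(model, ref_mean, ref_std)
```

In this toy setup the model texts receive lower scores than the human texts, which on dimension 1 would be read as less involved and more informational.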

![Refer to caption](https://arxiv.org/html/2605.23651v1/x5.png)
Figure 6: Violin plot of Biber dimension 1 on BNC2014Spoken for human and models in the Zero-Shot setting.

### 5.6 Features

The marginal feature distributions already reveal simple patterns of difference in human-understandable terms. These results are further detailed in Appendix [G](https://arxiv.org/html/2605.23651#A7).

For all registers, all models tend to use Present Participle Clauses and Nominalizations more frequently than humans, except for the Qwen models on the WritingPrompts dataset. This is consistent with previous research (Reinhart et al. ([2024](https://arxiv.org/html/2605.23651#bib.bib33))). On all datasets, all models tend to produce longer words than humans. Except for the Qwen models on the WritingPrompts dataset, all models tend to use Attributive Adjectives more frequently, and, except for Llama 8B and Gemma 12B on the WritingPrompts dataset, use past tense less frequently. Other nouns occur more frequently in spoken conversation, Creative Writing, and wikiHow, but less frequently in Academic writing.

Apart from these global trends, interesting patterns emerge for single registers. For Spoken conversation, all models show increased rates of Pied Piping, while for Creative Writing, Past Participle Clauses are used more often than in human texts. For Academic Writing, split infinitives and phrasal coordinations occur more frequently across all models, and in XSum, that as a subject is generally produced more often. These patterns show that certain biases for single features exist across all model families and registers, while some biases are register-specific.

## 6 Ablation

### 6.1 Model-Model-Distances

Using the same bandwidth and normalization as for the human-model distance, the distance between the models within one register can be calculated. The results are shown in Appendix Figures [24](https://arxiv.org/html/2605.23651#A9.F24), [25](https://arxiv.org/html/2605.23651#A9.F25), [26](https://arxiv.org/html/2605.23651#A9.F26), [27](https://arxiv.org/html/2605.23651#A9.F27) and [28](https://arxiv.org/html/2605.23651#A9.F28). It can be observed that models of the same family tend to cluster together linguistically. Models of the Gemma family show large differences from all other models.

### 6.2 Linguistic feature variability

We compute the sum of the variances of all 67 linguistic features after standardization\. We compare them to a human baseline, for which we resampled a Confidence Interval for the variance on the full human sample\.

The full results are shown in Appendix Figure[29](https://arxiv.org/html/2605.23651#A9.F29)\. For all datasets except the BNC2014Spoken, the models tend to show less variability than the human baseline\. For the BNC2014Spoken, the models show more variability than humans\. Interestingly, for academic writing, both Apertus 70B and Llama 8B are within the human Confidence Interval\. From this, we can conclude that for four out of five registers, the tested models tend to underproduce the variability observed in humans\.
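A minimal sketch of this variability measure is given below. It assumes that all corpora are standardized with the mean and standard deviation of the human corpus (standardizing each corpus with its own statistics would make the sum trivially equal to the number of features) and that the Confidence Interval is a percentile bootstrap; both are assumptions, and the data are synthetic.

```python
import numpy as np


def total_variance(features, ref_mean, ref_std):
    """Sum over features of the variance of the standardized values."""
    z = (features - ref_mean) / ref_std
    return float(z.var(axis=0, ddof=1).sum())


def bootstrap_ci(features, ref_mean, ref_std, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the total variance of one corpus."""
    rng = np.random.default_rng(seed)
    n = len(features)
    stats = [
        total_variance(features[rng.integers(0, n, size=n)], ref_mean, ref_std)
        for _ in range(n_resamples)
    ]
    alpha = (1.0 - level) / 2.0
    return float(np.quantile(stats, alpha)), float(np.quantile(stats, 1.0 - alpha))


# Synthetic feature matrices (texts x 67 features); the model varies less.
rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, size=(500, 67))
model = rng.normal(0.0, 0.7, size=(500, 67))

ref_mean, ref_std = human.mean(axis=0), human.std(axis=0, ddof=1)
human_var = total_variance(human, ref_mean, ref_std)
model_var = total_variance(model, ref_mean, ref_std)
low, high = bootstrap_ci(human, ref_mean, ref_std)
```

A model value below `low` would correspond to the underproduction of variability described above, and a value above `high` to the overproduction observed for the BNC2014Spoken.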

## 7 Discussion

In our experimental setup, all models, for all prompting strategies, fail to produce a linguistic-feature distribution that is within the distance we would expect between human samples of the same size. The relative ordering of the models changes by register. These results highlight the necessity of a register-aware evaluation method for studying Large Language Model language. Aggregating results across registers could obscure systematic differences, as a model ranked among the most human-like for one register does not necessarily rank among the most human-like for another. This is further supported by our classifier ablation in Appendix [E](https://arxiv.org/html/2605.23651#A5), where a classifier jointly trained on all registers achieves lower accuracy than register-specific classifiers. This is in line with the register theory of Biber ([1988](https://arxiv.org/html/2605.23651#bib.bib5)) and Biber and Conrad ([2019](https://arxiv.org/html/2605.23651#bib.bib7)), which holds that linguistic patterns are induced by their situational context of use.

Large Language Model behavior is primarily influenced by model architecture, training strategy, and training data\. We summarize the available information about the tested models in Appendix[H](https://arxiv.org/html/2605.23651#A8)\. Based on this information, we formulate three hypotheses to explain some of the observed behaviors, focusing on training data, model family, and instruction\-tuning\.

First, distillation and synthetic data in model training do not necessarily alter the register\-alignment of models: Across all registers, Apertus 70B and the Qwen variants consistently rank among the models closest to the human reference distribution\. The Qwen models are pretrained on large\-scale mixtures of public and synthetic data and involve explicit distillation from teacher LLMs, whereas Apertus is trained from scratch on non\-distilled data\.

Second, family\-specific training and architectures imprint stable stylistic priors on the linguistic feature distribution for our observed registers: Models of the same family tend to cluster closer together than models of different families\. This even holds for models of vastly different parameter sizes\. The question remains whether these stylistic priors are introduced from differences in the training data or from other model hyperparameters such as the pretraining loss, the alignment strategy or even differences in activation functions\. Fully open\-source models like Apertus can be used in future research to study where these priors are introduced\.

Third, instruction tuning introduces priors on the linguistic feature distribution of generated texts: Apertus was instruction-tuned on ~4 million samples (Apertus et al. ([2025](https://arxiv.org/html/2605.23651#bib.bib2))), while Llama 3.3 70B was instruction-tuned "on over 10 million" samples (number taken from the official HuggingFace blog post, [https://huggingface.co/blog/llama3](https://huggingface.co/blog/llama3)). Except for the BNC2014Spoken dataset, Apertus is consistently closer to the human baseline than Llama 3.3. Also, in explicit register-conditioning experiments, the non-instruction-tuned version of Llama is much closer to the human baseline (Reinhart et al. ([2024](https://arxiv.org/html/2605.23651#bib.bib33))).

The Maximum Mean Discrepancy metric introduced in this work can serve as an optimization target in future work to improve the language production of Large Language Models. Minimizing the Maximum Mean Discrepancy between Large Language Model-produced texts and representative register samples will encourage Large Language Models to produce texts that are (1) linguistically appropriate for the purpose they fulfill and (2) drawn from the full human distribution, rather than mode collapsing to a narrower distribution of linguistic features. Such a metric can also be incorporated into the instruction tuning of a model to track how much the model deviates from human language over the course of training, or even to test whether the instruction tuning data itself follows a human-like distribution of linguistic features.

The data from this study can be used in subsequent work to further study differences between Large Language Model and human language, for example by using different or additional linguistic features. A more in-depth prompt sensitivity analysis is another direction for future work.

## 8 Conclusion

We propose a framework for evaluating the human-likeness of LLM-generated texts across different registers. To implement this framework, we used the classic linguistic features of [Biber](https://arxiv.org/html/2605.23651#bib.bib5). We pose human-likeness as a two-sample problem between a synthetic and human corpus measured by the Maximum Mean Discrepancy. As such, this work takes a first step toward combining register-functional text linguistics with Large Language Model benchmarking.

Our empirical analysis across five datasets and seven open-source Large Language Models demonstrates that Large Language Model-generated texts differ systematically from a human baseline in their distribution of linguistic features. These results are consistent for all tested models and are stable across prompt variations. Human-likeness as formulated by our framework is not dictated by model size, with the smaller Qwen 8B and 27B models performing relatively well across nearly all datasets. Overall, our results show that human-likeness is register-dependent and as such must always be evaluated in a register-sensitive way.

## 9 Limitations

This study comes with multiple limitations. First, our analysis only used English-language datasets and a fixed set of linguistic features from one specific corpus-linguistic framework. While these features are well established in corpus linguistics, they are not exhaustive and likely do not cover all aspects of systematic deviation between Large Language Model and human language.

Second, our experiments tested generation only under naive zero\-shot and few\-shot generation setups\. While these setups yielded consistent results under our prompt stability analysis, n\-shot prompting covers only a small part of prompting techniques\. This leads to follow\-up questions such as the influence of reasoning or further model tuning on the linguistic properties of model output\.

Third, the proposed framework captures only these particular linguistic properties of the generated texts. Any other properties of Large Language Models and generated texts are not assessed. There is no guarantee that the texts closest to the human distribution also fulfill other properties, such as fluency, factual correctness, or telling a convincing story.

## References

- Aityan et al. (2025) Sergey K. Aityan, William Claster, Karthik Sai Emani, Sohni Rais, and Thy Tran. 2025. [A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features](https://doi.org/10.48550/ARXIV.2511.21744). In *Proceedings of the Multi-Domain Detection of AI-Generated Text Shared Task*. ACL.
- Apertus et al\. \(2025\)Project Apertus, Alejandro Hernández\-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni\-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, and 84 others\. 2025\.[Apertus: Democratizing Open and Compliant LLMs for Global Language Environments](https://arxiv.org/abs/2509.14233)\.
- Bagdasarov and Alves (2025) Sergei Bagdasarov and Diego Alves. 2025. Like a Human? A Linguistic Analysis of Human-written and Machine-generated Scientific Texts. In *Proceedings of the First on Natural Language Processing and Language Models for Digital Humanities*.
- Berber Sardinha \(2024\)Tony Berber Sardinha\. 2024\.[AI\-generated vs human\-authored texts: A multidimensional comparison](https://doi.org/10.1016/j.acorp.2023.100083)\.*Applied Corpus Linguistics*, 4\(1\):100083\.
- Biber \(1988\)Douglas Biber\. 1988\.*Variation across speech and writing*\.Cambridge University Press, Cambridge\.
- Biber \(2012\)Douglas Biber\. 2012\.[Register as a predictor of linguistic variation](https://doi.org/10.1515/cllt-2012-0002)\.*Corpus Linguistics and Linguistic Theory*, 8\(1\):9–37\.
- Biber and Conrad (2019) Douglas Biber and Susan Conrad. 2019. [*Register, genre, and style*](https://doi.org/10.1017/9781108686136), second edition. Cambridge textbooks in linguistics. Cambridge University Press, Cambridge New York.
- Djolonga et al\. \(2020\)Josip Djolonga, Mario Lucic, Marco Cuturi, Olivier Bachem, Olivier Bousquet, and Sylvain Gelly\. 2020\.[Precision\-recall curves using information divergence frontiers](http://proceedings.mlr.press/v108/djolonga20a.html)\.In*The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26\-28 August 2020, Online \[Palermo, Sicily, Italy\]*, volume 108 of*Proceedings of Machine Learning Research*, pages 2550–2559\. PMLR\.
- Dziugaite et al\. \(2015\)Gintare Karolina Dziugaite, Daniel M\. Roy, and Zoubin Ghahramani\. 2015\.[Training generative neural networks via maximum mean discrepancy optimization](http://auai.org/uai2015/proceedings/papers/230.pdf)\.In*Proceedings of the Thirty\-First Conference on Uncertainty in Artificial Intelligence, UAI 2015, July 12\-16, 2015, Amsterdam, The Netherlands*, pages 258–267\. AUAI Press\.
- Es et al\. \(2024\)Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert\. 2024\.[RAGAs: Automated evaluation of retrieval augmented generation](https://aclanthology.org/2024.eacl-demo.16)\.In*Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations*, pages 150–158, St\. Julians, Malta\. Association for Computational Linguistics\.
- Fan et al\. \(2018\)Angela Fan, Mike Lewis, and Yann Dauphin\. 2018\.[Hierarchical neural story generation](https://doi.org/10.18653/v1/P18-1082)\.In*Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 889–898, Melbourne, Australia\. Association for Computational Linguistics\.
- \(12\)Georgios P Georgiou\.Differentiating between human\-written and AI\-generated texts using linguistic features automatically extracted from an online computational tool\.*Information*\.
- Grattafiori et al\. \(2024\)Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others\. 2024\.[The Llama 3 Herd of Models](https://arxiv.org/abs/2407.21783)\.
- Gretton et al\. \(2012\)Arthur Gretton, Karsten M\. Borgwardt, Malte J\. Rasch, Bernhard Schölkopf, and Alexander Smola\. 2012\.[A kernel two\-sample test](http://jmlr.org/papers/v13/gretton12a.html)\.*Journal of Machine Learning Research*, 13\(25\):723–773\.
- Guha et al\. \(2023\)Neel Guha, Julian Nyarko, Daniel E\. Ho, Christopher Ré, Adam Chilton, Aditya K, Alex Chohlas\-Wood, Austin Peters, Brandon Waldon, Daniel N\. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M\. Dickinson, Haggai Porat, Jason Hegland, and 21 others\. 2023\.[Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models](http://papers.nips.cc/paper_files/paper/2023/hash/89e44582fd28ddfea1ea4dcb0ebbf4b0-Abstract-Datasets_and_Benchmarks.html)\.In*Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023*\.
- Hendrycks et al\. \(2021\)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt\. 2021\.[Measuring massive multitask language understanding](https://openreview.net/forum?id=d7KBjmI3GmQ)\.In*9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021*\. OpenReview\.net\.
- Jin et al\. \(2021\)Di Jin, Eileen Pan, Nassim Oufattole, Wei\-Hung Weng, Hanyi Fang, and Peter Szolovits\. 2021\.[What Disease Does This Patient Have? A Large\-Scale Open Domain Question Answering Dataset from Medical Exams](https://doi.org/10.3390/app11146421)\.*Applied Sciences*, 11\(14\):6421\.
- Koupaee and Wang \(2018\)Mahnaz Koupaee and William Yang Wang\. 2018\.[WikiHow: A Large Scale Text Summarization Dataset](https://arxiv.org/abs/1810.09305)\.
- Li and Zhang \(2025\)Ziqi Li and Qi Zhang\. 2025\.Linguistic Differences between AI and Human Comments in Weibo: Detect AI\-Generated Text through Stylometric Features\.
- Liang et al\. \(2018\)Shiyu Liang, Yixuan Li, and R\. Srikant\. 2018\.[Enhancing the reliability of out\-of\-distribution image detection in neural networks](https://openreview.net/forum?id=H1VGkIxRZ)\.In*6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 \- May 3, 2018, Conference Track Proceedings*\. OpenReview\.net\.
- Lo et al\. \(2020\)Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld\. 2020\.[S2ORC: The semantic scholar open research corpus](https://doi.org/10.18653/v1/2020.acl-main.447)\.In*Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 4969–4983, Online\. Association for Computational Linguistics\.
- Long et al\. \(2015\)Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I\. Jordan\. 2015\.[Learning transferable features with deep adaptation networks](http://proceedings.mlr.press/v37/long15.html)\.In*Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6\-11 July 2015*, volume 37 of*JMLR Workshop and Conference Proceedings*, pages 97–105\. JMLR\.org\.
- Love et al\. \(2022\)Robbie Love, Claire Dembry, Andrew Hardie, Vaclav Brezina, and Tony McEnery\. 2022\.[The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations](https://doi.org/10.1075/ijcl.22.3.02lov)\.*International Journal of Corpus Linguistics*, pages 319–344\.
- Milička et al\. \(2025a\)Jiří Milička, Anna Marklová, and Václav Cvrček\. 2025a\.[AI Brown and AI Koditex: LLM\-Generated Corpora Comparable to Traditional Corpora of English and Czech Texts](https://arxiv.org/abs/2509.22996)\.
- Milička et al\. \(2025b\)Jiří Milička, Anna Marklová, and Václav Cvrček\. 2025b\.[Benchmark of stylistic variation in LLM\-generated texts](https://arxiv.org/abs/2509.10179)\.
- Mirza et al\. \(2025\)Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Martiño Ríos\-García, Benedict Emoekabu, Aswanth Krishnan, Tanya Gupta, Mara Schilling\-Wilhelmi, Macjonathan Okereke, Anagha Aneesh, Mehrdad Asgari, Juliane Eberhardt, Amir Mohammad Elahi, Hani M\. Elbeheiry, María Victoria Gil, Christina Glaubitz, Maximilian Greiner, Caroline T\. Holick, Tim Hoffmann, and 16 others\. 2025\.[A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists](https://doi.org/10.1038/s41557-025-01815-x)\.*Nature Chemistry*, 17\(7\):1027–1034\.
- Mitchell et al\. \(2023\)Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D\. Manning, and Chelsea Finn\. 2023\.[Detectgpt: Zero\-shot machine\-generated text detection using probability curvature](https://proceedings.mlr.press/v202/mitchell23a.html)\.In*International Conference on Machine Learning, ICML 2023, 23\-29 July 2023, Honolulu, Hawaii, USA*, volume 202 of*Proceedings of Machine Learning Research*, pages 24950–24962\. PMLR\.
- Myntti et al. (2025) Amanda Myntti, Erik Henriksson, Veronika Laippala, and Sampo Pyysalo. 2025. [Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation](https://arxiv.org/abs/2504.01542). *ArXiv preprint*, abs/2504.01542.
- Narayan et al\. \(2018\)Shashi Narayan, Shay B\. Cohen, and Mirella Lapata\. 2018\.[Don’t give me the details, just the summary\! topic\-aware convolutional neural networks for extreme summarization](https://doi.org/10.18653/v1/D18-1206)\.In*Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 1797–1807, Brussels, Belgium\. Association for Computational Linguistics\.
- Ouyang et al\. \(2022\)Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L\. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F\. Christiano, Jan Leike, and Ryan Lowe\. 2022\.[Training language models to follow instructions with human feedback](http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html)\.In*Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \- December 9, 2022*\.
- Pillutla et al\. \(2021\)Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaïd Harchaoui\. 2021\.[MAUVE: measuring the gap between neural text and human text using divergence frontiers](https://proceedings.neurips.cc/paper/2021/hash/260c2432a0eecc28ce03c10dadc078a4-Abstract.html)\.In*Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6\-14, 2021, virtual*, pages 4816–4828\.
- Przystalski et al\. \(2024\)Karol Przystalski, Jan Argasiński, Iwona Grabska\-Gradzińska, and Jeremi Ochab\. 2024\.[Stylometry Recognizes Human and Llm\-Generated Texts in Short Samples](https://doi.org/10.2139/ssrn.4950812)\.
- Reinhart et al\. \(2024\)Alex Reinhart, Ben Markey, Michael Laudenbach, Kachatad Pantusen, Ronald Yurko, Gordon Weinberg, and David West Brown\. 2024\.[Do LLMs write like humans? Variation in grammatical and rhetorical styles](https://arxiv.org/abs/2410.16107)\.*ArXiv preprint*, abs/2410\.16107\.
- Team et al\. \(2025\)Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean\-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others\. 2025\.[Gemma 3 Technical Report](https://arxiv.org/abs/2503.19786)\.
- Veirano Pinto (2023) Marcia Veirano Pinto. 2023. [Elena Seoane and Douglas Biber (eds.), Corpus-based approaches to register variation (Studies in Corpus Linguistics 103). Amsterdam and Philadelphia: John Benjamins, 2021. Pp. xi + 341. ISBN 97827210548.](https://doi.org/10.1017/S1360674322000478) *English Language and Linguistics*, 27(3):640–646.
- Verma et al\. \(2024\)Vivek Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein\. 2024\.[Ghostbuster: Detecting text ghostwritten by large language models](https://aclanthology.org/2024.naacl-long.95)\.In*Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 1702–1717, Mexico City, Mexico\. Association for Computational Linguistics\.
- Wei et al\. \(2022\)Jason Wei, Maarten Bosma, Vincent Y\. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M\. Dai, and Quoc V\. Le\. 2022\.[Finetuned language models are zero\-shot learners](https://openreview.net/forum?id=gEZrGCozdqR)\.In*The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022*\. OpenReview\.net\.
- Wu et al\. \(2025\)Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Lidia Sam Chao, and Derek Fai Wong\. 2025\.[A Survey on LLM\-Generated Text Detection: Necessity, Methods, and Future Directions](https://doi.org/10.1162/coli_a_00549)\.*Computational Linguistics*, 51\(1\):275–338\.
- Yadagiri et al\. \(2025\)Annepaka Yadagiri, Teja, L\. D\. M\. S\. Sa, Pakray, Partha, and Chunka, Chukhu\. 2025\.AI\-Generated Text Detection Using DeBERTa with Auxiliary Stylometric Features\.In*Proceedings of the Multi\-Domain Detection of AI\-Generated Text Shared Task*\.
- Yang et al\. \(2025\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others\. 2025\.[Qwen3 Technical Report](https://arxiv.org/abs/2505.09388)\.
- Zaitsu et al\. \(2025\)Wataru Zaitsu, Mingzhe Jin, Shunichi Ishihara, Satoru Tsuge, and Mitsuyuki Inaba\. 2025\.[Stylometry can reveal artificial intelligence authorship, but humans struggle: A comparison of human and seven large language models in Japanese](https://doi.org/10.1371/journal.pone.0335369)\.*PLOS One*, 20\(10\):e0335369\.
- (42) Olga Zamaraeva, Dan Flickinger, Francis Bond, and Carlos Gómez-Rodríguez. Comparing LLM-generated and human-authored news text using formal syntactic theory. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*.
- Zhu et al\. \(2021\)Yongchun Zhu, Fuzhen Zhuang, Jindong Wang, Guolin Ke, Jingwu Chen, Jiang Bian, Hui Xiong, and Qing He\. 2021\.[Deep Subdomain Adaptation Network for Image Classification](https://doi.org/10.1109/TNNLS.2020.2988928)\.*IEEE Transactions on Neural Networks and Learning Systems*, 32\(4\):1713–1722\.

## Appendix A Subsampling strategy

To draw our evaluation and few-shot datasets from the larger human dataset, we employ the following strategy. First, we exclude the 5% of samples with the longest metadata, measured in tokens for the Llama models, from being selected for these sets. This step is necessary due to metadata outliers in online registers, such as the WritingPrompts dataset, where some unusually long story prompts occur. Because of the relatively small size and hand-curated nature of the BNC2014Spoken corpus, we exclude only the longest 3% there. The exclusion makes the generation step more efficient by providing an upper bound on the maximum context length required per prompt.

For the subsampling, we draw a random dataset from the remaining human distribution 1000 times and, for each draw, compute the aggregate marginal Wasserstein distance between the six Biber dimensions of the full dataset (including the samples with a long context) and the selected samples. We take the draw with the lowest overall Wasserstein distance as our corpus. We repeat the same procedure for the few-shot dataset, but additionally exclude all samples already selected for the first dataset from occurring in the few-shot dataset.
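The selection procedure can be sketched as follows. Summing the per-dimension distances as the aggregate, the function names, the reduced number of draws in the example, and the synthetic dimension scores are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def aggregate_wasserstein(full, subset):
    """Sum of 1-D Wasserstein distances over the columns (dimensions)."""
    return sum(
        wasserstein_distance(full[:, j], subset[:, j]) for j in range(full.shape[1])
    )


def select_subsample(dim_scores, eligible, size, n_draws=1000, seed=0):
    """Return the indices of the random draw closest to the full dataset.

    dim_scores: (n_texts, 6) array of dimension scores for the full corpus.
    eligible:   indices that may be selected (long-context texts removed).
    """
    rng = np.random.default_rng(seed)
    best_idx, best_dist = None, np.inf
    for _ in range(n_draws):
        idx = rng.choice(eligible, size=size, replace=False)
        # The reference is the full dataset, including long-context texts.
        dist = aggregate_wasserstein(dim_scores, dim_scores[idx])
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist


# Synthetic corpus: six dimension scores and a metadata length per text.
rng = np.random.default_rng(1)
dim_scores = rng.normal(size=(2000, 6))
meta_len = rng.integers(10, 500, size=2000)

# Exclude the 5% of texts with the longest metadata from selection.
eligible = np.flatnonzero(meta_len <= np.quantile(meta_len, 0.95))
eval_idx, eval_dist = select_subsample(dim_scores, eligible, size=200, n_draws=100)

# Few-shot set: same procedure, without the texts already selected.
remaining = np.setdiff1d(eligible, eval_idx)
few_shot_idx, _ = select_subsample(dim_scores, remaining, size=50, n_draws=100, seed=2)
```

Keeping the best of many random draws is a simple search that needs only one-dimensional distances, which is cheap compared with evaluating a kernel-based distance for every candidate.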

While it would have been possible to use the Maximum Mean Discrepancy to validate the subsampling, its computational cost grows quadratically with the dataset size. Using the six dimensions of Biber serves as a simpler metric and ensures that the selected dataset is representative of the overall dataset for subsequent linguistic studies using the data. Additionally, we calculate Cohen's d between the marginal feature distributions of the selected subsample for the evaluation set and the full human distribution. For all datasets, the absolute average Cohen's d is <0.06, indicating very small effect sizes.
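For reference, the effect size check can be sketched as below. The pooled-standard-deviation form of Cohen's d is the common textbook definition; reading the "absolute average" as the mean of the absolute per-feature values is an assumption, and the data are synthetic.

```python
import numpy as np


def cohens_d(a, b):
    """Cohen's d per feature (column), using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    var_a = a.var(axis=0, ddof=1)
    var_b = b.var(axis=0, ddof=1)
    pooled = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (a.mean(axis=0) - b.mean(axis=0)) / pooled


# Synthetic full corpus (texts x 67 features) and a random subsample of it.
rng = np.random.default_rng(0)
full = rng.normal(size=(5000, 67))
subset = full[rng.choice(len(full), size=1000, replace=False)]

d = cohens_d(subset, full)
mean_abs_d = float(np.abs(d).mean())  # assumption: average of |d| over features
```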

## Appendix B Prompts used in experiments

The prompts used for all experiments within one register are the same, except for the ablation protocol explained next\. For the Qwen3 model we appended "\\nothink" to the prompt to avoid the model using its thinking mode\. The used prompts are shown in Tables[4](https://arxiv.org/html/2605.23651#A9.T4),[5](https://arxiv.org/html/2605.23651#A9.T5),[6](https://arxiv.org/html/2605.23651#A9.T6)and[7](https://arxiv.org/html/2605.23651#A9.T7)\.

To test the prompt stability, we use the following protocol\. We write four additional prompt templates where we vary the ordering of the instructions, tone, and phrasing of the instructions\. With this setup, we want to estimate the bias introduced through stylistic variations of the prompt\. The correlation between the different prompt variants across models is shown in Figure[8](https://arxiv.org/html/2605.23651#A9.F8)\.

## Appendix C Sample Size

The results for the subsampling stability of the Maximum Mean Discrepancy for all datasets are shown in Figure[7](https://arxiv.org/html/2605.23651#A9.F7)\.

## Appendix D Data preprocessing

All datasets used in this study were obtained from publicly available resources and processed into a unified format consisting of a main text field and associated metadata\.

Spoken conversation data is drawn from the BNC2014 Spoken corpus, which contains transcribed recordings of naturally occurring conversations with detailed speaker\- and context\-level metadata; access to the corpus can be requested at [http://corpora\.lancs\.ac\.uk/bnc2014/](http://corpora.lancs.ac.uk/bnc2014/) Love et al\. \([2022](https://arxiv.org/html/2605.23651#bib.bib23)\)\.

Academic writing is derived from the ACL Anthology using the S2ORC API, which is available through the Semantic Scholar API at [https://www\.semanticscholar\.org/product/api](https://www.semanticscholar.org/product/api); we extract ACL main conference papers published between 2009 and 2018 Lo et al\. \([2020](https://arxiv.org/html/2605.23651#bib.bib21)\)\. For each paper, we analyze the introduction section and use the abstract as document\-level metadata\.

Since all datasets except the BNC2014Spoken dataset are scraped from online sources, we employ a string cleaning method to ensure consistent encodings\. This involves replacing all line breaks, indentation and runs of whitespace with a single space\. Further, we remove all whitespace before punctuation and normalize all text into Unicode Normalization Form Compatibility Composition \(NFKC\)\.
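A minimal sketch of such a cleaning step, using only the Python standard library, could look like this. The set of punctuation marks, the order of the operations, and the final stripping of leading and trailing spaces are assumptions made for the example, not details taken from the published code.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode form, whitespace, and spacing before punctuation."""
    # NFKC = Unicode Normalization Form Compatibility Composition.
    text = unicodedata.normalize("NFKC", text)
    # Collapse line breaks, tabs, and repeated spaces into single spaces.
    # The strip() call is an extra that is not mentioned in the text.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop whitespace directly before a (hypothetical) set of punctuation marks.
    return re.sub(r"\s+([.,;:!?])", r"\1", text)
```

For instance, `clean_text("Hello ,\n\tworld  !")` yields `"Hello, world!"`.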

To count lexical tokens, we use the Python package spaCy and the "en\_core\_web\_sm" model with Named Entity Recognition disabled\. We count all tokens that are not punctuation or whitespace\. To get the sentence boundaries, we use the sentences detected by spaCy\.

For the following datasets, we employ special data preparation schemes:

Writing Prompts: Since the Writing Prompts dataset is scraped directly from a social media platform, texts can contain unusual formats or characters\. To catch these, we calculate the punctuation\-to\-word ratio of each text\. We manually check the texts with the highest ratio and, based on this inspection, define an exclusion threshold of 0\.2\. The excluded texts are published together with the results of this paper\.
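The filter can be illustrated with the following sketch. How the ratio is computed is not specified in this appendix, so the definition used here (ASCII punctuation characters divided by whitespace-separated words) and the decision to exclude only texts strictly above the threshold are assumptions.

```python
import string


def punctuation_word_ratio(text: str) -> float:
    """ASCII punctuation characters divided by whitespace-separated words."""
    words = text.split()
    if not words:
        return float("inf")  # no words at all: treat as maximally suspicious
    n_punct = sum(ch in string.punctuation for ch in text)
    return n_punct / len(words)


def split_by_ratio(texts, threshold=0.2):
    """Partition texts into (kept, excluded) around the ratio threshold."""
    kept = [t for t in texts if punctuation_word_ratio(t) <= threshold]
    excluded = [t for t in texts if punctuation_word_ratio(t) > threshold]
    return kept, excluded
```

Keeping the excluded texts as a separate list mirrors the paper's choice to publish them alongside the results.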

BNC2014Spoken: The BNC2014Spoken corpus is anonymized, meaning that all names, places and other personal information are excluded from the dataset\. While the dataset comes pre\-annotated with POS\-tags, we have to use the same spaCy model as for the other datasets, as a different POS\-model would introduce a bias\. Therefore, we replace all anonymized information where possible with matching pseudo information\. The most common pseudonymized information is names\. We therefore scrape the Wikipedia category pages for English male \([https://en\.wikipedia\.org/wiki/Category:English\_masculine\_given\_names](https://en.wikipedia.org/wiki/Category:English_masculine_given_names)\) and female \([https://en\.wikipedia\.org/wiki/Category:English\_feminine\_given\_names](https://en.wikipedia.org/wiki/Category:English_feminine_given_names)\) given names and insert them into the text according to the provided metadata\. For other, less frequent personal information, we use the Faker library \([https://faker\.readthedocs\.io/en/master/](https://faker.readthedocs.io/en/master/)\)\.

## Appendix E Classifier

To add further empirical evidence to the selection of the framework of Biber, we train a classifier to distinguish between human\-written and Large Language Model generated texts solely by their linguistic features\. Following the logic of Biber, registers are distinguishable by their co\-occurrence patterns of linguistic features Biber \([1988](https://arxiv.org/html/2605.23651#bib.bib5)\)\. As these co\-occurrence patterns are given as linear combinations of the features, we should be able to train a linear decision boundary to distinguish between the human and Large Language Model generated texts if the results of our study hold\.

We construct our dataset by taking, for each register, all human and Large Language Model examples and randomly subsampling the larger class to the size of the smaller class\. We train one classifier per register and one on the whole dataset\. We use an 80/20 train\-test split and fit a Logistic Regression model with 5\-fold cross\-validation to the data\.

The accuracy and ROC AUC are shown in Table [2](https://arxiv.org/html/2605.23651#A5.T2)\. For all registers, we achieve a ROC AUC \> 0\.9 on the test set\. This indicates that the framework of Biber gives a useful feature set to detect Large Language Model generated texts and that Large Language Model generated and human\-produced texts differ substantially under our tested generation setup\.

Table 2: Classification performance of the AI vs\. Human register classifier across datasets\.

## Appendix F Biber framework and MMD

While the framework of [Biber](https://arxiv.org/html/2605.23651#bib.bib5) is common in corpus linguistics, the use of the Maximum Mean Discrepancy is not\. Therefore, we want to highlight in the following why the Maximum Mean Discrepancy can be beneficial for register studies\.

Taking the Maximum Mean Discrepancy with an RBF\-kernel as our distance metric, according to Theorem 5 in Gretton et al\. \([2012](https://arxiv.org/html/2605.23651#bib.bib14)\) the Maximum Mean Discrepancy is 0 if and only if the underlying distributions are the same\. Thus, a Maximum Mean Discrepancy within the variability observed between humans indicates \(among higher order relationships\) that the underlying marginal feature distributions and co\-occurrence patterns are within what we would expect between two human samples\. In other words, if the observed Maximum Mean Discrepancy is within the variability we would expect between human samples, this indicates that an analysis of the two samples would yield roughly the same results as analyzing two human samples\.
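For reference, the squared Maximum Mean Discrepancy between a human feature distribution $P$ and a model feature distribution $Q$ can be written in its standard kernel form as below. The symbols $P$, $Q$ and $\sigma$ are introduced here for illustration, and the exact bandwidth parametrization of the RBF kernel is an assumption that may differ from the one used in the main text.

```latex
\mathrm{MMD}^2(P, Q)
  = \mathbb{E}_{x, x' \sim P}\,[k(x, x')]
  - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\,[k(x, y)]
  + \mathbb{E}_{y, y' \sim Q}\,[k(y, y')],
\qquad
k(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^2}{2\sigma^2}\right)
```

Because the kernel compares whole 67-dimensional feature vectors, the expectation terms are sensitive to joint structure and not only to the per-feature marginals, which is what allows the statistic to reflect co-occurrence patterns.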

## Appendix G Marginal Feature differences

To analyse the differences in the marginal feature distributions, we use the following plots\. Figures [14](https://arxiv.org/html/2605.23651#A9.F14), [15](https://arxiv.org/html/2605.23651#A9.F15), [16](https://arxiv.org/html/2605.23651#A9.F16), [17](https://arxiv.org/html/2605.23651#A9.F17) and [18](https://arxiv.org/html/2605.23651#A9.F18) show the mean of each feature for each model together with the human sample\. The heatmap highlights the difference to the human reference in units of the human standard deviation for this feature\. Further, in Figures [19](https://arxiv.org/html/2605.23651#A9.F19), [20](https://arxiv.org/html/2605.23651#A9.F20), [21](https://arxiv.org/html/2605.23651#A9.F21), [22](https://arxiv.org/html/2605.23651#A9.F22) and [23](https://arxiv.org/html/2605.23651#A9.F23) we show the Wasserstein distance of the marginal feature distribution to the human reference sample, both standardized with the full human sample\. The difference in means is a good way to visualize differences in a human\-interpretable way, while the Wasserstein distance is suitable to highlight differences in distributions\.

A comparison between a human text and three models for one writing prompt is shown in Table [3](https://arxiv.org/html/2605.23651#A7.T3)\. Annotated are the features discussed in Section [5\.6](https://arxiv.org/html/2605.23651#S5.SS6)\. The text of Gemma 12B contains two nominalizations and 10 attributive adjectives, compared to zero nominalizations and two attributive adjectives in the human text\. While the use of nouns and past tense is not too different from the human sample, all three generated samples use present participle clauses, while none occur in the human sample\.

Table 3: Qualitative comparison of human and LLM generations for narrative prompt number 295432 from the WritingPrompts dataset for the Zero\-Shot setting\. In the table, Nouns, Past tense, Attributive Adjectives, Present Participle Clauses, Nominalizations and Past Participle Clauses are annotated\.

## Appendix H Model details

A detailed overview of the differences between the employed model families is shown in Table [10](https://arxiv.org/html/2605.23651#A9.T10)\.

## Appendix I Using the Framework to Benchmark New Models

The experiment setup used in this work can be directly repeated to evaluate any new Large Language Model against the models reported in this study\. To do this, we publish the code used in this study, the prompt templates for each register, the per\-register human reference subsamples, the per\-register bandwidth values for the Maximum Mean Discrepancy, and the per\-register feature means and standard deviations used for standardization\. This appendix outlines the procedure for evaluating a new model\.

Procedure: Given a new model with parameters $\theta'$ that one wishes to evaluate on a register $r$:

1. Use the published prompt template $p_r(m_i)$ to sample a synthetic corpus $\tilde{C}_{r,\theta'} = \{(\tilde{t}_i, m_i)\}_{i=1}^{N_r}$ from the new model, conditioning on the same metadata items $m_i$ used in this study\.
2. Extract the 67 linguistic features of Biber \([1988](https://arxiv.org/html/2605.23651#bib.bib5)\) for every text in $\tilde{C}_{r,\theta'}$, mapping each text to a vector in $\mathbb{R}^{67}$\.
3. Standardize the resulting feature vectors using the per\-register means and standard deviations published with this study\.
4. Compute the Maximum Mean Discrepancy and confidence interval between the new model’s standardized synthetic corpus and the published human subsample for register $r$, using the published bandwidth parameter and an RBF kernel\.
5. Compare the resulting Maximum Mean Discrepancy value against the values reported in Figure [3](https://arxiv.org/html/2605.23651#S5.F3)\.
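Steps 3 to 5 of the procedure can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the published implementation: it uses the biased (V-statistic) estimator of the squared MMD, a percentile bootstrap for the confidence interval, the kernel parametrization $\exp(-\lVert x-y\rVert^2 / 2\sigma^2)$, and it assumes the human subsample is provided as raw features that are standardized with the same statistics. All function and argument names are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2(x, y, bandwidth):
    """Biased (V-statistic) estimate of the squared MMD."""
    return float(
        rbf_kernel(x, x, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
    )


def evaluate_model(model_feats, human_feats, mean, std, bandwidth,
                   n_boot=200, alpha=0.05, seed=0):
    """Standardize, compute MMD^2, and bootstrap a percentile interval."""
    # Step 3: standardize with the published per-register statistics
    # (features with std == 0 would need special handling).
    x = (model_feats - mean) / std
    y = (human_feats - mean) / std
    # Step 4: point estimate plus a resampling-based confidence interval.
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        xi = x[rng.integers(0, len(x), len(x))]
        yi = y[rng.integers(0, len(y), len(y))]
        boots.append(mmd2(xi, yi, bandwidth))
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return mmd2(x, y, bandwidth), (float(lo), float(hi))
```

The returned point estimate and interval would then be compared against the published values (step 5). Results are only comparable if the estimator and bootstrap scheme match those of the released code, so the sketch is meant to convey the structure of the computation rather than to reproduce reported numbers.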

Because both the human subsample and the bandwidth parameter for each register are fixed and published, the computed Maximum Mean Discrepancy values are directly comparable to the ones in this study\. The results in this paper were obtained with temperature 1 and top\-$p$ 1, i\.e\. sampling the full distribution induced by the model\. Other sampling parameters can be used to study the influence that different sampling parameters have on human\-likeness as studied here\.

If a researcher wishes to evaluate models on a register not covered by this study, both the human reference subsample and the bandwidth parameter must be re\-estimated for that register following the procedure in §3\.4 \(sample\-size selection\) and §3\.3 \(median\-heuristic bandwidth\)\. The code to do this is published with this paper\.
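As a rough illustration of a median-heuristic bandwidth for a new register, the sketch below takes the median pairwise Euclidean distance of a standardized feature matrix. Several variants of the heuristic exist (for example with or without squaring or rescaling), and this section does not state which texts enter the computation, so both the variant and the function name are assumptions; the published code should be treated as authoritative.

```python
import numpy as np


def median_heuristic_bandwidth(feats):
    """Median pairwise Euclidean distance between the rows of feats."""
    sq = np.sum(feats**2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T, 0.0)
    # Use each unordered pair once and skip the zero self-distances.
    upper = np.triu_indices(len(feats), k=1)
    return float(np.sqrt(np.median(sq_dists[upper])))
```

The full pairwise distance matrix needs memory quadratic in the number of texts, so for large corpora one would typically estimate the median on a random subset.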

Table 4: Prompt specification for the BNC2014Spoken corpus\. Curly\-braced expressions denote placeholders replaced at inference time\.

Table 5: Prompt specification for the S2ORC\_ACL corpus\. Curly\-braced expressions denote placeholders replaced at inference time\.

Table 6: Prompt specification for the wikiHow corpus\. Curly\-braced expressions denote placeholders replaced at inference time\.

Table 7: Prompt specification for the XSum corpus\. Curly\-braced expressions denote placeholders replaced at inference time\.

Table 8: Prompt ablation 1 for the BNC2014Spoken corpus\. Curly\-braced expressions denote placeholders replaced at inference time\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x6.png)\(a\)BNC2014Spoken
![Refer to caption](https://arxiv.org/html/2605.23651v1/x7.png)\(b\)S2ORC\_ACL
![Refer to caption](https://arxiv.org/html/2605.23651v1/x8.png)\(c\)wikiHow
![Refer to caption](https://arxiv.org/html/2605.23651v1/x9.png)\(d\)WritingPrompts
![Refer to caption](https://arxiv.org/html/2605.23651v1/x10.png)\(e\)XSum

Figure 7: $\mathrm{MMD}^2$ with bootstrapped confidence interval for different sample sizes on all datasets\. For BNC2014Spoken, the error increases because the dataset has only 1200 samples, so a sample size larger than 600 leads to one smaller and one larger subset\.

Table 9: Feature loadings from Biber \([1988](https://arxiv.org/html/2605.23651#bib.bib5)\)

| Factor | Feature | Loading |
| --- | --- | --- |
| Dimension 1 | private verbs | 0\.96 |
| | THAT deletion | 0\.91 |
| | contractions | 0\.90 |
| | present tense verbs | 0\.86 |
| | 2nd person pronouns | 0\.86 |
| | DO as pro\-verb | 0\.82 |
| | analytic negation | 0\.78 |
| | demonstrative pronouns | 0\.76 |
| | general emphatics | 0\.74 |
| | 1st person pronouns | 0\.74 |
| | pronoun it | 0\.71 |
| | BE as main verb | 0\.71 |
| | causative subordination | 0\.66 |
| | discourse particles | 0\.66 |
| | indefinite pronouns | 0\.62 |
| | general hedges | 0\.58 |
| | amplifiers | 0\.56 |
| | sentence relatives | 0\.55 |
| | WH questions | 0\.52 |
| | possibility modals | 0\.50 |
| | non\-phrasal coordination | 0\.48 |
| | WH clauses | 0\.47 |
| | final prepositions | 0\.43 |
| | nouns | \-0\.80 |
| | word length | \-0\.58 |
| | prepositions | \-0\.54 |
| | type/token ratio | \-0\.54 |
| | attributive adjectives | \-0\.47 |
| Dimension 2 | past tense verbs | 0\.90 |
| | third person pronouns | 0\.73 |
| | perfect aspect verbs | 0\.48 |
| | public verbs | 0\.43 |
| | synthetic negation | 0\.40 |
| | present participial clauses | 0\.39 |
| Dimension 3 | WH relative clauses on object positions | 0\.63 |
| | pied piping constructions | 0\.61 |
| | WH relative clauses on subject positions | 0\.45 |
| | phrasal coordination | 0\.36 |
| | nominalizations | 0\.36 |
| | time adverbials | \-0\.60 |
| | place adverbials | \-0\.49 |
| | adverbs | \-0\.46 |
| Dimension 4 | infinitives | 0\.76 |
| | prediction modals | 0\.54 |
| | suasive verbs | 0\.49 |
| | conditional subordination | 0\.47 |
| | necessity modals | 0\.46 |
| | split auxiliaries | 0\.44 |
| Dimension 5 | conjuncts | 0\.48 |
| | agentless passives | 0\.43 |
| | past participial clauses | 0\.42 |
| | BY\-passives | 0\.41 |
| | past participial WHIZ deletions | 0\.40 |
| | other adverbial subordinators | 0\.39 |
| Dimension 6 | THAT clauses as verb complements | 0\.56 |
| | demonstratives | 0\.55 |
| | That relative clause on object positions | 0\.46 |
| | That clauses as adjective complements | 0\.36 |
| Dimension 7 | SEEM / APPEAR | 0\.35 |

![Refer to caption](https://arxiv.org/html/2605.23651v1/figures/prompt_correlation_heatmap_BNC2014Spoken.png)

Figure 8: Correlation heatmap of the $\mathrm{MMD}^2$ between human and AI texts for BNC2014Spoken across different prompt variants in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x11.png)\(a\)Dimension 1
![Refer to caption](https://arxiv.org/html/2605.23651v1/x12.png)\(b\)Dimension 2
![Refer to caption](https://arxiv.org/html/2605.23651v1/x13.png)\(c\)Dimension 3
![Refer to caption](https://arxiv.org/html/2605.23651v1/x14.png)\(d\)Dimension 4
![Refer to caption](https://arxiv.org/html/2605.23651v1/x15.png)\(e\)Dimension 5
![Refer to caption](https://arxiv.org/html/2605.23651v1/x16.png)\(f\)Dimension 6

Figure 9: Human and model distributions for Biber dimensions in the Zero\-Shot setting \(BNC2014Spoken\)\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x17.png)\(a\)Dimension 1
![Refer to caption](https://arxiv.org/html/2605.23651v1/x18.png)\(b\)Dimension 2
![Refer to caption](https://arxiv.org/html/2605.23651v1/x19.png)\(c\)Dimension 3
![Refer to caption](https://arxiv.org/html/2605.23651v1/x20.png)\(d\)Dimension 4
![Refer to caption](https://arxiv.org/html/2605.23651v1/x21.png)\(e\)Dimension 5
![Refer to caption](https://arxiv.org/html/2605.23651v1/x22.png)\(f\)Dimension 6

Figure 10: Human and model distributions for Biber dimensions in the Zero\-Shot setting \(S2ORC\_ACL\)\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x23.png)\(a\)Dimension 1
![Refer to caption](https://arxiv.org/html/2605.23651v1/x24.png)\(b\)Dimension 2
![Refer to caption](https://arxiv.org/html/2605.23651v1/x25.png)\(c\)Dimension 3
![Refer to caption](https://arxiv.org/html/2605.23651v1/x26.png)\(d\)Dimension 4
![Refer to caption](https://arxiv.org/html/2605.23651v1/x27.png)\(e\)Dimension 5
![Refer to caption](https://arxiv.org/html/2605.23651v1/x28.png)\(f\)Dimension 6

Figure 11: Human and model distributions for Biber dimensions in the Zero\-Shot setting \(wikiHow\)\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x29.png)\(a\)Dimension 1
![Refer to caption](https://arxiv.org/html/2605.23651v1/x30.png)\(b\)Dimension 2
![Refer to caption](https://arxiv.org/html/2605.23651v1/x31.png)\(c\)Dimension 3
![Refer to caption](https://arxiv.org/html/2605.23651v1/x32.png)\(d\)Dimension 4
![Refer to caption](https://arxiv.org/html/2605.23651v1/x33.png)\(e\)Dimension 5
![Refer to caption](https://arxiv.org/html/2605.23651v1/x34.png)\(f\)Dimension 6

Figure 12: Human and model distributions for Biber dimensions in the Zero\-Shot setting \(WritingPrompts\)\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x35.png)\(a\)Dimension 1
![Refer to caption](https://arxiv.org/html/2605.23651v1/x36.png)\(b\)Dimension 2
![Refer to caption](https://arxiv.org/html/2605.23651v1/x37.png)\(c\)Dimension 3
![Refer to caption](https://arxiv.org/html/2605.23651v1/x38.png)\(d\)Dimension 4
![Refer to caption](https://arxiv.org/html/2605.23651v1/x39.png)\(e\)Dimension 5
![Refer to caption](https://arxiv.org/html/2605.23651v1/x40.png)\(f\)Dimension 6

Figure 13: Human and model distributions for Biber dimensions in the Zero\-Shot setting \(XSum\)\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x41.png)

Figure 14: Mean of the normalized linguistic features without standardization to the full human dataset, with the difference in standard deviation to the mean of the human reference sample for the BNC2014Spoken in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x42.png)

Figure 15: Mean of the normalized linguistic features without standardization to the full human dataset, with the difference in standard deviation to the mean of the human reference sample for the S2ORC\_ACL in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x43.png)

Figure 16: Mean of the normalized linguistic features without standardization to the full human dataset, with the difference in standard deviation to the mean of the human reference sample for the wikiHow in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x44.png)

Figure 17: Mean of the normalized linguistic features without standardization to the full human dataset, with the difference in standard deviation to the mean of the human reference sample for the WritingPrompts in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x45.png)

Figure 18: Mean of the normalized linguistic features without standardization to the full human dataset, with the difference in standard deviation to the mean of the human reference sample for the XSum in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x46.png)

Figure 19: Wasserstein distance for marginal feature distributions between model and human for BNC2014Spoken in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x47.png)

Figure 20: Wasserstein distance for marginal feature distributions between model and human for S2ORC\_ACL in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x48.png)

Figure 21: Wasserstein distance for marginal feature distributions between model and human for wikiHow in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x49.png)

Figure 22: Wasserstein distance for marginal feature distributions between model and human for WritingPrompts in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x50.png)

Figure 23: Wasserstein distance for marginal feature distributions between model and human for XSum in the Zero\-Shot setting\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x51.png)

Figure 24: Observed MMD distance between different models for BNC2014Spoken in the Zero\-Shot setting\. The MMD between Human\-Human is calculated as described in Section [5\.1](https://arxiv.org/html/2605.23651#S5.SS1)\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x52.png)

Figure 25: Observed MMD distance between different models for S2ORC\_ACL in the Zero\-Shot setting\. The MMD between Human\-Human is calculated as described in Section [5\.1](https://arxiv.org/html/2605.23651#S5.SS1)\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x53.png)

Figure 26: Observed MMD distance between different models for wikiHow in the Zero\-Shot setting\. The MMD between Human\-Human is calculated as described in Section [5\.1](https://arxiv.org/html/2605.23651#S5.SS1)\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x54.png)

Figure 27: Observed MMD distance between different models for WritingPrompts in the Zero\-Shot setting\. The MMD between Human\-Human is calculated as described in Section [5\.1](https://arxiv.org/html/2605.23651#S5.SS1)\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/x55.png)

Figure 28: Observed MMD distance between different models for XSum in the Zero\-Shot setting\. The MMD between Human\-Human is calculated as described in Section [5\.1](https://arxiv.org/html/2605.23651#S5.SS1)\.

![Refer to caption](https://arxiv.org/html/2605.23651v1/figures/TraceDispersion_ZeroShot_summary_5datasets_n600.png)

Figure 29: Sum of the variances of the 67 linguistic features after normalization on the corresponding full human distribution for each model in the Zero\-Shot setting\. The human mean is around 67, which is expected due to the normalization\.

Table 10: Model characteristics for all evaluated LLMs\. The following acronyms are used: Supervised Fine\-tuning \(SFT\), Quantile Reward Policy Optimization \(QRPO\), Direct Preference Optimization \(DPO\) and negative log\-likelihood \(NLL\)\.