More Aligned, Less Diverse? Analyzing the Grammar and Lexicon of Two Generations of LLMs

arXiv cs.CL Papers

Summary

This academic paper analyzes the syntactic and lexical diversity of two generations of LLMs compared to human-authored news text, finding that newer, aligned models exhibit reduced diversity.

arXiv:2605.06030v1 Announce Type: new Abstract: This study contributes to a growing line of research in comparing LLM-generated texts with human-authored text, in this case, English news text. We focus in particular on the evaluation of syntactic properties through formal grammar frameworks. Our analysis compares two generations of LLMs in the context of two human-authored English news datasets from two different years. Employing the Head-Driven Phrase Structure Grammar (HPSG) formalism, we investigate the distributions of syntactic structures and lexical types of AI-generated texts and contrast them with the corresponding distributions in the human-authored New York Times (NYT) articles. We use diversity metrics from ecology and information theory to quantify variation in grammatical constructions and lexical types. We show that English news text has changed little in the given time frame, while newer LLMs display reduced syntactic and, especially, lexical diversity compared to older, non-instruction-tuned models. These findings point to future work in studying effects of instruction tuning, which, while enhancing coherence and adherence to prompts, may narrow the expressive range of model output.
Original Article
View Cached Full Text

Cached at: 05/08/26, 07:05 AM

# More Aligned, Less Diverse? Analyzing the Grammar and Lexicon of Two Generations of LLMs
Source: [https://arxiv.org/html/2605.06030](https://arxiv.org/html/2605.06030)
Adrián Gude1Roi Santos\-Ríos1∗\\astFrancis Bond3Dan Flickinger2 Carlos Gómez\-Rodríguez1Olga Zamaraeva1 \{adrian\.lopez\.gude, roi\.santos\.rios, carlos\.gomez, olga\.zamaraeva\}@udc\.es danflick@alumni\.stanford\.edufrancis\.bond@upol\.cz 1Universidade da Coruña, CITIC2Independent Researcher3Palacký University, Olomouc

###### Abstract

This study contributes to a growing line of research in comparing LLM\-generated texts with human\-authored text, in this case, English news text\. We focus in particular on the evaluation of syntactic properties through formal grammar frameworks\. Our analysis compares two generations of LLMs in the context of two human\-authored English news datasets from two different years\. Employing the Head\-Driven Phrase Structure Grammar \(HPSG\) formalism, we investigate the distributions of syntactic structures and lexical types of AI\-generated texts and contrast them with the corresponding distributions in the human\-authored New York Times \(NYT\) articles\. We use diversity metrics from ecology and information theory to quantify variation in grammatical constructions and lexical types\. We show that English news text has changed little in the given time frame, while newer LLMs display reduced syntactic and, especially, lexical diversity compared to older, non\-instruction\-tuned models\. These findings point to future work in studying effects of instruction tuning, which, while enhancing coherence and adherence to prompts, may narrow the expressive range of model output\.

\\useforestlibrary

linguistics\\forestapplylibrarydefaultslinguistics

More Aligned, Less Diverse? Analyzing the Grammar and Lexicon of Two Generations of LLMs

Adrián Gude1Roi Santos\-Ríos1∗\\astFrancis Bond3Dan Flickinger2Carlos Gómez\-Rodríguez1Olga Zamaraeva1\{adrian\.lopez\.gude, roi\.santos\.rios, carlos\.gomez, olga\.zamaraeva\}@udc\.esdanflick@alumni\.stanford\.edufrancis\.bond@upol\.cz1Universidade da Coruña, CITIC2Independent Researcher3Palacký University, Olomouc

## 1Introduction

Large language models \(LLMs\) are increasingly compared to human writers across a growing range of linguistic and stylistic dimensions \(e\.g\.,Reinhartet al\.[2025](https://arxiv.org/html/2605.06030#bib.bib5); Moonet al\.[2025](https://arxiv.org/html/2605.06030#bib.bib6); Rashidet al\.[2024](https://arxiv.org/html/2605.06030#bib.bib7)\)\. However, it remains unclear how such comparisons should be made and which dimensions best capture the differences\. This limits our ability to draw robust conclusions about what makes human and machine writing distinct\.

In this paper,111Adrián Gude and Roi Santos\-Ríos contributed equally to this work\. Both authors should be regarded as joint first authors\.we take a step toward addressing this gap by examining diversity, both lexical and syntactic, as a consistently informative dimension for comparing human and LLM\-generated text\. Building on the formal grammatical framework of Head\-Driven Phrase Structure Grammar \(HPSG\) and the English Resource Grammar \(ERG\)\(Benderet al\.,[2002](https://arxiv.org/html/2605.06030#bib.bib47); Flickinger,[2011](https://arxiv.org/html/2605.06030#bib.bib27)\), we analyze variation in syntactic constructions and lexical types using diversity metrics drawn from ecology and information theory \(Shannon and Simpson indices;Magurran[2004](https://arxiv.org/html/2605.06030#bib.bib60); Stamatatos[2009](https://arxiv.org/html/2605.06030#bib.bib44)\)\.

In contrast to prior related work\(Zamaraevaet al\.,[2025](https://arxiv.org/html/2605.06030#bib.bib4); Muñoz\-Ortizet al\.,[2024](https://arxiv.org/html/2605.06030#bib.bib2)\), we compare two generations of LLMs and human\-authored news writing\. For the human side, we analyze New York Times \(NYT\) lead paragraphs from two distinct periods \(2023 and 2025\)\. For the LLM side, we compare a suite of base models trained prior to 2023 \(LLaMA, Mistral, Falcon\) with newer instruction\-tuned models trained up to 2024 \(Qwen 2\.5, Mistral 7B v0\.3, GPT\-4o, LLaMa 3\.3\)\.

Our results show that English news text remains stable across time in all diversity metrics, suggesting a consistent balance of grammatical and lexical variety in professional news prose\. In contrast, in LLM\-generated text, we observe that both syntactic and lexical diversity decline substantially in newer, instruction\-tuned models, with the effect being especially pronounced for lexical diversity\. In addition, the newer LLM texts are easier to parse and take less computer memory to do so\. Our findings suggest that, while instruction tuning is designed to improve the helpfulness and coherence of responses to natural\-language prompts\(Ouyanget al\.,[2022](https://arxiv.org/html/2605.06030#bib.bib62)\), it has the side effect of reducing the syntactic and lexical breadth of the outputs\. Instruction tuned models generate outputs that are stylistically narrower and less varied than both humans and, interestingly, than earlier base models\.

Overall, our study highlights diversity metrics as a robust, linguistically grounded way to track stylistic and grammatical shifts across model generations, shedding light on how current tuning paradigms may trade off lexical variety for stylistic control\.222Code available at:[https://github\.com/olzama/llm\-syntax/](https://github.com/olzama/llm-syntax/)

## 2Related work

Muñoz\-Ortizet al\.\([2024](https://arxiv.org/html/2605.06030#bib.bib2)\)conducted a large\-scale quantitative analysis contrasting texts generated by base \(non\-instruction tuned\) LLMs with human\-written news texts\. Their results revealed differences across multiple linguistic dimensions, including morphological, syntactic, psychometric, and sociolinguistic aspects\. These findings established a detailed baseline highlighting how human linguistic patterns remain more diverse and less homogenized compared to model\-generated text\.

Zamaraevaet al\.\([2025](https://arxiv.org/html/2605.06030#bib.bib4)\)further extended this line of research by conducting a comparison on the same data using Head\-Driven Phrase Structure Grammar \(HPSG\) and the English Resource Grammar \(ERG\), providing a fine\-grained perspective within an independent linguistic\-theoretic framework\. Their study showed that human\-authored texts had greater grammatical but*less*lexical diversity than the LLM texts, and that human writers differed from each other more than each differed with respect to any LLM \(in other words, that LLMs act as an “average” writer\)\.

So far, little research has been done on the effects of instruction tuning and reinforcement learning from human feedback \(RLHF\) on LLMs’ grammatical diversity\.Padmakumar and He \([2024](https://arxiv.org/html/2605.06030#bib.bib56)\)found that RLHF affected vocabulary type/token ratios of LLMs across LLMs, leading to more homogeneous texts\. Another evaluation of the different stages of RLHF training showed that although RLHF improves out\-of\-distribution generalization compared to supervised fine\-tuning, it significantly reduces output diversity measured through a combination of N\-gram counting, semantic cosine similarity, and natural language inference metrics, revealing a tradeoff between adaptability and linguistic variety\(Kirket al\.,[2024](https://arxiv.org/html/2605.06030#bib.bib57); Shypulaet al\.,[2025](https://arxiv.org/html/2605.06030#bib.bib1)\)\. To summarize, prior research suggests that, while modern LLMs have improved in fluency and instruction\-following capabilities, these advances may come at the cost of lexical and stylistic diversity even in the new families of LLMs\. Human texts, by contrast, still exhibit greater variety and more complex linguistic structures\.

Our study contributes to this idea in several key ways\. First, while previous work has compared a single generation of LLMs to human text or analyzed the effects of RLHF in isolation, we conduct a direct, diachronic comparison between two generations of LLMs: older base models and newer, instruction\-tuned ones\. Second, we mirror this generational approach on the human side by contrasting these models against human\-authored news texts from two corresponding time periods\. Furthermore, we use a linguistic\-theoretic grammar framework, namely HPSG, to provide a way of analyzing syntactic and lexical properties of language that is independent from natural language processing tasks and thus should be more robust/generalizable\. This framework allows us to look at lexical distributions in a systematic manner, beyond the surface information that vocabulary counts provide\. HPSG lexical types are complex representations of word types that specify aspects of their syntactic behavior\. We are not aware of a similar resource within, e\.g\., the UD framework; POS distinctions are too coarse\. Last but not least, we employ independent diversity metrics from ecology\.

## 3Methodology

We use the HPSG\-based framework utilized e\.g\. byZamaraevaet al\.\([2025](https://arxiv.org/html/2605.06030#bib.bib4)\)and focus on diversity metrics from ecology and information theory to quantify changes in grammatical and lexical variability over time\.

### 3\.1English Resource Grammar

The goal of this study is to provide an insight into how LLMs change over time with respect to the grammatical properties of their writing\. For this purpose, it is not enough to look at just the vocabulary, and, while looking at dependency structures like Universal Dependencies\(Nivreet al\.,[2016](https://arxiv.org/html/2605.06030#bib.bib34)\), asMuñoz\-Ortizet al\.\([2024](https://arxiv.org/html/2605.06030#bib.bib2)\)did is useful, we are interested in a comparison within a framework that is rooted in a formal linguistic theory and not inherently biased towards performance on NLP tasks\. For this reason, we choose the DELPH\-IN HPSG framework, followingZamaraevaet al\.\([2025](https://arxiv.org/html/2605.06030#bib.bib4)\)\. Head\-driven Phrase Structure Grammar\(HPSG; Pollard and Sag,[1994](https://arxiv.org/html/2605.06030#bib.bib33)\)is a theory of syntax that was developed by linguists theoretically and with empirical validation in mind, which is why the theory is associated with fully explicit formalisms that can be fully implemented on the computer, allowing for rigorous validation of the theoretical claims made about syntax\. DELPH\-IN333[https://delph\-in\.github\.io/docs/home/Home/](https://delph-in.github.io/docs/home/Home/)is one such formalism that matured into a long\-term international grammar engineering effort\. In particular, the English Resource Grammar\(ERG; Flickinger,[2000](https://arxiv.org/html/2605.06030#bib.bib26),[2011](https://arxiv.org/html/2605.06030#bib.bib27)\)has been in continuous development with regular releases,444[https://github\.com/delph\-in/erg/releases/tag/2025](https://github.com/delph-in/erg/releases/tag/2025)reaching 94% average accuracy over a variety of English corpora\. Importantly and in contrast to statistical parsers, the ERG is designed to provide structure for*all possible well\-formed*English sentences and to*reject*strings that do not correspond to well\-formed English utterances\. This is crucial for our methodology, because we want to be able to compare LLMs between each other and to human writers with respect to*rare*linguistic phenomena, not only the frequent ones, and also with respect to sentences which were*not*parsed by the grammar for some reason that can be informative\. The second property of the HPSG/ERG that is important to us is its structure: the grammar is represented as a clear hierarchy of*syntactic and lexical types*\. The type definitions are detailed HPSG structures which specify sets of constraints which make a construction possible \(such as: “this is a verb and it requires two complements of which one is a noun”, to provide a simplistic example\)\. Syntactic type definitions are used at parse time to “license” phrases bottom\-up, until a complete sentence spanning the whole input string is built \(such as noun phrases, verb phrases…\), whereas lexical types are used by the parser only at the beginning of the parsing but provide very rich information about the specific constraints that need to be met\. In this sense, lexical types are drastically different from vocabulary items which are just surface strings representing words\. Finally, grammar\-based exhaustive chart parsing provides us with a window into the sentences’*parsability*: how easy or difficult is it to parse a particular sentence, as a proxy measure of how complex it is\.

### 3\.2Portability

While our experiments focus on English, the grammar\-based approach itself can be used with any language \(notably, also with low\-resource languages\)\. The HPSG framework is based on general linguistic theory and is not specific to English, and the tools used to develop these grammars are likewise language\-independent\. At the same time, each implemented grammar necessarily includes language\-specific layers, which requires expert effort to develop\. As a result, existing resources differ in their size and coverage, with English currently having the most mature and broad\-coverage grammars\. However, the required investment is comparable to that of training machine learning systems, which depend on significant data and computational resources that may not be equally available across languages or domains\. We therefore see the methodology as broadly applicable, with its extension to other languages primarily dependent on the continued development of high\-quality grammatical resources\. The same applies across different genres, as once an adequate grammar is available for a language, the approach can be readily transferred to non\-news domains with not much additional effort\.

### 3\.3Diversity metrics

We use two diversity measures, the Shannon\-Wiener diversity index \(H, or Shannon Index\), and the Simpson Diversity Index\(Magurran,[2004](https://arxiv.org/html/2605.06030#bib.bib60)\)\.

Shannon:H′=−∑i=1Spi​ln⁡\(pi\)H^\{\\prime\}=\-\\sum\_\{i=1\}^\{S\}p\_\{i\}\\ln\(p\_\{i\}\)

Simpson:D=1−∑i=1Spi2D=1\-\\sum\_\{i=1\}^\{S\}p\_\{i\}^\{2\}

The Shannon Index is widely used as an ecological measure of species diversity\(Spellerberg and Fedor,[2003](https://arxiv.org/html/2605.06030#bib.bib58)\)\. It considers both the number of species \(richness\) and the evenness of their distribution, meaning a higher H value indicates greater diversity\. The index is the same as the Shannon Entropy, and quantifies the uncertainty or "surprise" of predicting the next species in a community\.

The Simpson Index\(Simpson,[1949](https://arxiv.org/html/2605.06030#bib.bib59)\)measures the probability that two individuals \(or tokens\) randomly selected from a sample will belong to different categories \(e\.g\., species, construction, lexical type, …\)\. It is less sensitive to low frequency phenomena than the Shannon Index\.

Table 1:Datasets: reproduced in full from Table 1 inMuñoz\-Ortizet al\.[2024](https://arxiv.org/html/2605.06030#bib.bib2), alongside the experiments done inZamaraevaet al\.[2025](https://arxiv.org/html/2605.06030#bib.bib4)\.Dataset\# Sent\. in datasetModel sizeTraining tokensData sourcesLLaMa37,8257B1TNot disclosed37,80013B1T37,56830B1\.5T38,10765B1\.5TFalconRefinedWeb\-English \(76%\), RefinedWeb\-Euro \(8%\),27,7697B1\.5TGutenberg \(6%\), Conversations \(5%\)GitHub \(3%\), Technical \(2%\)Mistral35,0867BNot disclosedNot disclosedOriginal NYT26,102N/AN/ANew York Times Archive, Oct\. 1, 2023 \- Jan\. 24, 2024Redwoods \(WSJ\)43,043N/AN/AWall Street Journal sections 1\-21Redwoods \(Wikipedia\)10,726N/AN/AWikipedia

Table 2:Datasets contributed with this paper\.Dataset\# Sent\. in datasetModel sizeTraining tokensData sourcesQwen 2\.537,82514B18T26,49832B18TNot disclosed34,89272B18TLlaMa 3\.3Not disclosed39,30670B15T\+Mistral v\.0\.333,8407B1TNot disclosedGPT\-4o50,544Not disclosed13TNot disclosedOriginal NYT26,102N/AN/ANew York Times Archive, Feb\. 1, 2025 \- May\. 31, 2025

## 4Data and generative models

To study the differences between LLM\-generated and human\-authored news texts, we construct a new dataset similar in structure to the one used inMuñoz\-Ortizet al\.\([2024](https://arxiv.org/html/2605.06030#bib.bib2)\)andZamaraevaet al\.\([2025](https://arxiv.org/html/2605.06030#bib.bib4)\)\. We use New York Times \(NYT\) lead paragraphs from February to May 2025\. We assume that the LLMs under investigation, all trained on data up to 2024, could not have encountered these human\-authored articles in training\.

system\_prompt:"Youareaprofessionaljournalistspecializinginwritingnews\.Followthegivenstructure\."

user\_prompt:"Youwillwriteanewsleadparagraphusingtheinputsbelow\.

Inputs

Headline:\{headline\}

LeadThreeWords:\{lead\_three\_words\}

Requirements\-Mandatory

Writeoneparagraphofseveralsentences\(morethanone,e\.g\.two\-three\(2\-3\)\);notitle,nobullets\.

Outputformat:theparagraphonly,nopreambleorlabels\."

Figure 1:Prompts used to generate news lead paragraphs from LLMs \(system and user prompts\)\.Our new NYT dataset mirrors the structure ofMuñoz\-Ortizet al\.[2024](https://arxiv.org/html/2605.06030#bib.bib2)\. Human\-authored texts consist of lead paragraphs downloaded from the NYT Archive API\.555[https://developer\.nytimes\.com/docs/archive\-product/1/overview](https://developer.nytimes.com/docs/archive-product/1/overview)For each headline, we prompted our set of LLMs with the headline and the first three words of the lead paragraph to generate synthetic leads\. WhereasMuñoz\-Ortizet al\.\([2024](https://arxiv.org/html/2605.06030#bib.bib2)\)used base models that could be prompted directly for text completion, the instruction\-tuned models need a more explicit prompt in the form of instructions telling them to complete the paragraph, which is shown in Figure[1](https://arxiv.org/html/2605.06030#S4.F1)\. The analyses we present refer exclusively to the human\-written and LLM\-generated lead paragraphs

The LLMs inMuñoz\-Ortizet al\.\([2024](https://arxiv.org/html/2605.06030#bib.bib2)\)included earlier systems \(LLaMA, Falcon 7B, and Mistral 7B, all released prior to October 2023\), while in this study we extend the design to more recent models: Qwen 2\.5 \(14B, 32B, 72B\), LLaMA 3\.3 \(70B\), Mistral 7B v0\.3, and GPT\-4o\. FollowingMuñoz\-Ortizet al\.\([2024](https://arxiv.org/html/2605.06030#bib.bib2)\), we continue to distinguish scaling effects within a single architecture \(e\.g\. different Qwen sizes\) from differences due to model families\.666As a total, we performed 3 initial text generations with Qwen 2\.5 32B to calibrate the outputs of the LLMs, then one execution per model, and lastly, another execution per model with the latest prompt, disclosed in Figure[1](https://arxiv.org/html/2605.06030#S4.F1)\. The total number of executions amounts to 17\.

Dataset and model properties are summarized in Tables[1](https://arxiv.org/html/2605.06030#S3.T1)\-[2](https://arxiv.org/html/2605.06030#S3.T2)\. The hyperparameters used for all models are the following: temperature: 0\.7, top\_p: 0\.92, top\_k: 50, repetition\_penalty: 1\.05, max\_new\_tokens: 1000, num\_return\_sequences: 1, num\_beams: 1\. The average sentence length is in the range of 18\-20 tokens for 2023 LLMs, while the newer 2025 models generate around 22\-29 tokens, as shown in Table[7](https://arxiv.org/html/2605.06030#A1.T7)\.

### 4\.1Scope

We deliberately restrict our experimental scope to a controlled news genre when comparing linguistic properties of LLM\-generated and human\-authored texts\. This design choice follows prior work\(Zamaraevaet al\.,[2025](https://arxiv.org/html/2605.06030#bib.bib4); Muñoz\-Ortizet al\.,[2024](https://arxiv.org/html/2605.06030#bib.bib2)\)and reflects both methodological and conceptual considerations\. First, genre may exert an influence on linguistic structure, outweighing individual author effectsBiber \([1991](https://arxiv.org/html/2605.06030#bib.bib19)\); limiting genre variation therefore reduces confounds and allows clearer attribution of observed differences to generation source rather than discourse conventions\. Second, focusing on one genre enables more reliable measurement of fine\-grained linguistic properties, which may otherwise be obscured by cross\-genre heterogeneity\. Finally, while large\-scale data generation across multiple genres would be desirable, it is computationally and financially costly, and may incentivize breadth at the expense of analytical depth\. Our scope prioritizes internal validity and interpretability, providing a principled basis for future work to test the generality of these findings across genres and domains\.

## 5Results

In the following subsections we compare the grammar distributions in the human\-authored and the LLM\-generated datasets\. The distributions were obtained with the English Resource Grammar \(§[3\.1](https://arxiv.org/html/2605.06030#S3.SS1)\)\. We provide a list of the most distinctive syntactic and lexical types, as well as examples of where they are found in the data\. However, the main point is that the datasets produced by people and by older and newer LLMs can be distinguished at the level of grammatical typesdistributions, taken as a statistical snapshot\. Examples are meant to be illustrative but not necessarily explanatory\. Any dataset can contain any instance of any syntactic or lexical type\.

Table 3:Syntactic constructions contributing the most to statistical differences between 2025 LLMs and English news text\.Table 4:Lexical types contributing the most to statistical differences between 2025 LLMs and English news textTable 5:Syntactic constructions contributing the most to the statistical differences between the 2023 and 2025 LLMs\.Table 6:Lexical types contributing the most to the statistical differences between the 2023 and 2025 LLMs\.### 5\.1Syntactic types: Humans and LLMs

![Refer to caption](https://arxiv.org/html/2605.06030v1/img/diversity-constr-shannon.png)

Figure 2:Syntactic construction diversity measured using the Shannon Index\. Higher values indicate a more varied distribution of syntactic constructions\. On the Y\-axis, each point corresponds to a model name\.
![Refer to caption](https://arxiv.org/html/2605.06030v1/img/diversity-constr-simpson.png)

Figure 3:Syntactic construction diversity measured using the Simpson Index\. Higher values indicate a more varied distribution of syntactic constructions\. On the Y\-axis, each point corresponds to a model name\.

![Refer to caption](https://arxiv.org/html/2605.06030v1/img/diversity-lextype-shannon.png)

Figure 4:Lexical type diversity measured using the Shannon Index\. Higher values indicate a more varied distribution of lexical constructions\. On the Y\-axis, each point corresponds to a model name\.
![Refer to caption](https://arxiv.org/html/2605.06030v1/img/diversity-lextype-simpson.png)

Figure 5:Lexical type diversity measured using the Simpson Index\. Higher values indicate a more varied distribution of lexical constructions\. On the Y\-axis, each point corresponds to a model name\.

Table[3](https://arxiv.org/html/2605.06030#S5.T3)shows that, compared to LLM\-generated texts, human\-authored English news text makes frequent use of constructions that help bind text to concrete events, locations, and temporal frames\. In particular, human\-authored news texts show higher use of clause\-embedding and attribution structures, a tendency that correlates with reportive verbs requiring subordinate clauses \(such as ‘said’ in ‘Critics said…’, Table[4](https://arxiv.org/html/2605.06030#S5.T4)\)\.

Figure[3](https://arxiv.org/html/2605.06030#S5.F3)shows that humans are clearly more diverse in their use of syntactic constructions than all LLMs\.777Shannon indices were computed using maximum bootstrap = 10,000In the case of the human texts, NYT\-authored texts retrieved from 2025 are slightly less varied than the 2023 texts, but very close \(seen also in terms of the Simpson index; Figure[3](https://arxiv.org/html/2605.06030#S5.F3)\)\.

### 5\.2Syntactic types: 2023 and 2025 LLMs

Table[5](https://arxiv.org/html/2605.06030#S5.T5)gives some examples of the distributional differences between the LLMs from 2023 and the LLMs from 2025\. Notably, the newer 2025 LLMs arenotmore diverse than 2023 models; in fact, they form alower\-diversity band, despite being larger and trained on more recent data\. This appears to be the result of the newer LLMs avoiding specifics such as names, dates, measures, etc\., which influences not only lexical but also syntactic distribution\. In this sense, older models were more like human writers\.

In 2023, the constructions that contribute the most to the differences in diversity in comparison with 2025 are adjective\-headed phrases and bare noun phrases \(Table[5](https://arxiv.org/html/2605.06030#S5.T5)\)\. In contrast, the 2025 models favor constructions that involve modification and coordination\. These include noun phrases with modifiers, participial subordinate phrases, and coordinated noun structures\. All of these add volume to the output but not necessarily content\. Although bare noun phrases \(common in the use of proper names\) are present, they are less characteristic of the overall distribution, ranking 4th instead of 2nd \(Table[5](https://arxiv.org/html/2605.06030#S5.T5)\)\. Overall, these results suggest that newer LLMs are more careful about outputting factually incorrect information, which, curiously, can be detected at the level of syntax\.

### 5\.3Lexical types: Humans and LLMs

Figure[5](https://arxiv.org/html/2605.06030#S5.F5)shows an interesting difference with respect to lexical types\. While syntactically, human authors of English news were clearly the most diverse, when it comes to lexical types, LLM outputs from 2023 show the highest diversity, surpassing all other corpora\. Human\-authored English news texts sit in the middle, and LLMs from 2025 rank the lowest\. This indicates that, despite being larger and trained on more recent corpora, the newer systems employ a lesser variety of vocabulary groupings characterized by certain syntactic behavior \(an example of such a group would be mass nouns, or clause\-embedding verbs, etc\)\. This surprising result calls for further investigation, possibly in the dimension of training paradigms, post\-training and alignment\. The same ranking is observed with Simpson indices \(Figure[5](https://arxiv.org/html/2605.06030#S5.F5)\)\. Humans, meanwhile, remain a stable reference, showing similar lexical type diversity as before, with a particularly distinctive use of proper names \(Table[4](https://arxiv.org/html/2605.06030#S5.T4)\)\.

### 5\.4Lexical types: 2023 and 2025 LLMs

The lexical\-type ranking in Table[6](https://arxiv.org/html/2605.06030#S5.T6)shows the change in the lexical types contributing the most to the diversity of the LLMs’ distributions between 2023 and 2025\. In 2023, the lexical types contributing the most to the diversity are utterance particles and personal pronouns\. This suggests that the models’ output is oriented toward conversational framing and speaker reference, with less emphasis on factual content\. By contrast, 2025 models have a distinctive distribution of the very ‘basic’ lexical types: adjectives, common nouns, and transitive verbs\. This may be a feature of a generic style that avoids specifics\.

### 5\.5Punctuation

When we split the lexical types into punctuation and non\-punctuation, in Figures[7](https://arxiv.org/html/2605.06030#S5.F7)and[9](https://arxiv.org/html/2605.06030#S5.F9)with Shannon index and Figures[7](https://arxiv.org/html/2605.06030#S5.F7)and[9](https://arxiv.org/html/2605.06030#S5.F9)with Simpson index, we see a very clear distinction: the 2025 models are much less diverse in their usage of punctuation\.

![Refer to caption](https://arxiv.org/html/2605.06030v1/img/diversity-lextype_punct-shannon.png)Figure 6:Lexical type diversity measured using the Shannon Index considering only punctuation\. On Y\-axis, each point corresponds to a model name\.
![Refer to caption](https://arxiv.org/html/2605.06030v1/img/diversity-lextype_punct-simpson.png)Figure 7:Lexical type diversity measured using the Simpson Index considering only punctuation\. On Y\-axis, each point corresponds to a model name\.

![Refer to caption](https://arxiv.org/html/2605.06030v1/img/diversity-lextype_xpunct-shannon.png)Figure 8:Lexical type diversity measured using the Shannon Index excluding punctuation\. On Y\-axis, each point corresponds to a model name\.
![Refer to caption](https://arxiv.org/html/2605.06030v1/img/diversity-lextype_xpunct-simpson.png)Figure 9:Lexical type diversity measured using the Simpson Index excluding punctuation\. On Y\-axis, each point corresponds to a model name\.

Some anomalies are seen in GPT\-4o: it generates very few em\-dashes or semicolons, even though they are common in journalists’ writing and they have been described as common in LLM\-generated text\(Srivastava,[2025](https://arxiv.org/html/2605.06030#bib.bib61)\)\. It also generates very few quotations \(correlating with its diminished use of the reportive verbs such as ‘said’\), whereas the rest of the models have shown no special behaviors like these\. For example, the WSJ has many sentences like this:“In Asia, as in Europe, a new order is taking shape,” Mr\. Baker said\., but these are very rare in the output from GPT\-4o\. We hypothesize that this is a response to post\-training aimed at reducing hallucinations and non\-factual output\. Still, this is a significant difference with what we expect from newspaper text\.

### 5\.6Text length and parsability

The length of the generated texts is also different between the 2023 and 2025 models\. More specifically, the 2025 models consistently generate longer sentences than both human\-authored English news sentences and the sentences produced by 2023 models\. In 2023, human sentences were about 10–20% longer than LLM sentences\. In contrast, newer models produce sentences that are 15–30% longer than sentences written by human authors\. The 2025 systems also reduce short sentences \(1–15 tokens\) by factors of 9 to 30 and cut non\-sentence fragments by a factor of five\. Despite this, the ERG requiresfewerresources \(time and space\) to parse the sentences generated by 2025 models\. Despite generating longer sentences, newer LLMs do not appear to create more complex structures\. Instead, they are easier for the ERG to parse \(recall the ERG is a deterministic, exhaustive search chart parser, which can run out of time or space to parse a sentence\)\. Human\-authored English news sentences from both 2023 and 2025 reach about 94–95% parse success, while LLM texts exceed 96% and approach 99% for the newest models\. Even with substantial length increases, 2025 LLM sentences remain highly parsable, reflecting structural regularity that aligns with the best\-understood and streamlined parts of the grammar, unlike human\-authored English news text, which triggers phenomena that may be less understood or objectively challenging to parse\. This means that the that the text generated by the newer instruction\-tuned LLMs is easier for the ERG to parse than both earlier LLM outputs and English human\-authored news text\. This finding motivates future work exploring how specific grammatical structures correlate with parsability\. More details about time and space required to parse each dataset are Appendix[A](https://arxiv.org/html/2605.06030#A1)\.

## 6Conclusion

In this work, we compared two generations of LLMs with two temporal samples of human\-authored news writing, studying their syntactic structure within a formal linguistic framework\. By applying diversity metrics from ecology and information theory to distributions of grammar constructions in LLM\-generated and human\-authored writing, we provided an interpretable view of how model\-generated text changed over time relative to human language production\. Our findings show that while English news text remains stable in both syntactic and lexical diversity, LLMs exhibit a shift: newer instruction\-tuned systems produce text that is syntactically and lexically less diverse\.

Despite producing longer sentences compared to human writers and to older LLMs, the 2025 instruction\-tuned LLMs display reduced lexical variety and lower constructional diversity\. The newer models also generate text that is easier to parse, suggesting increased predictability\. Together, these trends indicate that the more recent instruction\-tuned LLMs are constrained to a less expressive language space, yielding outputs that are more formulaic and less varied than English news text\.

Our results confirm with two independent frameworks — ecology diversity metrics combined with a linguistic\-theoretic account of grammar — that larger and newer models are not closer to human linguistic behavior, at least in the domain of professional news writing\. Instead, even though their fluency seems to have improved, they show a growing divergence in lexical and grammatical diversity from humans\. Future work could examine how the instruction\-tuning of the models affects their ability to generate more diverse texts, and if with the right procedures, we could enhance these linguistic capabilities that they seem to have traded off\.

## Limitations

There are some limitations in terms of methodology based on the nature of the study\. First, working with LLMs, which are non\-deterministic models, introduces variability in the generated outputs, as results depend strongly on both the prompt design and the specific model used\.

Second, this study focuses primarily on NYT\-style news articles, which do not fully represent the broader spectrum of writing styles\.

Third, the analysis of more recent LLMs is constrained by hardware limitations\. Due to the large size of some of the models considered, quantization is applied during the inference phase when generating synthetic data\. Specifically, we apply 4\-bit quantization to the largest models \(LLaMA 65B and Qwen 2\.5 72B\)\. In addition, LLaMA 65B is also evaluated under 8\-bit quantization, while the remaining models are used without quantization\.

Finally, another limitation is the scarcity of large\-scale HPSG grammars\. Currently, only a few languages have wide\-coverage implementations, and among them the English Resource Grammar is the only one extensive enough to parse roughly 94% of news text\. Consequently, the present study is necessarily restricted to English\.

## Acknowledgments

We acknowledge grants GAP \(PID2022\-139308OA\-I00\) funded by MICIU/AEI/10\.13039/501100011033/ and ERDF, EU; LATCHING \(PID2023\-147129OB\-C21\) funded by MICIU/AEI/10\.13039/501100011033 and ERDF, EU; and TSI\-100925\-2023\-1 funded by Ministry for Digital Transformation and Civil Service and “NextGenerationEU” PRTR; as well as funding by Xunta de Galicia \(ED431C 2024/02\)\.

CITIC, as a center accredited for excellence within the Galician University System and a member of the CIGUS Network, receives subsidies from the Department of Education, Science, Universities, and Vocational Training of the Xunta de Galicia\. Additionally, it is co\-financed by the EU through the FEDER Galicia 2021\-27 operational program \(Ref\. ED431G 2023/01\)\.

We have used ChatGPT and Gemini for minor copy\-editing \(e\.g\. thesaurus suggestions\) and for visualization ideas\. We have used GitHub copilot for code autocompletion and Claude Code for final refactoring\.

## References

- The Grammar Matrix: An open\-source starter\-kit for the rapid development of cross\-linguistically consistent broad\-coverage precision grammars\.InProceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics,J\. Carroll, N\. Oostdijk, and R\. Sutcliffe \(Eds\.\),Taipei,pp\. 8–14\.Cited by:[§1](https://arxiv.org/html/2605.06030#S1.p2.1)\.
- D\. Biber \(1991\)Variation across speech and writing\.Cambridge University Press\.Cited by:[§4\.1](https://arxiv.org/html/2605.06030#S4.SS1.p1.1)\.
- D\. Flickinger \(2000\)On building a more efficient grammar by exploiting types\.Natural Language Engineering6\(01\),pp\. 15–28\.Cited by:[§3\.1](https://arxiv.org/html/2605.06030#S3.SS1.p1.1)\.
- D\. Flickinger \(2011\)Accuracy v\. robustness in grammar engineering\.InLanguage from a Cognitive Perspective: Grammar, Usage and Processing,E\. M\. Bender and J\. E\. Arnold \(Eds\.\),pp\. 31–50\.Cited by:[§1](https://arxiv.org/html/2605.06030#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.06030#S3.SS1.p1.1)\.
- R\. Kirk, I\. Mediratta, C\. Nalmpantis, J\. Luketina, E\. Hambro, E\. Grefenstette, and R\. Raileanu \(2024\)Understanding the effects of rlhf on llm generalisation and diversity\.External Links:2310\.06452,[Link](https://arxiv.org/abs/2310.06452)Cited by:[§2](https://arxiv.org/html/2605.06030#S2.p3.1)\.
- A\. E\. Magurran \(2004\)Measuring biological diversity\.Blackwell Publishing,Oxford\.External Links:ISBN 978\-0\-632\-05633\-0Cited by:[§1](https://arxiv.org/html/2605.06030#S1.p2.1),[§3\.3](https://arxiv.org/html/2605.06030#S3.SS3.p1.1)\.
- K\. Moon, A\. E\. Green, and K\. Kushlev \(2025\)Homogenizing effect of large language models \(llms\) on creative diversity: an empirical comparison of human and chatgpt writing\.Computers in Human Behavior: Artificial Humans,pp\. 100207\.Cited by:[§1](https://arxiv.org/html/2605.06030#S1.p1.1)\.
- A\. Muñoz\-Ortiz, C\. Gómez\-Rodríguez, and D\. Vilares \(2024\)Contrasting linguistic patterns in human and LLM\-generated news text\.Artificial Intelligence Review57\(10\),pp\. 265\.Cited by:[§1](https://arxiv.org/html/2605.06030#S1.p3.1),[§2](https://arxiv.org/html/2605.06030#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.06030#S3.SS1.p1.1),[Table 1](https://arxiv.org/html/2605.06030#S3.T1),[§4\.1](https://arxiv.org/html/2605.06030#S4.SS1.p1.1),[§4](https://arxiv.org/html/2605.06030#S4.p1.1),[§4](https://arxiv.org/html/2605.06030#S4.p2.1),[§4](https://arxiv.org/html/2605.06030#S4.p3.1)\.
- J\. Nivre, M\. De Marneffe, F\. Ginter, Y\. Goldberg, J\. Hajic, C\. D\. Manning, R\. McDonald, S\. Petrov, S\. Pyysalo, N\. Silveira, and R\. Tsarfaty \(2016\)Universal dependencies v1: a multilingual treebank collection\.InProceedings of the Tenth International Conference on Language Resources and Evaluation \(LREC’16\),pp\. 1659–1666\.Cited by:[§3\.1](https://arxiv.org/html/2605.06030#S3.SS1.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. F\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 27730–27744\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2605.06030#S1.p4.1)\.
- V\. Padmakumar and H\. He \(2024\)Does writing with language models reduce content diversity?\.External Links:2309\.05196,[Link](https://arxiv.org/abs/2309.05196)Cited by:[§2](https://arxiv.org/html/2605.06030#S2.p3.1)\.
- C\. Pollard and I\. A\. Sag \(1994\)Head\-Driven Phrase Structure Grammar\.Studies in Contemporary Linguistics,The University of Chicago Press and CSLI Publications,Chicago, IL and Stanford, CA\.Cited by:[§3\.1](https://arxiv.org/html/2605.06030#S3.SS1.p1.1)\.
- M\. M\. Rashid, N\. Atilgan, J\. Dobres, S\. Day, V\. Penkova, M\. Küçük, S\. R\. Clapp, and B\. D\. Sawyer \(2024\)Humanizing ai in education: a readability comparison of llm and human\-created educational content\.InProceedings of the Human Factors and Ergonomics Society Annual Meeting,Vol\.68,pp\. 596–603\.Cited by:[§1](https://arxiv.org/html/2605.06030#S1.p1.1)\.
- A\. Reinhart, B\. Markey, M\. Laudenbach, K\. Pantusen, R\. Yurko, G\. Weinberg, and D\. W\. Brown \(2025\)Do llms write like humans? variation in grammatical and rhetorical styles\.Proceedings of the National Academy of Sciences122\(8\),pp\. e2422455122\.Cited by:[§1](https://arxiv.org/html/2605.06030#S1.p1.1)\.
- A\. Shypula, S\. Li, B\. Zhang, V\. Padmakumar, K\. Yin, and O\. Bastani \(2025\)Evaluating the diversity and quality of llm generated content\.arXiv preprint arXiv:2504\.12522\.Cited by:[§2](https://arxiv.org/html/2605.06030#S2.p3.1)\.
- E\. H\. Simpson \(1949\)Measurement of diversity\.Nature163,pp\. 688\.External Links:[Document](https://dx.doi.org/10.1038/163688a0)Cited by:[§3\.3](https://arxiv.org/html/2605.06030#S3.SS3.p5.1)\.
- I\. F\. Spellerberg and P\. J\. Fedor \(2003\)A tribute to claude shannon \(1916–2001\) and a plea for more rigorous use of species richness, species diversity and the ‘shannon–wiener’ index\.Global Ecology and Biogeography12\(3\),pp\. 177–179\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1046/j.1466-822X.2003.00015.x),[Link](https://onlinelibrary.wiley.com/doi/abs/10.1046/j.1466-822X.2003.00015.x),https://onlinelibrary\.wiley\.com/doi/pdf/10\.1046/j\.1466\-822X\.2003\.00015\.xCited by:[§3\.3](https://arxiv.org/html/2605.06030#S3.SS3.p4.1)\.
- R\. Srivastava \(2025\)External Links:[Link](https://medium.com/@raj-srivastava/how-llms-turned-the-em-dash-into-a-villain-technical-nuances-b564857adc3b)Cited by:[§5\.5](https://arxiv.org/html/2605.06030#S5.SS5.p2.1)\.
- E\. Stamatatos \(2009\)A survey of modern authorship attribution methods\.Journal of the American Society for Information Science and Technology60\(3\),pp\. 538–556\.External Links:[Document](https://dx.doi.org/10.1002/asi.21001),[Link](https://www.researchgate.net/publication/220435062)Cited by:[§1](https://arxiv.org/html/2605.06030#S1.p2.1)\.
- O\. Zamaraeva, D\. Flickinger, F\. Bond, and C\. Gómez\-Rodríguez \(2025\)Comparing LLM\-generated and human\-authored news text using formal syntactic theory\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 9041–9060\.External Links:[Link](https://aclanthology.org/2025.acl-long.443/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.443),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2605.06030#S1.p3.1),[§2](https://arxiv.org/html/2605.06030#S2.p2.1),[§3\.1](https://arxiv.org/html/2605.06030#S3.SS1.p1.1),[Table 1](https://arxiv.org/html/2605.06030#S3.T1),[§3](https://arxiv.org/html/2605.06030#S3.p1.1),[§4\.1](https://arxiv.org/html/2605.06030#S4.SS1.p1.1),[§4](https://arxiv.org/html/2605.06030#S4.p1.1)\.

## Appendices

## Appendix AParsability

Tables[7](https://arxiv.org/html/2605.06030#A1.T7)and[8](https://arxiv.org/html/2605.06030#A1.T8)show that newer instruction\-tuned LLMs produce texts that are easier to parse than both human\-authored English news text and earlier LLM outputs\.

ProfileItemsParsedLengthShortFrgmtTimeSpace\>\>Limit2023 \(RAM limit 21G\)%toks/S%%sec/SGb/S%*nyt\-2023\-human**26092**93\.4**22\.33**33**13**11\.5**2\.3**4\.3*falcon07\-2023\-llm2776997\.718\.6237104\.40\.91\.1llama07\-2023\-llm3782596\.419\.4235126\.01\.21\.8llama13\-2023\-llm3780097\.118\.6038135\.11\.01\.4llama30\-2023\-llm3756896\.918\.1739124\.91\.01\.3llama65\-2023\-llm3810796\.418\.7637125\.71\.11\.7mistral7b\-2023\-llm3508697\.318\.3638134\.70\.91\.12025 \(RAM limit 31G\)*nyt\-2025\-human**24053**94\.8**22\.19**32**12**12\.2**2\.4**2\.6*qwen14\-2025\-llm2649898\.225\.63427\.21\.60\.6qwen32\-2025\-llm3489298\.425\.62326\.71\.30\.5qwen72\-2025\-llm3461498\.526\.05226\.01\.20\.3llama70\-2025\-llm3930697\.929\.87129\.52\.00\.9gpt4o\-2025\-llm5054498\.525\.99124\.61\.00\.2mistral7i\-2025\-llm3970898\.325\.87427\.41\.40\.6Table 7:Parsing statistics for human\-authored \(in italics\) and LLM\-generated news datasets from 2023 and 2025\. Each row reports the percentage of sentences successfully parsed by the ERG, average sentence length in tokens, proportion of short sentences \(≤15\\leq 15tokens\), proportion of fragments, mean CPU time and memory per sentence, and proportion of sentences exceeding resource limits\.ProfileTime \(CPU\-seconds/sent\)Space \(Gbytes/sent\)\(length in tokens\)31\-3536\-4041\-4546\-5031\-3536\-4041\-4546\-50*nyt\-2023\-human**13**28**48**67**2\.6**5\.2**9\.1**13\.1*falcon07\-2023\-llm112444682\.24\.58\.313\.6llama07\-2023\-llm142650682\.75\.09\.513\.4llama13\-2023\-llm132850652\.65\.39\.513\.3llama30\-2023\-llm142851692\.65\.39\.413\.3llama65\-2023\-llm143050672\.75\.59\.413\.1mistral7b\-2023\-llm132748732\.55\.18\.814\.1*nyt\-2025\-human**13**30**52**84**2\.7**5\.9**10\.1**16\.2*qwen14\-2025\-llm102549772\.35\.19\.815\.3qwen32\-2025\-llm102450782\.04\.48\.714\.0qwen72\-2025\-llm92142671\.84\.07\.612\.1llama70\-2025\-llm81738611\.73\.57\.311\.7gpt4o\-2025\-llm81736541\.63\.36\.59\.8mistral7i\-2025\-llm102550812\.04\.58\.714\.0Table 8:Average parsing cost by sentence\-length bin for human \(in italics\) and LLM\-generated texts\. CPU time and memory consumption per sentence \(in seconds and GB, respectively\) for sentences binned by length \(31–50 tokens\)\.

Similar Articles

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

arXiv cs.CL

This paper diagnoses the low diversity in LLM-generated stories, finding that 88.3% of sampled stories contain one of 11 common words (e.g., Elias, lighthouse) across models, and traces this homogeneity to post-training data and alignment rather than prevalence in pre-training data.

Content for Content’s Sake

Armin Ronacher

The author investigates how LLMs are influencing word usage in coding and everyday language, finding that words favored by LLMs show increased frequency in both coding sessions and Google Trends, raising concerns about humans adopting LLM writing styles.

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Hugging Face Daily Papers

This paper investigates the alignment of LLM-generated reviews with human judgment using 1k real ACL 2025 submissions, finding limited agreement, instability across models/prompts, and a method to artificially inflate scores without meaningful changes. The authors advise against relying solely on LLM reviews and call for discussion on their use in handling increasing submission volumes.

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

arXiv cs.CL

This paper geometrically analyzes why LLMs acting as judges agree strongly with each other but weakly with humans, finding that inter-LLM consensus reflects a collapsed subspace rather than true human alignment on subjective rubrics. Post-hoc calibration on human data improves alignment, but even calibrated LLMs fall short of human reliability.