What Are LLMs Doing to Scientific Communication? Measuring Changes in Writing Practices and Reading Experience

arXiv cs.CL Papers

Summary

This paper investigates how the growing use of LLMs in writing is altering scientific communication, using a corpus of ACL papers and synthetic data to show lexical and stylistic changes, and connects these to subjective reading experience via expert annotations.

arXiv:2605.19936v1

# What Are LLMs Doing to Scientific Communication? Measuring Changes in Writing Practices and Reading Experience
Source: [https://arxiv.org/html/2605.19936](https://arxiv.org/html/2605.19936)
###### Abstract

Has the style of scientific communication changed due to the growing use of large language models in the writing process? We address this question in the domain of Natural Language Processing by leveraging two data resources we create: a naturalistic corpus of over 37,000 papers from the ACL Anthology \(2020–2024\); and a synthetic dataset of 3,000 human\-written passages and their LLM\-generated improvements\. We first implement a series of diachronic lexical analyses, showing that both word frequency and usage contexts have changed significantly over time, indicating semantic specialization in some cases and generalization in others\. Broadening our perspective, we then model a range of more complex stylistic features and find that LLM\-modified texts more frequently contain certain syntactic constructions, more complex and longer words and a lower lexical diversity\. Finally, we connect these changes in writing practices to subjective reading experience through a pilot annotation study with 20 domain experts\. They overall rate LLM\-improved texts as more understandable and exciting, but also express negative qualitative attitudes towards LLMs, highlighting the strongly subjective effect of AI\-assisted writing on reading experience\.

Keywords: AI\-assisted writing, scientific communication, language change


Filip Miletić\* and Neele Falk\* \(\*Equal contribution\.\)

Institute for Natural Language Processing, University of Stuttgart, Germany

\{filip\.miletic, neele\.falk\}@ims\.uni\-stuttgart\.de

## 1\. Introduction

Large language models \(LLMs\) are increasingly used to assist human writing, including in high\-stakes domains such as scientific communication\. The rapid and pervasive nature of these changes raises the question of the ways in which they may be altering prevalent writing practices \(e\.g\., lexical and stylistic choices\), and of the subsequent effect of those practices on reading experience \(e\.g\., perceived clarity and trustworthiness of texts\)\.

While evidence of these changes is beginning to emerge, prior work is limited in two major respects\. First, several recent studies examine the distinctive linguistic properties of LLM\-generated scientific text Ma et al\. \([2023](https://arxiv.org/html/2605.19936#bib.bib19)\); Muñoz\-Ortiz et al\. \([2024](https://arxiv.org/html/2605.19936#bib.bib23)\); Zanotto and Aroyehun \([2024](https://arxiv.org/html/2605.19936#bib.bib45)\); Zamaraeva et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib44)\), but they generally compare human\-written texts with entirely model\-generated ones\. This clear\-cut distinction oversimplifies finer\-grained practices: LLMs are typically used to improve human\-written text rather than generate entire passages Koller et al\. \([2024](https://arxiv.org/html/2605.19936#bib.bib14)\); Kobak et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib13)\); moreover, real\-life documents tend to alternate between human\-only and LLM\-improved writing rather than contain a uniform amount of generated text Lee et al\. \([2022](https://arxiv.org/html/2605.19936#bib.bib16)\); Richburg et al\. \([2024](https://arxiv.org/html/2605.19936#bib.bib32)\)\. The second major shortcoming is the focus on identifying distinctive properties of generated text without systematically measuring their effect on human readers\. Even where included, such measurements are limited to broad patterns such as the ability to distinguish human vs\. LLM writing Gao et al\. \([2023](https://arxiv.org/html/2605.19936#bib.bib6)\); Ma et al\. \([2023](https://arxiv.org/html/2605.19936#bib.bib19)\)\. As a result, the link between objective differences in writing style and more subjective but vital dimensions of reading experience remains to be established\.

This paper aims to provide a more realistic and comprehensive assessment of LLM use in scientific communication\. We design our study so as to capture the collaborative nature of human–LLM writing and the uneven spread of such interventions within a document, as well as to explicitly connect the distinctive features of these writing practices with subjective reading experience\. We conduct our analysis on NLP papers from the ACL Anthology, and define two periods of around two years each, respectively preceding and following the release of ChatGPT in November 2022\. We view the two periods as reflecting community\-level writing practices which do not include any vs\. may include some writing by public\-facing LLMs\. Complementing this naturalistic scenario, we simulate the real\-life use of LLMs in a more controlled synthetic setting: we sample 3,000 extracts from pre\-ChatGPT papers and generate model\-improved versions of those\. We pose the following research questions:

RQ1: In what way did core linguistic choices change between these two periods?

RQ2: To what extent are more complex stylistic properties specific to each of the two periods?

RQ3: Do these differences in writing practices give rise to different reading experiences?

We first assess differences in linguistic choices by deploying a series of diachronic lexical analyses: statistical corpus measures to identify emergent terms; type\-level word embeddings to characterize broad changes in their semantic properties; and token\-level word embeddings to automatically retrieve their time\-specific uses\. Broadening our focus, we then investigate how different linguistic features — such as text length, sentiment, grammatical and lexical variability, and readability — contribute to explaining variation between human and LLM\-assisted writings through regression analysis\. Finally, we conduct an annotation study contrasting human\-written texts with their LLM\-produced improvements, and ask 20 domain experts for ratings of reading experience in terms of clarity, authenticity, trustworthiness, and excitement\.

We provide the following contributions\. \(1\) We show that post\-ChatGPT papers are distinguished by more complex lexical choices \(e\.g\., *enhance* rather than *improve*\) and further stylistic properties \(e\.g\., lower lexical diversity\)\. By comparing naturalistic data from the ACL Anthology and synthetic data from text generation experiments, we confirm that these writing practices can be attributed to LLM use\. \(2\) We further find that these stylistic changes are linked to differences in subjective reading experience, with LLM\-improved texts perceived as clearer and more exciting\. \(3\) We release an updated version of the ACL\-OCL corpus \(Rohatgi et al\., [2023](https://arxiv.org/html/2605.19936#bib.bib33)\), containing PDF\-extracted text of 99\.3k papers from the ACL Anthology\. We also provide a one\-line script to ingest future papers\. \(4\) We release 3,000 pairs of human\-written texts and their LLM\-produced improvements, and annotations of human reading experience for 200 pairs\. Data and code are available at [https://github\.com/FilipMiletic/ScientificCommunication](https://github.com/FilipMiletic/ScientificCommunication)\.

## 2\. Related Work

#### Detection of AI\-generated content\.

Paralleling the rise in popularity of LLMs, recent years have seen growing interest in automatic detection of AI\-generated content\. This includes the release of datasets and benchmarks to train detection tools and evaluate different methods \(e\.g\., Chen et al\., [2023](https://arxiv.org/html/2605.19936#bib.bib2); Li et al\., [2024](https://arxiv.org/html/2605.19936#bib.bib17); Guo et al\., [2023](https://arxiv.org/html/2605.19936#bib.bib10); Dugan et al\., [2024](https://arxiv.org/html/2605.19936#bib.bib5); Macko et al\., [2023](https://arxiv.org/html/2605.19936#bib.bib20); Wang et al\., [2024](https://arxiv.org/html/2605.19936#bib.bib39)\)\. Techniques for detecting AI\-generated text include watermarking Zhao et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib46)\), fine\-tuning transformer\-based classifiers Guggilla et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib9)\), using model\-related features \(Wu et al\., [2025](https://arxiv.org/html/2605.19936#bib.bib42)\) or linguistic features \(Hamed and Wu, [2023](https://arxiv.org/html/2605.19936#bib.bib12)\)\. \(We refer to the survey by Wu et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib42)\) for a comprehensive overview on detecting AI\-generated content\.\) While good results are generally achieved for AI\-generated content, the detection of human–AI coauthored text remains a major challenge and requires adaptation of existing models Richburg et al\. \([2024](https://arxiv.org/html/2605.19936#bib.bib32)\); Su et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib37)\)\. Early work includes the CoAuthor dataset Lee et al\. \([2022](https://arxiv.org/html/2605.19936#bib.bib16)\), which contains essays augmented with GPT\-3 suggestions, while more recent datasets focus on different variations of human–AI co\-authored texts \(e\.g\., human\-written then machine\-polished\) Wang et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib40)\)\.

#### Stylistic differences between human\-written and AI\-generated or AI\-modified text\.

Several works more directly explore the difference in stylistic features of human and AI\-generated text\. The domains most often covered are news articles Muñoz\-Ortiz et al\. \([2024](https://arxiv.org/html/2605.19936#bib.bib23)\); Zamaraeva et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib44)\), essays Akinwande et al\. \([2024](https://arxiv.org/html/2605.19936#bib.bib1)\) and abstracts of scientific articles Ma et al\. \([2023](https://arxiv.org/html/2605.19936#bib.bib19)\)\. Existing works examine features from a wide range of categories, such as the frequency of certain syntactic constructions, n\-grams, hedging, lexical complexity, rhetorical properties and sentiment\. Frequent linguistic peculiarities in AI\-generated content include, e\.g\., lower lexical variation Zanotto and Aroyehun \([2024](https://arxiv.org/html/2605.19936#bib.bib45)\); Akinwande et al\. \([2024](https://arxiv.org/html/2605.19936#bib.bib1)\); Yildiz Durak et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib43)\), more positive sentiment Muñoz\-Ortiz et al\. \([2024](https://arxiv.org/html/2605.19936#bib.bib23)\); Zamaraeva et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib44)\), fewer compounds Zamaraeva et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib44)\), and the excessive use of certain verbs and modifiers such as *delve*, *crucial*, or *intricate* Gray \([2024](https://arxiv.org/html/2605.19936#bib.bib7)\); Kobak et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib13)\); Reinhart et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib31)\)\.

Several works use these features to predict whether a text was human or AI\-generated and to identify the strongest predictor Ma et al\. \([2023](https://arxiv.org/html/2605.19936#bib.bib19)\); Desaire et al\. \([2023](https://arxiv.org/html/2605.19936#bib.bib3)\); Akinwande et al\. \([2024](https://arxiv.org/html/2605.19936#bib.bib1)\)\. Some works also investigate human perception of LLM\-generated texts, e\.g\., Gao et al\. \([2023](https://arxiv.org/html/2605.19936#bib.bib6)\) and Hakam et al\. \([2024](https://arxiv.org/html/2605.19936#bib.bib11)\) find that human annotators struggle to distinguish between human and LLM\-generated scientific texts\. Russell et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib34)\) show that annotators with frequent LLM\-writing experience better detect generated news\. In Doru et al\. \([2025](https://arxiv.org/html/2605.19936#bib.bib4)\), participants classified scientific texts and rated their fluency, quality, and coherence\. Lin and Zhu \([2025](https://arxiv.org/html/2605.19936#bib.bib18)\) find that researchers use LLMs mainly to improve clarity and conciseness, leading to a more homogeneous writing style since ChatGPT’s release\.

Most prior studies focus on fully generated texts and rarely compare human–AI co\-authored vs\. human\-only writing\. For this reason, we compare articles published after ChatGPT’s release with those from shortly before, expecting weaker but detectable linguistic shifts even if only a fraction were LLM\-modified\. Further, frequent exposure to LLM\-generated language may lead researchers to unconsciously adopt its style\. Unlike most prior studies, we analyze full papers rather than abstracts, since LLM use likely occurs across all sections\. While LLM\-related vocabulary and linguistic features have been studied, they are rarely examined together, and existing research often focuses on surface\-level trends\. Lexical choices, in particular, remain underexplored beyond frequency\-based analyses, despite well\-established methods for modeling semantic change \(Tahmasebi et al\., [2021](https://arxiv.org/html/2605.19936#bib.bib38); Schlechtweg, [2023](https://arxiv.org/html/2605.19936#bib.bib36)\)\. Finally, the subjective perception of LLM\-generated content in scientific texts has hardly been studied, which is why we complement our data\-driven analysis with a pilot study of reading experience with 20 domain experts\.

## 3\. Data

We now present our two English\-language data resources: a naturalistic corpus of NLP papers from the ACL Anthology \(henceforth *original* dataset\); and a synthetic dataset of human\-written passages and their LLM\-generated improvements \(henceforth *LLM* dataset\)\.

### 3\.1\. ACL Anthology Corpus

Since the focus of our work is on scientific communication in the NLP community, we analyze the papers from the ACL Anthology \([https://aclanthology\.org](https://aclanthology.org/)\), the open\-access publication repository of the Association for Computational Linguistics \(ACL\)\. As our starting point, we use the ACL\-OCL corpus \(Rohatgi et al\., [2023](https://arxiv.org/html/2605.19936#bib.bib33)\) containing ca\. 73,000 papers\. They were obtained by crawling the Anthology website for PDF files and then extracting the full text using GROBID \([https://github\.com/kermitt2/grobid](https://github.com/kermitt2/grobid)\)\.

The content in the original corpus ends in September 2022 and to our knowledge has not been updated since\. We therefore implement an update to bring its temporal span closer to the present day\. We also note two other recurrent problems\. Some papers from the original time span are available in the Anthology but not included in the corpus, possibly due to coverage issues during crawling\. Other papers are included in the corpus, but are associated with metadata without full text content due to file issues \(e\.g\., failed extraction with GROBID or PDF missing from the Anthology at the time of the crawl, especially for very early conferences\)\.

In our update, we do not crawl the Anthology website but instead use its BibTeX export as the most comprehensive structured record of available papers\. We rely on BibTeX information to identify the missing papers based on citation keys, extract their metadata, and reconstruct their URLs\. We download the corresponding PDF files and then use GROBID to extract paper text\. This process also recovers the textual content for a subset of papers lacking it in the original corpus; we remove any remaining papers without textual content\. We accompany the updated corpus with code \(run as a one\-line command\) which checks the locally available papers against those in the Anthology and passes any missing papers through the full update pipeline\.

The updated corpus contains 99\.2k papers published until the end of 2024\. For the purposes of our study, we define a subcorpus structured into two time periods around a critical point in time regarding LLM use: the release of ChatGPT in November 2022\. The first time period \($t_1$\) contains papers published from 2020 to 2022\. Its last major conference event is EMNLP 2022, which had a camera\-ready deadline in October of that year\. The second time period \($t_2$\) contains papers published in the second half of 2023 and in 2024\. It begins with ACL 2023, whose camera\-ready deadline was in May of that year, i\.e\., six months after the release of ChatGPT\. The gap between the two periods ensures a clear distinction between them while limiting the effect of changes in topic over time\. We only retain papers from events that took place in both time periods\.

While we assume that these sampling constraints ensure good comparability of $t_1$ and $t_2$, we also inspect it more closely by running a topic analysis\. We find some topical shifts in line with the evolution of the field \(e\.g\., a stronger focus on individual levels of linguistic structure in $t_1$, and prominence of more recent machine learning methods in $t_2$\), but the analysis overall confirms broad topical comparability of the two time periods\. Detailed results are reported in Appendix [A](https://arxiv.org/html/2605.19936#A1)\.

We preprocess PDF\-extracted text using spaCy \([https://spacy\.io](https://spacy.io/); model en\_core\_web\_sm\)\. We segment the text into sentences, which are then tokenized, lemmatized, and POS\-tagged\. We run a subset of analyses at the paragraph level, which we operationally define as non\-overlapping windows of five sentences\. The final corpus structure is shown in Table [1](https://arxiv.org/html/2605.19936#S3.T1)\.

Table 1: Distribution of papers across time periods

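
The paragraph definition used in Section 3\.1 (non\-overlapping windows of five sentences) can be illustrated with a minimal sketch. The function name `paragraph_windows` is hypothetical, and keeping a shorter final window is an assumption: the text does not say how trailing sentences are handled.

```python
def paragraph_windows(sentences, size=5):
    """Group sentences into non-overlapping windows of `size` sentences.

    Assumption: a final window with fewer than `size` sentences is kept.
    """
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


# Toy input standing in for the sentence-segmented text of one paper.
sentences = [f"Sentence {i}." for i in range(1, 13)]
windows = paragraph_windows(sentences)

for index, window in enumerate(windows):
    print(index, len(window))
```

With 12 toy sentences this yields two full windows and one shorter remainder.
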
### 3\.2\. LLM\-Assisted Paraphrases

The original sub\-corpus described in Section [3\.1](https://arxiv.org/html/2605.19936#S3.SS1) reflects the realistic scenario in which $t_2$ contains a hybrid form of human and LLM co\-authored text\. To test whether the patterns that emerge from this corpus are similar to those in texts explicitly modified by LLMs, we replicate this scenario in a controlled setup and construct a dataset with texts from $t_1$ paired with GPT\-improved paraphrases, thus offering a clear gold label \(human vs\. LLM\-modified\)\.

We select a random sample of 3,000 publications from $t_1$ published in 2022 and, for each of these papers, a random paragraph with a minimum length of 100 tokens\. The selected paragraph is chosen from the initial paper paragraphs to make sure that it mostly spans text from the introduction\. In the next step, we develop 10 different prompts that scientists frequently use during the writing process to refine their texts\. To identify these prompts, we conducted an anonymous survey in which colleagues and students were asked to share the prompts they frequently use to improve their own writing\. The 10 final prompts include both more general requests \(*Improve the following*; *Please polish the following text*\) as well as prompts which ask for the improvement of specific text dimensions \(*Please improve the coherence of the text*; *Refine grammar, tone, and readability*\)\. We then prompt GPT\-3\.5\-turbo for each of the 3,000 original texts, randomly choosing one of the 10 prompts, which results in the final corpus\. The final dataset \(referred to as *LLM*\) consists of 6k paragraphs, 3k human\-written and 3k LLM\-modified\.

Note that over the full period covered by our naturalistic corpus, researchers may have used a range of different LLMs, particularly toward the end of $t_2$ when newer models became available \(e\.g\., GPT\-4, Claude\)\. However, at the time of publication of most papers in our dataset, the most widely available and commonly used system was ChatGPT based on GPT\-3\.5\. Therefore, we used GPT\-3\.5\-turbo to generate paraphrases in our experiments\.
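
As a rough illustration of the random prompt assignment described in this section, the sketch below pairs each passage with one randomly chosen prompt. Only the four prompts quoted in the text are included, since the remaining six are not listed here. The function names, the fixed seed, and the way prompt and passage are joined are all assumptions, and the call to the model itself is deliberately left out.

```python
import random

# Only the four prompts quoted in the text; the other six are not listed there.
PROMPTS = [
    "Improve the following",
    "Please polish the following text",
    "Please improve the coherence of the text",
    "Refine grammar, tone, and readability",
]


def assign_prompts(passages, prompts=PROMPTS, seed=0):
    """Pair every passage with one randomly chosen prompt.

    Assumption: a fixed seed is used so that the assignment is reproducible.
    """
    rng = random.Random(seed)
    return [(rng.choice(prompts), passage) for passage in passages]


def build_request(prompt, passage):
    """Hypothetical layout of the text sent to the model (not from the source)."""
    return f"{prompt}:\n\n{passage}"


passages = ["First human-written paragraph.", "Second human-written paragraph."]
pairs = assign_prompts(passages)
requests = [build_request(prompt, passage) for prompt, passage in pairs]
```
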

## 4\. Characterizing Writing Changes via Lexical Choices

We begin by addressing RQ1: Are there core linguistic choices which changed between $t_1$ and $t_2$ in connection with LLM use? Focusing on the lexical level, we identify words with the strongest changes in rates of use and then characterize them in terms of finer\-grained patterns of semantic change\.

![Refer to caption](https://arxiv.org/html/2605.19936v1/figs/lexical.png)

Figure 1: Target words with strongest differences in rates of use in $t_1$ vs\. $t_2$ \(top 10 per time period\)\. X\-axis: log\-likelihood score; negative values indicate higher frequency in $t_1$\. Y\-axis: $\Delta ND$; higher values indicate an increase in neighborhood density over time \(i\.e\., restriction of usage contexts\); for grayed\-out targets, the difference in neighborhood density between $t_1$ and $t_2$ is not statistically significant\. Color coding: dark blue targets also appear in the top 100 terms when contrasting original vs\. LLM\-paraphrased texts\.

### 4\.1\. Experimental Setup

#### Preprocessing

Starting from preprocessed paper text \(cf\. Section [3\.1](https://arxiv.org/html/2605.19936#S3.SS1)\), we examine content words \(nouns, verbs, adjectives, and adverbs\) in the shared vocabulary of $t_1$ and $t_2$\. We lowercase all lemmas and retain those that are at least three characters long and contain only alphabetic characters\.
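
The vocabulary filter described above can be sketched as follows. This is an illustrative reading, not the authors' code: the helper names are hypothetical, the tags are assumed to be the coarse universal POS labels produced by spaCy, and storing vocabulary entries as (lemma, tag) pairs is an assumption.

```python
# Coarse POS tags assumed to correspond to nouns, verbs, adjectives, adverbs.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def keep_lemma(lemma, pos):
    """Apply the filters from the text: content word, >= 3 chars, alphabetic."""
    lemma = lemma.lower()
    return pos in CONTENT_POS and len(lemma) >= 3 and lemma.isalpha()


def vocabulary(tokens):
    """Collect lowercased (lemma, pos) pairs that pass the filter."""
    return {(lemma.lower(), pos) for lemma, pos in tokens if keep_lemma(lemma, pos)}


def shared_vocabulary(tokens_t1, tokens_t2):
    """Keep only entries attested in both time periods."""
    return vocabulary(tokens_t1) & vocabulary(tokens_t2)


# Toy (lemma, pos) pairs standing in for the tagged corpora.
tokens_t1 = [("Model", "NOUN"), ("use", "VERB"), ("be", "VERB"),
             ("GPT-3", "NOUN"), ("the", "DET"), ("prompt", "NOUN")]
tokens_t2 = [("model", "NOUN"), ("utilize", "VERB"), ("use", "VERB"),
             ("prompt", "NOUN"), ("of", "ADP")]
shared = shared_vocabulary(tokens_t1, tokens_t2)
```

Note that `str.isalpha()` also accepts non\-ASCII letters; whether the original filter does the same is not stated in the text.
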

#### Corpus statistics

On the simplest level of analysis, we quantify the extent to which a word’s rate of use has changed between $t_1$ and $t_2$\. To determine the strength of the change, we compute the log\-likelihood score \(Rayson and Garside, [2000](https://arxiv.org/html/2605.19936#bib.bib28)\), which compares the observed frequencies of a given word in two corpora while accounting for the expected frequencies based on total corpus size\. To determine the directionality of the change, we calculate the ratio of a word’s frequency in $t_2$ to its frequency in $t_1$ \(normalized per million tokens\)\. A value $<1$ is indicative of a term falling out of use over time, and vice versa\. We calculate these statistics both for the original corpus and the LLM dataset\.
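
The two statistics above can be sketched in a few lines. The log\-likelihood follows the standard two\-corpus formulation, $LL = 2 \sum_i O_i \ln(O_i / E_i)$ with $E_i = N_i (O_1 + O_2) / (N_1 + N_2)$; the function names and the toy counts are invented for illustration and are not taken from the paper.

```python
import math


def log_likelihood(freq_1, freq_2, size_1, size_2):
    """Two-corpus log-likelihood: 2 * sum(observed * ln(observed / expected))."""
    total = freq_1 + freq_2
    expected_1 = size_1 * total / (size_1 + size_2)
    expected_2 = size_2 * total / (size_1 + size_2)
    score = 0.0
    for observed, expected in ((freq_1, expected_1), (freq_2, expected_2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def frequency_ratio(freq_1, freq_2, size_1, size_2):
    """Per-million frequency in the second corpus divided by that in the first.

    Assumes freq_1 > 0, which holds for words in the shared vocabulary.
    """
    per_million_1 = freq_1 / size_1 * 1_000_000
    per_million_2 = freq_2 / size_2 * 1_000_000
    return per_million_2 / per_million_1


# Invented counts: a word seen 50 times in t1 and 150 times in t2,
# in two corpora of 1,000 tokens each.
ll = log_likelihood(50, 150, 1000, 1000)
ratio = frequency_ratio(50, 150, 1000, 1000)
print(round(ll, 2), ratio)
```

A ratio above 1 marks a word that gains ground in the second corpus; the log\-likelihood itself is unsigned, so direction has to come from the ratio.
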

#### Type\-level word embeddings

To better understand broad semantic patterns behind lexical choices, we train type\-level word embedding models\. We use word2vec \(Mikolov et al\., [2013](https://arxiv.org/html/2605.19936#bib.bib22)\) in the gensim implementation \(Řehůřek and Sojka, [2010](https://arxiv.org/html/2605.19936#bib.bib29)\), and set the algorithm to skip\-gram with negative sampling, window size to 5, vector dimensions to 100, minimum frequency to 10, and other hyperparameters to default values\. We train separate models for $t_1$ and $t_2$ \(original corpus\), and run the process three times to account for randomness across word2vec runs \(Pierrejean and Tanguy, [2018](https://arxiv.org/html/2605.19936#bib.bib26)\)\.

We characterize a word’s usage in a given time period via its distributional neighborhood density, which reflects semantic features such as polysemy and is therefore long\-established in research on language change \(Sagi et al\., [2009](https://arxiv.org/html/2605.19936#bib.bib35)\)\. To calculate a target word’s neighborhood density, we select its 100 nearest neighbors, and take the mean of the cosine similarity scores between the target and each neighbor\. We repeat the process for each of the three word2vec runs and take the average value\. We then calculate the change in neighborhood density for word $w$ between time periods $t_1$ and $t_2$ as $\Delta ND(w) = ND(w_{t_2}) - ND(w_{t_1})$\. An increase in density is indicative of a restriction in usage contexts, typical of semantic specialization; a drop in density indicates diversification of usage contexts, which is typical of semantic generalization\. We test the significance of changes in density using the Mann\-Whitney U test at the 0\.05 level\.
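
The neighborhood density computation can be sketched without any embedding library, treating each trained model as a plain mapping from words to vectors. All names here are hypothetical, the two\-dimensional vectors are toy values, and `k` is set to 2 only because the toy vocabulary is tiny (the text uses 100 neighbors and three runs per period).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighborhood_density(target, vectors, k=100):
    """Mean cosine similarity between the target and its k nearest neighbors."""
    similarities = sorted(
        (cosine(vectors[target], vec) for word, vec in vectors.items() if word != target),
        reverse=True,
    )
    nearest = similarities[:k]
    return sum(nearest) / len(nearest)


def delta_nd(target, runs_t1, runs_t2, k=100):
    """Average the density over runs in each period, then take t2 minus t1."""
    nd_t1 = sum(neighborhood_density(target, run, k) for run in runs_t1) / len(runs_t1)
    nd_t2 = sum(neighborhood_density(target, run, k) for run in runs_t2) / len(runs_t2)
    return nd_t2 - nd_t1


# Toy embedding spaces: the neighbors of "w" move closer to it in t2.
space_t1 = {"w": (1.0, 0.0), "a": (0.0, 1.0), "b": (1.0, 1.0)}
space_t2 = {"w": (1.0, 0.0), "a": (1.0, 0.0), "b": (1.0, 1.0)}
change = delta_nd("w", [space_t1], [space_t2], k=2)
```

A positive `change` corresponds to the denser neighborhood that the text associates with semantic specialization.
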

#### Token\-level word embeddings

Connecting broad semantic trends to occurrence\-level differences in usage, we implement an analysis using token\-level embeddings\. For a given target word, we collect the sentences in which it occurs in the original corpus; these are capped at 1,000 per time period and randomly subsampled if necessary\. We then obtain contextualized embeddings of the target word in each sentence using ModernBERT \(base model with 22 layers, 768 dimensions, 149M parameters; Warner et al\., [2025](https://arxiv.org/html/2605.19936#bib.bib41)\)\. We feed the model one sentence at a time, and retain the embedding corresponding to the target word in the last hidden state\. The obtained embeddings for each target word are then clustered using $k$\-means, for which we rely on the scikit\-learn implementation \(Pedregosa et al\., [2011](https://arxiv.org/html/2605.19936#bib.bib24)\) and set $k$ to 8\. For each cluster, we calculate the proportion of examples from $t_1$ and $t_2$, and draw on this information to identify stable vs\. shifting usages\.
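
The final step, computing the share of $t_1$ and $t_2$ examples in each cluster, can be sketched as below. The sketch assumes cluster labels have already been produced by a clustering step (for instance $k$\-means over the token embeddings), which is not reproduced here; the function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def period_shares(cluster_labels, periods):
    """For each cluster, return the proportion of examples from each period."""
    counts = defaultdict(Counter)
    for cluster, period in zip(cluster_labels, periods):
        counts[cluster][period] += 1

    shares = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        shares[cluster] = {period: counter[period] / total for period in ("t1", "t2")}
    return shares


# Toy assignment of nine occurrences of one target word to two clusters.
cluster_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
periods = ["t1", "t1", "t1", "t2", "t1", "t2", "t2", "t2", "t2"]
shares = period_shares(cluster_labels, periods)

for cluster, share in sorted(shares.items()):
    print(cluster, share)
```

A cluster with a clear majority from one period points to a usage that is receding or emerging, while a balanced cluster suggests a stable usage.
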

### 4\.2\. Results

On the most general level, we inspect changes in rates of use for the whole vocabulary based on the distribution of log\-likelihood scores\. This change is significant \(based on the critical log\-likelihood value of 15\.13 recommended by Rayson et al\. \([2004](https://arxiv.org/html/2605.19936#bib.bib27)\)\) for 13% \(9,353 out of 71,993\) of the words in the shared vocabulary as defined above\. The significant changes are near\-evenly distributed between drops \(45%\) and increases \(55%\) in frequency, matching the intuition that the obsolescence of most words is paralleled by a rise in prominence of their functional equivalents\. We further analyze the words with significant changes in rates of use by correlating their log\-likelihood score \(to which we assign a negative value for words whose frequency drops over time\) and the change in neighborhood density $\Delta ND$\. We find a negative correlation \(Spearman’s $\rho = -0.26$, $p \ll 0.01$\) suggesting that increased rate of use is associated with a lowering of neighborhood density, which typically reflects semantic generalization\. However, the limited strength of the correlation indicates that the process does not apply to the whole vocabulary\.
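
To make the correlation step concrete, the sketch below computes Spearman's $\rho$ between signed log\-likelihood scores and $\Delta ND$ values in plain Python, as the Pearson correlation of average ranks. The five data points are invented for illustration and do not come from the paper; the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for index in order[i:j + 1]:
            ranks[index] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values: negative scores mark words whose frequency drops over time.
signed_ll = [-120.0, -40.0, 35.0, 90.0, 300.0]
delta_nd_values = [0.03, 0.01, 0.02, -0.01, -0.04]
rho = spearman(signed_ll, delta_nd_values)
```
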

We now shift from vocabulary\-level trends to the most relevant individual words\. For each part of speech, we retain the words with the 10 highest positive and negative log\-likelihood scores, respectively capturing the strongest increases and drops in rate of use over time\. We plot these words in Figure [1](https://arxiv.org/html/2605.19936#S4.F1), with log\-likelihood on the x\-axis and change in neighborhood density $\Delta ND$ on the y\-axis\.

Changes in rates of use \(x\-axis\) often reflect patterns of lexical replacement\. Some reflect shifts in specialized topics, such as technical terms referring to core machine learning concepts dropping out of use \(e\.g\., *embedding*\), and those related to more recent approaches increasing their rate of use \(e\.g\., *prompt*\)\. Other patterns capture stylistic rather than terminological differences: general\-language expressions which are semantically broad and stylistically neutral tend to fall out of use, while more formal and specialized words gain prominence\. This trend is especially visible for verbs \(e\.g\., *use* vs\. *utilize*\), adjectives \(e\.g\., *good* vs\. *comprehensive*\), and adverbs \(e\.g\., *then* vs\. *subsequently*\)\.

But can these changes be attributed to LLM\-assisted writing? We inspect the log\-likelihood scores calculated on our LLM dataset, comparing original $t_1$ texts and their LLM\-assisted paraphrases\. We select the words with the 100 highest positive and negative log\-likelihood scores \(per part of speech\)\. The overlap between this set and the top\-10 sets from the original corpus is shown in Figure [1](https://arxiv.org/html/2605.19936#S4.F1) using dark blue markers\. The overlap is almost entirely limited to general\-language and not technical terms\. Given the topic\-controlled nature of the LLM corpus, this finding indicates that \(i\) the changes in the original corpus reflect two parallel trends: a topical and a stylistic shift; and \(ii\) the stylistic shift can be attributed to LLM use\.

We now analyze the semantic change mechanisms associated with different rates of use, as measured by the change in neighborhood density $\Delta ND$ \(y\-axis\) and by the temporal distribution over clusters of token\-level embeddings\. As an example, we focus on the verbs *ensure* and *utilize*: they both show an increase in frequency over time, but differ in semantic mechanisms\. The verb *ensure* shows an increase in neighborhood density, which is indicative of semantic specialization\. This trend is supported by clustered examples like the following:

1. \(1\) Lastly, we aim to measure the semantic similarity between generated questions to *ensure* that the questions assess the same content\.
2. \(2\) To *ensure* the high quality of the annotation procedure, we manually annotated a set of 200 control tasks\.

Example [1](https://arxiv.org/html/2605.19936#S4.I1.i1) is from a cluster dominated by older data \(65% $t_1$\) where *ensure* conveys a broad sense of finality across diverse contexts\. Example [2](https://arxiv.org/html/2605.19936#S4.I1.i2) comes from a more recent cluster \(57% $t_2$\)\. It is representative of the more specific meaning ‘to guarantee’, attested in a restricted set of transitive contexts usually referring to the quality of a scientific artifact\.

In contrast, *utilize* shows a slight expansion of distributional neighborhood, typical of semantic generalization, which is also borne out by these examples:

1. \(3\) The latter aims to *utilize* the multi\-level interests to enhance both conversation and recommendation tasks when users chat with system\.
2. \(4\) Finally, we *utilize* a ridge regression classifier to obtain final classification results\.

Example [3](https://arxiv.org/html/2605.19936#S4.I2.i3) is from an older cluster \(60% $t_1$\) and conveys the specific meaning of ‘use to the fullest potential’\. Example [4](https://arxiv.org/html/2605.19936#S4.I2.i4) comes from a cluster with more recent data \(56% $t_2$\) where *utilize* is attested with the broad meaning ‘make use of’\.

In summary, we identified a substantial portion of the vocabulary whose rate of use has shifted since the introduction of ChatGPT\. As further validation, in Appendix [B](https://arxiv.org/html/2605.19936#A2) we also present complete corpus statistics for the strongest changes in rates of use, extend the same analysis to multi\-word sequences, and analyze further clustering examples\. We consistently find that some changes are due to topical shifts within the scientific community, while others are more clearly attributable to the adoption of LLMs\. The changes are further reflected in shifting sense distributions, leading to semantic generalization in some cases and specialization in others\.

## 5\. Predicting Time Periods from Complex Linguistic Features

In the following section, we address RQ2: are there systematic stylistic differences between $t_1$ and $t_2$ that could be attributed to the use of LLMs as writing assistants?

![Refer to caption](https://arxiv.org/html/2605.19936v1/figs/combined_forest_plots.png)

Figure 2: Comparison of odds ratios for linguistic features between the original and LLM\-paraphrased datasets\. Odds ratios above 1 indicate that higher feature values are more characteristic of the post\-GPT / LLM\-paraphrased texts\. Horizontal bars represent 95% CIs \(approximated with the Wald method for the original dataset\)\. The vertical dashed line at 1 marks the point of no effect\.

### 5\.1\. Experimental Setup

We apply logistic regression as an analysis tool to find the linguistic markers that significantly contribute to explaining the outcome \(a binary variable representing whether a text is from $t_1$ or $t_2$\), after accounting for other relevant feature groups\. The advantage of this approach is that it allows us to model the relationship between each feature and the target variable: odds ratios above 1 indicate that higher values of a feature are characteristic of $t_2$ \(human \+ LLM assistant\), while odds ratios below 1 indicate that they are more characteristic of $t_1$ \(humans only\)\. We can also reveal which feature groups have the greatest impact\.
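As a brief illustration of how a logistic regression coefficient relates to the odds ratios and Wald intervals shown in Figure 2, the sketch below exponentiates a coefficient and its interval bounds\. The coefficient and standard error are made\-up values, and the function name is hypothetical\.

```python
import math


def odds_ratio_with_wald_ci(beta, std_err, z=1.96):
    """Turn a log-odds coefficient into (odds ratio, lower, upper) at ~95%."""
    return (
        math.exp(beta),
        math.exp(beta - z * std_err),
        math.exp(beta + z * std_err),
    )


# Made-up coefficient for a standardized feature (not a value from the paper).
odds_ratio, ci_low, ci_high = odds_ratio_with_wald_ci(beta=0.10, std_err=0.02)
percent_change = (odds_ratio - 1.0) * 100  # change in odds per 1 SD increase
print(f"OR={odds_ratio:.3f}  95% CI=[{ci_low:.3f}, {ci_high:.3f}]  "
      f"change in odds={percent_change:.1f}%")
```

An interval that lies entirely above 1 points towards $t_2$, one entirely below 1 points towards $t_1$, and an interval that contains 1 is compatible with no effect\.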

Since many factors may influence stylistic change in our real\-world data \(original dataset\), we repeat the same analysis in the controlled, synthetic setting \(LLM dataset\) to isolate stylistic effects specifically linked to LLM use\.

#### Preprocessing

We first extract around 1,000 linguistic features from six different feature groups with elfen Maurer \([2026](https://arxiv.org/html/2605.19936#bib.bib21)\) and LFTK Lee and Lee \([2023](https://arxiv.org/html/2605.19936#bib.bib15)\) for all paragraphs from $t_1$ and $t_2$ in the original dataset, as well as all paragraphs in the LLM dataset\. The features cover surface features \(e\.g\., word and sentence length\), morpho\-syntactic features \(e\.g\., occurrences of specific morphological constructions or part\-of\-speech tags\), syntactic features \(e\.g\., dependency relations and the complexity of dependency trees\), psycholinguistic features \(e\.g\., acquisition norms or average concreteness of words\), lexical\-semantic features \(e\.g\., measures of lexical diversity or occurrences of specific named entities\), and sentiment features \(polarity and emotion\-related words\)\. All features are standardized and scaled\.

#### Methodology

As a first step, we apply a filtering method to select a meaningful set of features as predictors: we keep only features with variance $> 0$ and perform a correlation analysis, retaining features with correlation $< 0.7$ and selecting one representative feature per highly correlated group\. This yields 145 initial features\.
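The filtering step can be sketched as follows\. This is only one possible reading: it assumes Pearson correlation and a greedy rule that keeps the first feature encountered as the representative of each correlated group, neither of which is specified in the text\. Feature names and values are invented\.

```python
import math


def variance(x):
    mean = sum(x) / len(x)
    return sum((a - mean) ** 2 for a in x) / len(x)


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_features(columns, max_corr=0.7):
    """columns maps feature name -> list of values; returns the kept names."""
    kept = []
    for name, values in columns.items():
        if variance(values) == 0:
            continue  # constant features carry no information
        # Keep the feature only if it is weakly correlated with all kept ones.
        if all(abs(pearson(values, columns[k])) < max_corr for k in kept):
            kept.append(name)
    return kept


toy = {
    "word_len": [4.0, 5.0, 6.0, 7.0],
    "char_count": [8.0, 10.0, 12.0, 14.0],  # perfectly correlated with word_len
    "n_commas": [1.0, 3.0, 0.0, 2.0],
    "constant": [1.0, 1.0, 1.0, 1.0],       # zero variance
}
selected = filter_features(toy)
print(selected)
```

In this toy input, `char_count` is dropped as redundant with `word_len` and `constant` is dropped for having zero variance\.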

We then perform stability selection using logistic regression with elastic net regularization on a balanced subset of 200k paragraphs from the original dataset, applying 5\-fold inner and 10\-fold outer cross\-validation to assess robustness\. For our final selection we remove all features that \(a\) are not selected in all folds \(a feature is selected when coefficient $> 0$\), \(b\) show inconsistent effects, \(c\) show less than $\pm$5% change in odds, and \(d\) are not significant \(we bootstrapped CIs to approximate significance\)\. This leaves 45 robust features\. To further reduce the number, we rank them by absolute change in odds and select the top features per feature group, resulting in 24 final features\. Finally, we refit the logistic regression on the full dataset for unbiased estimates\. For the synthetic scenario, we use a generalized mixed\-effects model with paper ID as a random effect to account for data dependencies\.
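Criteria \(a\) to \(c\) can be expressed as a simple check over the per\-fold coefficients of a feature\. The sketch below is an illustration under assumptions: "selected" is read as a nonzero coefficient, the change in odds is computed from the mean coefficient across folds, and the bootstrap criterion \(d\) is left out\. All coefficients are invented\.

```python
import math


def is_stable(fold_coefs, min_change=0.05):
    """fold_coefs: one feature's coefficients (log-odds) across CV folds."""
    selected_everywhere = all(c != 0.0 for c in fold_coefs)           # (a)
    consistent_sign = (all(c > 0 for c in fold_coefs)
                       or all(c < 0 for c in fold_coefs))             # (b)
    mean_coef = sum(fold_coefs) / len(fold_coefs)
    change_in_odds = abs(math.exp(mean_coef) - 1.0)                   # (c)
    return selected_everywhere and consistent_sign and change_in_odds >= min_change


# Invented per-fold coefficients for five hypothetical features.
coefs = {
    "avg_word_length": [0.12, 0.10, 0.11],
    "stopwords": [-0.20, -0.18, -0.22],
    "flipping": [0.10, -0.10, 0.10],   # inconsistent direction
    "sparse": [0.30, 0.00, 0.30],      # dropped by the penalty in one fold
    "tiny": [0.01, 0.02, 0.01],        # under 5% change in odds
}
stable = [name for name, values in coefs.items() if is_stable(values)]
print(stable)
```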

### 5\.2\. Results

The regression model on the original dataset achieves a mean AUC of 0\.65 $\pm$ 0\.002 and a pseudo $R^2$ of 5\.6% \(McFadden\) across the 10\-fold cross\-validation, which means that it explains a modest but meaningful amount of variance\. Using the same features as fixed effects in the mixed\-effects model explains 29% of the variance, which is a substantial portion and confirms that LLM\-modified texts exhibit particular stylistic patterns that are characterized well by the linguistic features\.
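For readers unfamiliar with McFadden's pseudo $R^2$, it compares the log\-likelihood of the fitted model with that of an intercept\-only model: $R^2 = 1 - \ln L_{\text{model}} / \ln L_{\text{null}}$\. The following sketch computes it on a toy example; the labels and predicted probabilities are invented, and the function names are hypothetical\.

```python
import math


def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of binary labels under predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(y_true, p_pred))


def mcfadden_r2(y_true, p_pred):
    """1 - LL(model) / LL(null), where the null model predicts the base rate."""
    base_rate = sum(y_true) / len(y_true)
    ll_model = log_likelihood(y_true, p_pred)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    return 1.0 - ll_model / ll_null


labels = [0, 0, 1, 1]            # 0 = t1, 1 = t2 (toy data)
probs = [0.4, 0.4, 0.6, 0.6]     # weakly informative predictions
r2 = mcfadden_r2(labels, probs)
print(f"McFadden pseudo R^2 = {r2:.3f}")
```

Values of this measure are typically much lower than those of the ordinary $R^2$ in linear regression, so they should not be compared on the same scale\.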

Figure [2](https://arxiv.org/html/2605.19936#S5.F2) visualizes the direction and strength of association between each feature and the outcome variable \($t_1$ vs\. $t_2$ or human vs\. GPT\-paraphrased\), grouped by the different feature groups\. Points to the right of the 1\.0 decision boundary indicate that higher feature values are associated with LLM use, while points to the left correspond to human\-written texts\. When blue and orange points appear on the same side, the pattern is consistent across both the original and LLM datasets\.

LLM use is characterized by longer and more complex words \(higher average word length and age of acquisition\), confirming the findings from the previous section\. LLM\-modified texts show greater entropy while still containing many familiar, high\-prevalence terms\. In contrast, human\-written texts use more stopwords, suggesting that LLM outputs are more semantically dense\. Stopword use, however, presents a contrasting pattern: while it shows a strong positive association in the original corpus, the relationship is reversed in the LLM\-modified texts\. Human\-written texts tend to be more varied, showing higher lexical diversity \(Simpson’s D\) and greater verb variability, although this pattern is not entirely consistent\.

Overall, the findings present a mixed picture of lexical variation and linguistic complexity: LLMs appear to produce more complex syntactic constructions and lexically dense content, yet with less lexical variety and a preference for familiar vocabulary\. Whether this results in improved or reduced clarity will be examined in the following section\.

In terms of syntactic and morpho\-syntactic structures, there are distinct patterns between human and LLM\-modified texts\. LLMs tend to use more negations and adverbial clauses, indicating a preference for more elaborated or qualified sentence structures with additional modifiers\. Human\-written texts, in contrast, are more strongly associated with conjunctions – a construction often simplified or replaced with more complex connectives by LLMs – and contain more compounds\.

We observe clear differences in punctuation use: human\-written texts more often include brackets, whereas LLM\-modified texts introduce more commas and possibly dashes\. LLM\-improved texts also contain more proper nouns, with the exception of organization entities, which occur more frequently in human\-written texts\. Similarly, date entities are more strongly associated with human\-written texts, as LLMs are less likely to generate references and often remove them when prompted to improve human originals\.

Sentiment patterns show a mixed picture: LLMs use more anger\-related words, while original texts contain more trust\-related words expressing confidence or credibility \(e\.g\., *confirm*, *promise*, *support*\) – although this trend is not confirmed in the paraphrased corpus\. Another interesting pattern is the higher frequency of sensorimotor words in LLM\-modified texts \(e\.g\., *enhance*, *highlight*\), possibly because LLMs were trained to make writing more dynamic and engaging\. This could point to a stylistic bias learned from human feedback and exposure to polished text\.

Finally, we investigate the relative importance of each feature\. Looking at the most important features, we find that dependency features \(e\.g\., adverbial clauses\) and punctuation have a major influence on both models, indicating a generalizable, strong pattern in terms of the difference between human and LLM\-generated texts\. Another strong feature in both datasets is the more frequent use of proper nouns in LLM\-modified texts\. The largest difference between the two datasets lies in their key predictive features: in the original corpus, variance is primarily explained by stopword use, whereas in the LLM dataset, it is driven by the average age of acquisition\.

## 6\. Measuring Reading Experience

While prior research has examined lexical and stylistic differences between human\- and LLM\-generated texts, little is known about how readers perceive these changes\. To address this gap, we conducted a pilot study to assess human perception of human\-only and LLM\-modified texts \(RQ3\)\.

To gain insight into this question, we look at four broader dimensions of reading experience: *clarity* measures whether a text is understandable and communicates the content clearly, *authenticity* captures to what extent the reader feels connected to the author and perceives the author as being genuine in their writing, *trustworthiness* captures the extent to which a text presents its arguments clearly, reliably, and credibly, while *excitement* assesses whether the reading experience is engaging and stimulates the reader’s interest\.

### 6\.1\. Annotation Setup

#### Study design

To find out whether readers prefer human\-written or LLM\-improved texts and along which dimensions, we designed the annotation study as a pairwise comparison task\. Given a pair of texts, text A and text B, annotators had to rate which one aligns more strongly with a certain statement\. We measure each dimension with two statements, for example, *I read this text smoothly and fluently* to measure the dimension of *clarity*\. Raters indicated their preference on a four\-point Likert scale \(strongly A, slightly A, slightly B, strongly B\)\. We asked 20 different domain experts \(NLP researchers\) with varying backgrounds and levels of seniority to annotate 20 text pairs each\. For each pair we collected annotations from two raters, resulting in 200 annotated instances\. Appendix [C](https://arxiv.org/html/2605.19936#A3) provides the guidelines and detailed questionnaire\.
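One way to turn such ratings into numeric preference scores is sketched below\. The numeric coding, the averaging over the two statements of a dimension, and the handling of text position are assumptions made for illustration; the sign convention \(negative for the LLM\-improved text, positive for the human original\) follows the caption of Table 2\.

```python
# Assumed numeric coding of the four-point scale (no neutral option).
LIKERT = {"strongly A": -2, "slightly A": -1, "slightly B": 1, "strongly B": 2}


def preference_score(rating, llm_position):
    """Negative = LLM-improved text preferred, positive = human original."""
    raw = LIKERT[rating]  # negative means text A was preferred
    # If the LLM text was shown as B, flip the sign so the convention holds.
    return raw if llm_position == "A" else -raw


def dimension_score(ratings, llm_position):
    """Average the scores of the statements that measure one dimension."""
    scores = [preference_score(r, llm_position) for r in ratings]
    return sum(scores) / len(scores)


# Hypothetical rater: two clarity statements, LLM-improved text shown as A.
clarity = dimension_score(["strongly A", "slightly A"], llm_position="A")
# Hypothetical rater: mixed answers, LLM-improved text shown as B.
authenticity = dimension_score(["strongly A", "slightly B"], llm_position="B")
print(clarity, authenticity)
```

Because the individual scale has no neutral point, a score of 0 can only arise from averaging opposing ratings\.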

#### Annotation data

We use a subset of the LLM dataset to \(a\) control for topic effects and \(b\) ensure clear gold labels\. To select good candidates, each original text and its paraphrase were converted into feature vectors capturing style \(linguistic features\) and semantics \(using SBERT embeddings; Reimers and Gurevych, [2019](https://arxiv.org/html/2605.19936#bib.bib30), model paraphrase\-multilingual\-mpnet\-base\-v2\)\. We calculated pairwise differences and ranked pairs by average distance\. We then manually selected 200 high\-quality pairs from the top 300, removing noisy instances, LLM artifacts, and formatting indicators\.
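The candidate ranking can be illustrated with the sketch below\. It assumes cosine distance for both the style and the semantic vectors, an unweighted mean of the two, and a descending sort so that the most strongly modified pairs come first; the text does not specify these choices\. The vectors are toy values rather than real linguistic features or SBERT embeddings\.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def rank_pairs(pairs):
    """Rank (original, paraphrase) pairs by mean style/semantic distance."""
    scored = []
    for pair in pairs:
        style_d = cosine_distance(pair["style_orig"], pair["style_llm"])
        sem_d = cosine_distance(pair["sem_orig"], pair["sem_llm"])
        scored.append((pair["id"], (style_d + sem_d) / 2))
    # Largest average distance first (assumed ranking direction).
    return sorted(scored, key=lambda item: item[1], reverse=True)


toy_pairs = [
    {"id": "p2", "style_orig": (1.0, 0.0), "style_llm": (1.0, 0.0),
     "sem_orig": (1.0, 1.0), "sem_llm": (1.0, 1.0)},
    {"id": "p1", "style_orig": (1.0, 0.0), "style_llm": (0.0, 1.0),
     "sem_orig": (1.0, 0.0), "sem_llm": (1.0, 0.0)},
]
ranked = rank_pairs(toy_pairs)
print(ranked)
```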

### 6\.2\. Results

#### Quantitative ratings

Figure[3](https://arxiv.org/html/2605.19936#S6.F3)shows the overall preference distribution of human raters in our annotation study\. We can see that raters tend to prefer LLM\-improved texts, especially on the dimensions of clarity and excitement\. However, there is also a substantial portion of texts where the human original was preferred, which shows that prompting an LLM to improve a scientific text does not always lead to the desired effect on the reader side\. Additionally, a considerable number of texts showed no clear preference for either version\.

![Refer to caption](https://arxiv.org/html/2605.19936v1/figs/distribution.png)

Figure 3: Distribution over Preferences in Human Raters\. Higher red portion indicates a larger preference for LLM\-improved texts, higher blue portion for the human originals\.

For each dimension, we measure whether the ratings significantly deviate from 0 \(neutral\) and quantify the effect size\. Table [2](https://arxiv.org/html/2605.19936#S6.T2) shows that the strongest preference for LLM\-improved texts can be observed in *clarity* and *excitement*\. The clear preference for LLM\-improved texts in *clarity* is unsurprising, as this aspect is often explicitly mentioned in the prompts\. Interestingly, even for dimensions not directly prompted, such as *authenticity*, LLM texts are still preferred more often, though the effect size is small\.

When comparing individual annotators, we also find notable differences\. Some clearly favor human\-written texts on the dimensions of *trustworthiness* and *authenticity*, while others strongly prefer the LLM\-modified versions, indicating that these two dimensions are perceived more subjectively\.
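Testing whether scores deviate from 0 with a rank\-based test can be sketched as follows\. The example computes the signed\-rank sums and, as one possible effect size, the matched\-pairs rank\-biserial correlation; the effect size measure actually used is not stated here, so this choice is an assumption, and the scores are invented\. A p\-value could be obtained from a statistics library, for instance `scipy.stats.wilcoxon`\.

```python
def signed_rank_summary(scores):
    """Return (W_plus, W_minus, rank_biserial) for scores tested against 0."""
    nonzero = [s for s in scores if s != 0]  # zeros carry no direction
    ordered = sorted(nonzero, key=abs)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Find the run of tied absolute values starting at position i.
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        ranks[abs(ordered[i])] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w_plus = sum(ranks[abs(s)] for s in nonzero if s > 0)
    w_minus = sum(ranks[abs(s)] for s in nonzero if s < 0)
    # Assumes at least one nonzero score; otherwise the ratio is undefined.
    return w_plus, w_minus, (w_plus - w_minus) / (w_plus + w_minus)


# Invented preference scores (negative = LLM-improved text preferred).
toy_scores = [-2, -1, -1, 0, 1, -2]
w_plus, w_minus, effect = signed_rank_summary(toy_scores)
print(w_plus, w_minus, round(effect, 3))
```

With the sign convention of Table 2, a negative rank\-biserial value indicates an overall preference for the LLM\-improved texts\.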

Table 2: Model preference scores \(range: \-2 to 2; 0 = no preference\)\. Negative values indicate preference for LLM\-paraphrased texts, positive values for human originals\. LLM paraphrases are favored for clarity and excitement \(strong direction, moderate effect size\)\. Significance assessed using the Wilcoxon rank test\.

#### Qualitative remarks

We conducted qualitative interviews with five participants after the annotation study\. They all described the annotation task as challenging; with the exception of clarity\-related ratings, they reported difficulties due to the subjective nature of most questions\. They further noted lower confidence particularly for the ratings on the intermediate points of the scale, underscoring the need for a nuanced reading of the results\.

Most participants made intuitive assumptions about which text in the pair was LLM\-based, but they also noted a limited degree of certainty\. They collectively reported a wide range of properties leading them to think a text was LLM\-assisted: limited variability in lexical choices and sentence length; the number of references and abbreviations; formatting, e\.g\., punctuation and bullet lists\. The diversity of these strategies and their limited accuracy highlight possible interactions between personal attitudes and reading experience\. A mismatch is further supported by the generally negative attitudes towards LLM writing expressed in this subgroup of participants vs\. the overall positive ratings collected\.

## 7\. Conclusion

In this work, we investigated the influence of LLMs on writing style in scientific communication, specifically in the NLP domain\. We relied on two data scenarios: a naturalistic corpus of over 37,000 scientific papers published in the two years preceding vs\. following the release of ChatGPT, and a synthetic dataset of 3,000 human\-written passages and their LLM\-generated improvements\.

With regard to word usage, we found that both word frequency and the contexts in which words are used have changed significantly, indicating semantic specialization in some cases and generalization in others\. We also found specific stylistic features that distinguish LLM\-modified texts from human texts\. For example, LLM\-modified texts more frequently contain certain syntactic constructions, more complex and longer words and a lower lexical diversity\. Crucially, trends in word usage as well as stylistic features are broadly consistent across the naturalistic and synthetic data scenarios, indicating that many observed shifts in writing practices can be attributed to growing LLM use\. Finally, in a pilot study, we measured the impact of these changes in writing style on the reading experience\. The results indicate that LLM\-improved texts are perceived as more understandable and exciting\.

Our findings encourage further research along several dimensions\. These include a larger\-scale annotation study to verify the generalizability of the results; exploring prompt\- or model\-specific characteristics of writing style and word usage; and deploying the identified linguistic features in applied scenarios such as detecting AI\-generated content\.

## 8\. Limitations

We note several limitations of our work\. First, our analysis is framed as a binary comparison of language use across two time periods\. This approach has important practical benefits and is underpinned by clearly stated assumptions \(e\.g\., uncertainty regarding the precise degree of LLM use in the second time period\)\. However, a finer\-grained comparison of smaller time slices could provide a clearer picture regarding the development patterns of LLM\-supported stylistic choices\.

We also acknowledge that, prior to the release of ChatGPT, other AI\-assisted writing tools such as Grammarly were already available and may have been used during the first time period, potentially influencing some publications\. However, these tools primarily focus on grammar correction and minor stylistic suggestions rather than generating substantial new text or paraphrases\. For this reason, their potential impact on the linguistic patterns we analyze is likely more limited compared to modern large language models\. A larger empirical study comparing the stylistic differences between traditional writing tools and LLMs as writing assistants would be a valuable direction for future work\.

More generally, our study is limited to a single scientific domain \(Natural Language Processing\) and one European language \(English\)\. Since academic writing conventions are highly specific to disciplines as well as languages, replicating this analysis on more diverse datasets would help assess the generalizability of the trends we report\.

Finally, we ran a pilot annotation study with 20 participants, collecting two judgments per instance\. A larger\-scale setup with more annotations per instance would provide more robust results\.

## 9\. Acknowledgments

We are grateful to our 20 volunteer annotators for the time and effort they put into this study\. Filip Miletić was supported by DFG Research Grant SCHU 2580/5\-2 \(Computational Models of Semantic Variation in Multi\-Word Expressions across Speakers and Languages\)\.

## 10\. Bibliographical References


- Akinwande et al\. \(2024\)Mayowa Akinwande, Oluwaseyi Adeliyi, and Toyyibat Yussuph\. 2024\.[Decoding ai and human authorship: Nuances revealed through nlp and statistical analysis](https://doi.org/10.5121/ijci.2024.130408)\.*International Journal on Cybernetics & Informatics*, 13\(4\):85–103\.
- Chen et al\. \(2023\)Yutian Chen, Hao Kang, Vivian Zhai, Liangze Li, Rita Singh, and Bhiksha Raj\. 2023\.[Token prediction as implicit classification to identify LLM\-generated text](https://doi.org/10.18653/v1/2023.emnlp-main.810)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 13112–13120, Singapore\. Association for Computational Linguistics\.
- Desaire et al\. \(2023\)Heather Desaire, Aleesa E\. Chua, Madeline Isom, Romana Jarosova, and David Hua\. 2023\.[Distinguishing academic science writing from humans or chatgpt with over 99% accuracy using off\-the\-shelf machine learning tools](https://doi.org/10.1016/j.xcrp.2023.101426)\.*Cell Reports Physical Science*, 4\(6\):101426\.
- Doru et al\. \(2025\)Berin Doru, Christoph Maier, Johanna Sophie Busse, Thomas Lücke, Judith Schönhoff, Elena Enax\-Krumova, Steffen Hessler, Maria Berger, and Marianne Tokic\. 2025\.[Detecting artificial intelligence–generated versus human\-written medical student essays: Semirandomized controlled study](https://doi.org/10.2196/62779)\.*JMIR Medical Education*, 11:e62779\.
- Dugan et al\. \(2024\)Liam Dugan, Alyssa Hwang, Filip Trhlík, Andrew Zhu, Josh Magnus Ludan, Hainiu Xu, Daphne Ippolito, and Chris Callison\-Burch\. 2024\.[RAID: A shared benchmark for robust evaluation of machine\-generated text detectors](https://doi.org/10.18653/v1/2024.acl-long.674)\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 12463–12492, Bangkok, Thailand\. Association for Computational Linguistics\.
- Gao et al\. \(2023\)Catherine A\. Gao, Frederick M\. Howard, Nikolay S\. Markov, Emma C\. Dyer, Siddhi Ramesh, Yuan Luo, and Alexander T\. Pearson\. 2023\.[Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers](https://doi.org/10.1038/s41746-023-00819-6)\.*npj Digital Medicine*, 6\(1\)\.
- Gray \(2024\)Andrew Gray\. 2024\.[ChatGPT "contamination": estimating the prevalence of LLMs in the scholarly literature](https://arxiv.org/abs/2403.16887)\.*arXiv*, abs/2403\.16887\.
- Grootendorst \(2022\)Maarten Grootendorst\. 2022\.[BERTopic: Neural topic modeling with a class\-based TF\-IDF procedure](https://arxiv.org/abs/2203.05794)\.*arXiv*, abs/2203\.05794\.
- Guggilla et al\. \(2025\)Chinnappa Guggilla, Budhaditya Roy, Trupti Chavan, Abdul Rahman, and Edward Bowen\. 2025\.[AI generated text detection using instruction fine\-tuned large language and transformer\-based models](https://arxiv.org/abs/2507.05157)\.*arXiv*, abs/2507\.05157\.
- Guo et al\. \(2023\)Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu\. 2023\.[How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection](https://doi.org/10.48550/ARXIV.2301.07597)\.*arXiv*, abs/2301\.07597\.
- Hakam et al\. \(2024\)Hassan Tarek Hakam, Robert Prill, Lisa Korte, Bruno Lovreković, Marko Ostojić, Nikolai Ramadanov, and Felix Muehlensiepen\. 2024\.[Human\-written vs AI\-generated texts in orthopedic academic literature: Comparative qualitative analysis](https://doi.org/10.2196/52164)\.*JMIR Formative Research*, 8:e52164\.
- Hamed and Wu \(2023\)Ahmed Abdeen Hamed and Xindong Wu\. 2023\.[Improving detection of ChatGPT\-generated fake science using real publication text: Introducing xFakeBibs a supervised\-learning network algorithm](https://doi.org/10.48550/ARXIV.2308.11767)\.*arXiv*, abs/2308\.11767\.
- Kobak et al\. \(2025\)Dmitry Kobak, Rita González\-Márquez, Emőke\-Ágnes Horvát, and Jan Lause\. 2025\.[Delving into LLM\-assisted writing in biomedical publications through excess vocabulary](https://doi.org/10.1126/sciadv.adt3813)\.*Science Advances*, 11\(27\)\.
- Koller et al\. \(2024\)Daphne Koller, Andrew Beam, Arjun Manrai, Euan Ashley, Xiaoxuan Liu, Judy Gichoya, Chris Holmes, James Zou, Noa Dagan, Tien Y\. Wong, David Blumenthal, and Isaac Kohane\. 2024\.[Why we support and encourage the use of large language models in NEJM AI submissions](https://doi.org/10.1056/aie2300128)\.*NEJM AI*, 1\(1\)\.
- Lee and Lee \(2023\)Bruce W\. Lee and Jason Lee\. 2023\.[LFTK: Handcrafted features in computational linguistics](https://aclanthology.org/2023.bea-1.1)\.In*Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2023\)*, pages 1–19, Toronto, Canada\. Association for Computational Linguistics\.
- Lee et al\. \(2022\)Mina Lee, Percy Liang, and Qian Yang\. 2022\.[CoAuthor: Designing a human\-AI collaborative writing dataset for exploring language model capabilities](https://doi.org/10.1145/3491102.3502030)\.In*Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems*, CHI ’22, New York, NY, USA\. Association for Computing Machinery\.
- Li et al\. \(2024\)Yafu Li, Qintong Li, Leyang Cui, Wei Bi, Zhilin Wang, Longyue Wang, Linyi Yang, Shuming Shi, and Yue Zhang\. 2024\.[MAGE: Machine\-generated text detection in the wild](https://doi.org/10.18653/v1/2024.acl-long.3)\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 36–53, Bangkok, Thailand\. Association for Computational Linguistics\.
- Lin and Zhu \(2025\)Cong William Lin and Wu Zhu\. 2025\.[Divergent LLM adoption and heterogeneous convergence paths in research writing](https://doi.org/10.48550/ARXIV.2504.13629)\.*arXiv*, abs/2504\.13629\.
- Ma et al\. \(2023\)Yongqiang Ma, Jiawei Liu, Fan Yi, Qikai Cheng, Yong Huang, Wei Lu, and Xiaozhong Liu\. 2023\.[AI vs\. human – differentiation analysis of scientific content generation](https://doi.org/10.48550/ARXIV.2301.10416)\.*arXiv*, abs/2301\.10416\.
- Macko et al\. \(2023\)Dominik Macko, Robert Moro, Adaku Uchendu, Jason Lucas, Michiharu Yamashita, Matúš Pikuliak, Ivan Srba, Thai Le, Dongwon Lee, Jakub Simko, and Maria Bielikova\. 2023\.[MULTITuDE: Large\-scale multilingual machine\-generated text detection benchmark](https://doi.org/10.18653/v1/2023.emnlp-main.616)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 9960–9987, Singapore\. Association for Computational Linguistics\.
- Maurer \(2026\)Maximilian Maurer\. 2026\.elfen: A python package for efficient linguistic feature extraction for natural language datasets\.In*Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations*, Rabat, Morocco\. Association for Computational Linguistics\.
- Mikolov et al\. \(2013\)Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean\. 2013\.[Efficient estimation of word representations in vector space](http://arxiv.org/abs/1301.3781)\.In*1st International Conference on Learning Representations, ICLR 2013, Workshop Track Proceedings*, Scottsdale, Arizona, USA\.
- Muñoz\-Ortiz et al\. \(2024\)Alberto Muñoz\-Ortiz, Carlos Gómez\-Rodríguez, and David Vilares\. 2024\.[Contrasting linguistic patterns in human and LLM\-generated news text](https://doi.org/10.1007/s10462-024-10903-2)\.*Artificial Intelligence Review*, 57\(10\)\.
- Pedregosa et al\. \(2011\)Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay\. 2011\.Scikit\-learn: Machine learning in Python\.*Journal of Machine Learning Research*, 12:2825–2830\.
- Pei et al\. \(2022\)Jiaxin Pei, Aparna Ananthasubramaniam, Xingyao Wang, Naitian Zhou, Apostolos Dedeloudis, Jackson Sargent, and David Jurgens\. 2022\.[POTATO: The portable text annotation tool](https://doi.org/10.18653/v1/2022.emnlp-demos.33)\.In*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pages 327–337, Abu Dhabi, UAE\. Association for Computational Linguistics\.
- Pierrejean and Tanguy \(2018\)Benedicte Pierrejean and Ludovic Tanguy\. 2018\.[Towards qualitative word embeddings evaluation: Measuring neighbors variation](https://doi.org/10.18653/v1/N18-4005)\.In*Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop*, pages 32–39, New Orleans, Louisiana, USA\. Association for Computational Linguistics\.
- Rayson et al\. \(2004\)Paul Rayson, Damon Berridge, and Brian Francis\. 2004\.Extending the cochran rule for the comparison of word frequencies between corpora\.In*JADT 2004 : 7es Journées internationales d’Analyse statistique des Données Textuelles*\.
- Rayson and Garside \(2000\)Paul Rayson and Roger Garside\. 2000\.[Comparing corpora using frequency profiling](https://doi.org/10.3115/1117729.1117730)\.In*The Workshop on Comparing Corpora*, pages 1–6, Hong Kong, China\. Association for Computational Linguistics\.
- Řehůřek and Sojka \(2010\)Radim Řehůřek and Petr Sojka\. 2010\.Software framework for topic modelling with large corpora\.In*Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks*, pages 45–50, Valletta, Malta\. ELRA\.
- Reimers and Gurevych \(2019\)Nils Reimers and Iryna Gurevych\. 2019\.[Sentence\-BERT: Sentence embeddings using Siamese BERT\-Networks](https://doi.org/10.18653/v1/D19-1410)\.In*Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\)*, pages 3982–3992, Hong Kong, China\. Association for Computational Linguistics\.
- Reinhart et al\. \(2025\)Alex Reinhart, Ben Markey, Michael Laudenbach, Kachatad Pantusen, Ronald Yurko, Gordon Weinberg, and David West Brown\. 2025\.[Do LLMs write like humans? Variation in grammatical and rhetorical styles](https://doi.org/10.1073/pnas.2422455122)\.*Proceedings of the National Academy of Sciences*, 122\(8\)\.
- Richburg et al\. \(2024\)Aquia Richburg, Calvin Bao, and Marine Carpuat\. 2024\.[Automatic authorship analysis in human\-AI collaborative writing](https://aclanthology.org/2024.lrec-main.165/)\.In*Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\)*, pages 1845–1855, Torino, Italia\. ELRA and ICCL\.
- Rohatgi et al\. \(2023\)Shaurya Rohatgi, Yanxia Qin, Benjamin Aw, Niranjana Unnithan, and Min\-Yen Kan\. 2023\.[The ACL OCL corpus: Advancing open science in computational linguistics](https://doi.org/10.18653/v1/2023.emnlp-main.640)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 10348–10361, Singapore\. Association for Computational Linguistics\.
- Russell et al\. \(2025\)Jenna Russell, Marzena Karpinska, and Mohit Iyyer\. 2025\.[People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI\-generated text](https://doi.org/10.18653/v1/2025.acl-long.267)\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 5342–5373, Vienna, Austria\. Association for Computational Linguistics\.
- Sagi et al\. \(2009\) Eyal Sagi, Stefan Kaufmann, and Brady Clark\. 2009\. [Semantic density analysis: Comparing word meaning across time and phonetic space](https://aclanthology.org/W09-0214/)\. In *Proceedings of the Workshop on Geometrical Models of Natural Language Semantics*, pages 104–111, Athens, Greece\. Association for Computational Linguistics\.
- Schlechtweg \(2023\) Dominik Schlechtweg\. 2023\. [*Human and Computational Measurement of Lexical Semantic Change*](http://dx.doi.org/10.18419/opus-12833)\. Stuttgart, Germany\.
- Su et al\. \(2025\) Zhixiong Su, Yichen Wang, Herun Wan, Zhaohan Zhang, and Minnan Luo\. 2025\. [HACo\-det: A study towards fine\-grained machine\-generated text detection under human\-AI coauthoring](https://doi.org/10.18653/v1/2025.acl-long.1069)\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 22015–22036, Vienna, Austria\. Association for Computational Linguistics\.
- Tahmasebi et al\. \(2021\) Nina Tahmasebi, Lars Borin, and Adam Jatowt\. 2021\. [Survey of computational approaches to lexical semantic change](https://doi.org/10.5281/zenodo.5040302)\. In Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu, and Simon Hengchen, editors, *Computational Approaches to Semantic Change*, pages 1–91\. Language Science Press, Berlin\.
- Wang et al\. \(2024\) Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Toru Sasaki, Thomas Arnold, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, and Preslav Nakov\. 2024\. [M4: Multi\-generator, multi\-domain, and multi\-lingual black\-box machine\-generated text detection](https://doi.org/10.18653/v1/2024.eacl-long.83)\. In *Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 1369–1407, St\. Julian’s, Malta\. Association for Computational Linguistics\.
- Wang et al\. \(2025\) Yuxia Wang, Artem Shelmanov, Jonibek Mansurov, Akim Tsvigun, Nizar Habash, Alham Fikri Aji, Ekaterina Artemova, Zhuohan Xie, Jinyan Su, Rui Xing, Iryna Gurevych, and Preslav Nakov\. 2025\. [PAN’25 generative AI detection \(task 2\): Human\-AI collaborative text classification](https://doi.org/10.5281/ZENODO.14966981)\.
- Warner et al\. \(2025\) Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Griffin Thomas Adams, Jeremy Howard, and Iacopo Poli\. 2025\. [Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference](https://doi.org/10.18653/v1/2025.acl-long.127)\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 2526–2547, Vienna, Austria\. Association for Computational Linguistics\.
- Wu et al\. \(2025\) Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Lidia Sam Chao, and Derek Fai Wong\. 2025\. [A survey on LLM\-generated text detection: Necessity, methods, and future directions](https://doi.org/10.1162/coli_a_00549)\. *Computational Linguistics*, 51\(1\):275–338\.
- Yildiz Durak et al\. \(2025\) Hatice Yildiz Durak, Figen Eğin, and Aytuğ Onan\. 2025\. [A comparison of human‐written versus AI‐generated text in discussions at educational settings: Investigating features for ChatGPT, Gemini and BingAI](https://doi.org/10.1111/ejed.70014)\. *European Journal of Education*, 60\(1\)\.
- Zamaraeva et al\. \(2025\) Olga Zamaraeva, Dan Flickinger, Francis Bond, and Carlos Gómez\-Rodríguez\. 2025\. [Comparing LLM\-generated and human\-authored news text using formal syntactic theory](https://doi.org/10.18653/v1/2025.acl-long.443)\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 9041–9060, Vienna, Austria\. Association for Computational Linguistics\.
- Zanotto and Aroyehun \(2024\) Sergio E\. Zanotto and Segun Aroyehun\. 2024\. [Human variability vs\. machine consistency: A linguistic analysis of texts generated by humans and large language models](https://doi.org/10.48550/ARXIV.2412.03025)\. *arXiv*, abs/2412\.03025\.
- Zhao et al\. \(2025\) Xuandong Zhao, Sam Gunn, Miranda Christ, Jaiden Fairoze, Andres Fabrega, Nicholas Carlini, Sanjam Garg, Sanghyun Hong, Milad Nasr, Florian Tramèr, Somesh Jha, Lei Li, Yu\-Xiang Wang, and Dawn Song\. 2025\. [SoK: Watermarking for AI\-generated content](https://doi.org/10.1109/SP61157.2025.00178)\. In *IEEE Symposium on Security and Privacy, SP 2025, San Francisco, CA, USA, May 12\-15, 2025*, pages 2621–2639\. IEEE\.

## Appendix A Topic analysis

We inspect the comparability of the two time periods by analyzing their distribution across topics\. We rely on BERTopic \(Grootendorst, [2022](https://arxiv.org/html/2605.19936#bib.bib8)\): given a set of documents, it computes document embeddings using a pretrained transformer model, clusters those embeddings, and then represents the topics \(corresponding to the obtained clusters\) by identifying a set of distinctive keywords\. We set the minimum topic size to 100 documents and the representation model to Maximal Marginal Relevance, and use default values for all other parameters\. For each paper, we use the concatenation of its title and abstract\. Papers initially classified as outliers are assigned to the best\-fitting topic based on the probabilities computed in the soft\-clustering step over document embeddings\. We include the trained topic model in our corpus update pipeline to ensure consistency of topic labels for future papers\.
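The outlier reassignment step can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes that the soft-clustering step yields one probability vector per document (one entry per topic) and that outliers carry the label -1, which is BERTopic's convention. The function name `reassign_outliers` and the toy inputs are hypothetical.

```python
from typing import List, Sequence

OUTLIER = -1  # label used for outlier documents (BERTopic's convention)


def reassign_outliers(topics: Sequence[int],
                      probabilities: Sequence[Sequence[float]]) -> List[int]:
    """Assign each outlier document to its most probable topic.

    topics[i] is the hard topic label of document i (OUTLIER if unassigned);
    probabilities[i][k] is the soft-clustering probability of topic k.
    """
    reassigned = []
    for label, probs in zip(topics, probabilities):
        if label == OUTLIER:
            # The best-fitting topic is the index of the largest probability.
            label = max(range(len(probs)), key=probs.__getitem__)
        reassigned.append(label)
    return reassigned


# Toy input (hypothetical): four documents, three topics.
topics = [0, -1, 2, -1]
probabilities = [
    [0.90, 0.05, 0.05],
    [0.10, 0.70, 0.20],
    [0.05, 0.05, 0.90],
    [0.40, 0.35, 0.25],
]
print(reassign_outliers(topics, probabilities))  # [0, 1, 2, 0]
```

Documents that already have a topic keep it; only the outlier labels are replaced.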

For a given topic, we calculate what proportion of the papers assigned to it come from $t_1$ vs\. $t_2$ \(after normalizing the counts by the total number of papers in the respective time period\)\. We show sample topics with different temporal distributions in Table [3](https://arxiv.org/html/2605.19936#A1.T3)\. Topics that are overrepresented in one of the two periods reflect general shifts in research trends within NLP\. For example, papers focusing on individual levels of linguistic structure \(e\.g\., word sense disambiguation, topic 41; dependency parsing, 45\) and related methods \(word embeddings, 19\) are more frequent in $t_1$\. Those concerned with LLMs \(13, 54\) and more recent methods such as reinforcement learning \(62\) are more prevalent in $t_2$\. Beyond these rather intuitive differences, 39 out of 70 topics \(62% of all papers\) have a broadly balanced temporal distribution \(normalized proportion of papers from the dominant time period ≤ 60%\)\. Together with the fact that every topic contains papers from both $t_1$ and $t_2$, we interpret these findings as confirming the overall comparability of the two time periods, without stark topical shifts likely to skew the outcome of our experiments\.
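The normalization described above can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' code: each period's topic count is first divided by that period's total number of papers, and the two resulting rates are then rescaled to sum to 1. The function names and the paper counts in the example are hypothetical.

```python
from typing import Tuple


def normalized_shares(n_t1: int, n_t2: int,
                      total_t1: int, total_t2: int) -> Tuple[float, float]:
    """Return the normalized share of a topic's papers from t1 and from t2."""
    rate_t1 = n_t1 / total_t1  # fraction of all t1 papers that fall in this topic
    rate_t2 = n_t2 / total_t2  # fraction of all t2 papers that fall in this topic
    total_rate = rate_t1 + rate_t2  # non-zero as long as the topic has any paper
    return rate_t1 / total_rate, rate_t2 / total_rate


def is_balanced(share_t1: float, share_t2: float, threshold: float = 0.60) -> bool:
    """A topic is balanced if the dominant period's share is at most the threshold."""
    return max(share_t1, share_t2) <= threshold


# Hypothetical topic: 90 papers in t1 and 20 in t2, with 1,000 papers in t1
# and 2,000 papers in t2 overall.
share_t1, share_t2 = normalized_shares(90, 20, 1000, 2000)
print(f"{share_t1:.0%} vs. {share_t2:.0%}", is_balanced(share_t1, share_t2))
# 90% vs. 10% False
```

Normalizing before comparing matters because the two periods differ in size: in the example, the raw counts alone (90 vs. 20) would understate how strongly the topic is concentrated in the smaller period.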

Table 3: Sample topics with different temporal distributions\. Percentages show the normalized proportion of papers from each time period within a topic; total shows topic size as the raw number of papers\.

## Appendix B Additional Linguistic Examples

We present additional examples from our lexical analysis introduced in Section [4](https://arxiv.org/html/2605.19936#S4)\. We provide more detailed corpus statistics for single\-word lexical choices; extend the same analysis to multi\-word sequences, operationalized as word 5\-grams; and then elaborate on our clustering analysis by discussing more target words and sample sentences\.

### B\.1\. Single\-Word Lexical Choices

We present the strongest changes in single\-word lexical choices for nouns \(Table [4](https://arxiv.org/html/2605.19936#A3.T4)\), adjectives \(Table [5](https://arxiv.org/html/2605.19936#A3.T5)\), verbs \(Table [6](https://arxiv.org/html/2605.19936#A3.T6)\), and adverbs \(Table [7](https://arxiv.org/html/2605.19936#A3.T7)\)\. Each table shows the 10 words with the strongest increase \(top panel\) and decrease \(bottom panel\) in frequency over time, as measured by the log\-likelihood score\. Examples are shown both for the naturalistic corpus \(left panel\) and the LLM\-modified corpus \(right panel\)\. The following columns are shown:

- $Freq_{t_1}$ – frequency per million words in $t_1$
- $Freq_{t_2}$ – frequency per million words in $t_2$
- $LL$ – log\-likelihood score
- $ND_{t_1}$ – neighborhood density in $t_1$
- $ND_{t_2}$ – neighborhood density in $t_2$
- $\Delta ND$ – change in neighborhood density in $t_2$
- $U$ – Mann\-Whitney U test statistic for the comparison of neighborhood densities
- $p$ – significance levels for the Mann\-Whitney U test: \*\*\* $< 0.001$; \*\* $< 0.01$; \* $< 0.05$; ns $\geq 0.05$

Naturalistic corpus targets marked with an asterisk also appear in the top 100 strongest changes in the LLM\-modified corpus \(for the respective part of speech and increase/decrease direction\)\.
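For readers unfamiliar with the frequency columns, the sketch below shows how a per-million frequency and a log-likelihood score can be computed from raw counts. The exact formulation used in the paper is defined in Section 4 and is not restated here; the code assumes the two-corpus log-likelihood commonly used in corpus linguistics (Rayson and Garside's formulation), so it should be read as an approximation of the procedure. Function names and counts are hypothetical.

```python
import math


def per_million(count: int, corpus_size: int) -> float:
    """Relative frequency of a word, expressed per million words."""
    return count / corpus_size * 1_000_000


def log_likelihood(count_t1: int, count_t2: int,
                   size_t1: int, size_t2: int) -> float:
    """Two-corpus log-likelihood score (assumed formulation, see lead-in).

    Expected counts are what each period would contain if the word were
    equally frequent in both; the score grows as observed counts diverge.
    """
    total = count_t1 + count_t2
    expected_t1 = size_t1 * total / (size_t1 + size_t2)
    expected_t2 = size_t2 * total / (size_t1 + size_t2)
    score = 0.0
    for observed, expected in ((count_t1, expected_t1), (count_t2, expected_t2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Hypothetical counts: a word seen 120 times in 1M words (t1)
# and 480 times in 1.5M words (t2).
print(round(per_million(480, 1_500_000), 1))
print(round(log_likelihood(120, 480, 1_000_000, 1_500_000), 2))
```

The score itself is unsigned. Whether a word counts as rising or falling is decided by comparing its relative frequencies in the two periods.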

### B\.2\. Multi\-Word Lexical Choices

We perform a follow\-up to our core analysis of single\-word lexical choices, aiming to understand if comparable patterns can also be observed in longer sequences of words\. We operationally define these as word 5\-grams \(lemmatized and part\-of\-speech tagged\), retaining only those where each of the five constituent lemmas is composed of alphabetic characters only and is at least two characters long\. We then collect frequency counts in $t_1$ and $t_2$ for both the naturalistic and the LLM\-modified corpus, and compute the log\-likelihood score using the same procedure as for individual words\. We restrict our analysis to the 5\-grams which appear at least 10 times in both $t_1$ and $t_2$ \(for the corresponding corpus\)\. The results are summarized in Table [8](https://arxiv.org/html/2605.19936#A3.T8), which shows the 40 strongest rises and falls in use for the naturalistic and the LLM\-modified corpus\.
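The 5-gram filtering criteria can be sketched as below. This is a simplified illustration under stated assumptions: sentences are already lemmatized and tagged, n-grams do not cross sentence boundaries (the paper does not specify this), and `str.isalpha` is used as the alphabetic check, which also accepts non-ASCII letters. All function names and the tag labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (lemma, part-of-speech tag)


def valid_lemma(lemma: str) -> bool:
    """Keep lemmas made of alphabetic characters only, at least two long."""
    return len(lemma) >= 2 and lemma.isalpha()


def count_ngrams(sentences: Iterable[List[Token]], n: int = 5) -> Counter:
    """Count n-grams of (lemma, tag) pairs whose lemmas all pass the filter."""
    counts: Counter = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            window = tuple(sentence[i:i + n])
            if all(valid_lemma(lemma) for lemma, _ in window):
                counts[window] += 1
    return counts


def shared_ngrams(counts_t1: Counter, counts_t2: Counter,
                  min_count: int = 10) -> Set[tuple]:
    """Keep n-grams that occur at least min_count times in both periods."""
    return {gram for gram, c in counts_t1.items()
            if c >= min_count and counts_t2.get(gram, 0) >= min_count}


sentence = [("we", "PRON"), ("can", "AUX"), ("see", "VERB"),
            ("that", "SCONJ"), ("the", "DET"), ("model", "NOUN")]
counts = count_ngrams([sentence])
for gram in counts:
    print(" ".join(lemma for lemma, _ in gram))
# we can see that the
# can see that the model
```

The retained 5-grams would then be scored with the same log-likelihood procedure as individual words, using their counts in each period.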

Some of the word sequences identified by the analysis point to the general evolution of the NLP community in terms of dominant methods \(e\.g\., *language model such as bert* falling out of use\) and writing conventions \(e\.g\., *name or uniquely identify individual* growing in use, presumably as part of the Responsible NLP Checklist\)\. But many other sequences clearly reflect the stylistic patterns already observed for single words\. For example, the naturalistic corpus shows a decrease in meta\-narrative devices relying on simple vocabulary \(e\.g\., *we can see that the*, *result show that our model*\) and an increase in more formal equivalent expressions \(e\.g\., *it be evident that the*, *provide valuable insight into*\)\. The LLM\-modified corpus shows the same trend, with straightforward expressions falling out of use \(e\.g\., *paper be organize as follow*, *it have be show that*\), and their formal equivalents becoming more prominent \(e\.g\., *paper be structure as follow*, *it have be observe that*\)\. As in the single\-word analysis, the naturalistic and the LLM\-modified corpus overlap in their most prominent stylistic changes, and these changes generally involve more complex lexical choices\. These findings further support the view that the stylistic changes in the naturalistic corpus can be at least partly attributed to LLM\-assisted writing\.

### B\.3\. Clustering Examples

We provide further examples from our clustering analysis to illustrate fine\-grained usage differences for the following target words: *ensure* \(Table [9](https://arxiv.org/html/2605.19936#A3.T9)\), *utilize* \(Table [10](https://arxiv.org/html/2605.19936#A3.T10)\), *crucial* \(Table [11](https://arxiv.org/html/2605.19936#A3.T11)\), and *notably* \(Table [12](https://arxiv.org/html/2605.19936#A3.T12)\)\. For each target word, we include two sample clusters capturing distinct uses, with four representative sentences manually selected for each cluster\. For ease of reading, the examples are shown in a keyword\-in\-context format\.

## Appendix CHuman Annotation

In the human annotation, we measured four different dimensions of reading experience, each recorded with 2 items\. The items for all dimensions are listed in Table [13](https://arxiv.org/html/2605.19936#A3.T13)\. The annotation guidelines are presented in Figure [4](https://arxiv.org/html/2605.19936#A3.F4), and an example of one item and the annotator view is shown in Figure [5](https://arxiv.org/html/2605.19936#A3.F5)\. Annotation was conducted using the Potato annotation tool \(Pei et al\., [2022](https://arxiv.org/html/2605.19936#bib.bib25)\)\. The full configuration, dataset and annotation results are provided via the repository: [https://github\.com/FilipMiletic/ScientificCommunication](https://github.com/FilipMiletic/ScientificCommunication)

Table 4: Strongest changes in noun usage\.

Table 5: Strongest changes in adjective usage\.

Table 6: Strongest changes in verb usage\.

Table 7: Strongest changes in adverb usage\.

Table 8: Strongest changes in the use of word 5\-grams\. Shown columns: $t_1$: frequency in $t_1$ per million words; $t_2$: frequency in $t_2$ per million words; $LL$: log\-likelihood score\.

Table 9: Example uses of the verb *ensure*\. Top cluster: broad sense of finality similar to conjunctions like *in order to* \(65% $t_1$ vs\. 35% $t_2$; 431 sentences\)\. Bottom cluster: more specific sense ‘to guarantee’ \(43% $t_1$ vs\. 57% $t_2$; 363 sentences\)\. The growing use of the more specialized second sense is consistent with an increasing neighborhood density \($\Delta ND = 0.013$\)\.

Table 10: Example uses of the verb *utilize*\. Top cluster: more specific sense ‘use to the fullest potential’ \(60% $t_1$ vs\. 40% $t_2$; 281 sentences\)\. Bottom cluster: more general sense ‘make use of’ \(44% $t_1$ vs\. 56% $t_2$; 354 sentences\)\. The growing use of the broader second sense is consistent with a falling neighborhood density \($\Delta ND = -0.026$\)\.

Table 11: Example uses of the adjective *crucial*\. Top cluster: finality\-connoted sense ‘important in determining an outcome’ \(58% $t_1$ vs\. 42% $t_2$; 146 sentences\)\. Bottom cluster: more general sense ‘important, significant’ \(38% $t_1$ vs\. 62% $t_2$; 208 sentences\)\. The growing use of the broader second sense is consistent with a falling neighborhood density \($\Delta ND = -0.018$\)\.

Table 12: Example uses of the adverb *notably*\. Top cluster: semantically broad usage typically introducing an example, similar to ‘especially, particularly’ \(63% $t_1$ vs\. 37% $t_2$; 171 sentences\)\. Bottom cluster: intensifier\-like sense ‘very’ \(44% $t_1$ vs\. 56% $t_2$; 309 sentences\)\. The intensification use is restricted to a relatively small set of cooccurrents \(typically gradable adjectives\), which aligns with an increasing neighborhood density \($\Delta ND = 0.029$\)\.

Table 13: The following questions are used to measure four different dimensions of reading experience\.

![Refer to caption](https://arxiv.org/html/2605.19936v1/x1.png)

Figure 4: The introduction and study description annotators saw before rating the pairs of human vs\. LLM\-paraphrased scientific text\.

![Refer to caption](https://arxiv.org/html/2605.19936v1/x2.png)

Figure 5: View of an example annotation item: pairwise annotation on four different dimensions of reading experience\.
