Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

arXiv cs.CL 06/08/26, 04:00 AM Papers
Summary
This paper proposes a framework for evaluating LLMs' ability to generate multiple responses to scientific queries at different language complexity levels. The study finds that models often vary complexity inconsistently, with Claude Sonnet 4.5 performing best but only shifting complexity correctly 46% of the time.
arXiv:2606.06788v1 Announce Type: new Abstract: Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with $16$ participants that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for $98$ scientific queries. While models vary complexity across responses, most changes remain inconsistent, with the best performing model (Claude Sonnet 4.5) only shifting reliable complexity measures in the correct direction $46\%$ of the time. Our findings hold with increased sample size and alternative complexity levels.
# Explain Like I’m 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses
Source: [https://arxiv.org/html/2606.06788](https://arxiv.org/html/2606.06788)
Indu Panigrahi and Tal August Siebel School of Computing and Data Science University of Illinois Urbana\-Champaign \{indup2, taugust\}@illinois\.edu

###### Abstract

Evaluations of large language models \(LLMs\) in scientific information seeking tasks have become increasingly use\-centric, such as conducting live or multi\-turn evaluations with real users\. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface\-specific criteria\. We propose a new evaluation framework based on a formative study with 16 participants that tests models’ ability to generate multiple responses to one query that differ along an interpretable axis of language \(language complexity\), inspired by direct manipulation interfaces from human\-centered design literature\. We evaluate GPT\-5\.1, GPT\-5 mini, Claude Sonnet 4\.5 \+ Thinking, and DeepSeek\-V3\.1 by generating 5 responses at different levels of language complexity for 98 scientific queries\. While models vary complexity across responses, most changes remain inconsistent, with the best performing model \(Claude Sonnet 4\.5\) only shifting reliable complexity measures in the correct direction 46% of the time\. Our findings hold with increased sample size and alternative complexity levels\.

## 1 Introduction

Current evaluations of large language models \(LLMs\) in information seeking tasks have become increasingly use\-centric in response to the rapid improvement of models and their integration into deployed systems\. For example, more evaluations focus on conducting live or multi\-turn evaluations with users \(Bragg et al\., [2026b](https://arxiv.org/html/2606.06788#bib.bib21)\)\. However, these evaluations generally assume a single, static user interface \(i\.e\., a chat interface\)\. As models are integrated into new interfaces \(e\.g\., reading or writing interfaces, Joshi and Vogel, [2026](https://arxiv.org/html/2606.06788#bib.bib47); Lee et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib45)\), evaluations must shift to incorporate interface\-specific criteria\.

We focus on one such task and use context: scientific information seeking with mixed\-expertise users\. Scientists increasingly use LLMs for reading \(Fok et al\., [2023](https://arxiv.org/html/2606.06788#bib.bib15); Lo et al\., [2023](https://arxiv.org/html/2606.06788#bib.bib22)\) and synthesizing literature \(Asai et al\., [2026](https://arxiv.org/html/2606.06788#bib.bib2); OpenAI, [2025](https://arxiv.org/html/2606.06788#bib.bib5)\)\. Readers of scientific language can vary in their preferred responses depending on their background \(e\.g\., a junior or senior researcher\) and particular context \(e\.g\., reading within a familiar or unfamiliar discipline\), such as preferring simpler or more complex summaries and explanations \(Guo et al\., [2021](https://arxiv.org/html/2606.06788#bib.bib19); August et al\., [2022](https://arxiv.org/html/2606.06788#bib.bib29); Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7)\)\.

Past evaluations have focused on models’ ability to generate responses that align with different envisioned audiences \(Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7)\), or to personalize a response to a given user \(Guo et al\., [2024b](https://arxiv.org/html/2606.06788#bib.bib23); Murthy et al\., [2022](https://arxiv.org/html/2606.06788#bib.bib20)\)\. However, a single response can be misaligned with a user, fail to incorporate correct context \(Fok et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib8)\), or be inherently insufficient even if correct\. For example, the design paradigm of details on demand \(Min et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib35)\) suggests that users may, after seeing an initial summary, want more details, even if they did not want a detailed first response\. While users might prompt a model for more details, direct manipulation of text \(e\.g\., a slider for text complexity\) can often be a more effective mechanism for end\-user control \(Zhang et al\., [2026](https://arxiv.org/html/2606.06788#bib.bib46)\)\.

Rather than asking if models can generate a single best scientific summary, in this paper we ask if models can generate multiple summaries that enable effective user selection and control\. We focus on language complexity \(e\.g\., jargon, information, Guo et al\., [2024a](https://arxiv.org/html/2606.06788#bib.bib51); Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7)\) in model responses, and validate our approach with a formative user study \(N=16\) where participants explored unfamiliar STEM topics using a prototype chat interface that enabled direct manipulation of response language complexity \(Fig\. [1](https://arxiv.org/html/2606.06788#S3.F1)\)\. Using an existing scientific QA benchmark \(Asai et al\., [2026](https://arxiv.org/html/2606.06788#bib.bib2)\), we test 5 recent models \(GPT\-5\.1, GPT\-5 mini, Claude Sonnet 4\.5, Claude Sonnet 4\.5 \+ Thinking, and DeepSeek\-V3\.1\) by generating 5 responses at different levels of response complexity to 98 scientific queries\. We anchor complexity levels in our prompt by using different envisioned audiences, similar to past work \(Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7); August et al\., [2022](https://arxiv.org/html/2606.06788#bib.bib29)\); see Sec\. [3\.1](https://arxiv.org/html/2606.06788#S3.SS1)\. We define model performance based on the relationship between versions of a response\.

We find that while models are able to vary complexity across responses, most changes remain inconsistent\. For example, the best\-performing model for jargon change \(Claude Sonnet 4\.5 \+ Thinking\) only changed jargon in the correct direction 46% of the time across the 5 responses for an individual query \(33% of the time for our measure of information, Sec\. [4\.3](https://arxiv.org/html/2606.06788#S4.SS3)\)\. Additionally, though models consistently increased complexity measures between the lower complexity levels, at higher levels the direction of change neared chance for most models\. We also show that our findings hold with increased sample size \(N=459\) and alternative audience labels\. In summary, our paper makes the following contributions:

1. A new evaluation framework that tests models’ ability to generate multiple, distinct versions of a response across a dimension of interest\. We instantiate the framework for scientific information seeking\.
2. A suite of three complexity measures motivated by prior work and grounded in a user study with 16 participants reading scientific literature in unfamiliar disciplines\. To evaluate models, we define a criterion for the relationship of complexity measures between versions of a response\.
3. Evaluation results from 5 models on scientific queries showing that models often fail to reliably adjust complexity measures in the correct direction across versions\. Our findings hold across models and when we shift anchors for version generation \(i\.e\., make anchors more distant from one another\)\.

## 2 Related Work

### 2\.1 LLMs for Information\-Seeking

As LLMs have become increasingly capable of quickly producing different versions of text \(Kirk et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib33); August et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib17); Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7); Wu et al\., [2023a](https://arxiv.org/html/2606.06788#bib.bib34)\), many LLM\-powered interfaces have been developed to help people search for papers \(Mudd et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib3)\), skim papers \(Fok et al\., [2023](https://arxiv.org/html/2606.06788#bib.bib15)\), aggregate information across multiple documents \(Singh et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib24); Whitfield and Hofmann, [2023](https://arxiv.org/html/2606.06788#bib.bib26)\), and understand content \(August et al\., [2023](https://arxiv.org/html/2606.06788#bib.bib18); Rust et al\., [2025b](https://arxiv.org/html/2606.06788#bib.bib25); Fok et al\., [2023](https://arxiv.org/html/2606.06788#bib.bib15); Rust et al\., [2025a](https://arxiv.org/html/2606.06788#bib.bib13); Fok et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib8)\)\. A popular option has been conversation\-based, question\-answering systems, such as Elicit \(Whitfield and Hofmann, [2023](https://arxiv.org/html/2606.06788#bib.bib26)\), ASTA \(Singh et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib24)\), and OpenAI’s Deep Research \(OpenAI, [2025](https://arxiv.org/html/2606.06788#bib.bib5)\)\.

### 2\.2 Adapting LLM Responses to Users

There have been two primary ways to adapt LLM responses to users: interactivity \(i\.e\., the user decides what to see\) \(Min et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib35); Sundar et al\., [2010](https://arxiv.org/html/2606.06788#bib.bib36); Färber et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib32); Head et al\., [2021](https://arxiv.org/html/2606.06788#bib.bib31)\) and personalization \(i\.e\., the system decides what the user sees\) \(Kim et al\., [2025b](https://arxiv.org/html/2606.06788#bib.bib57); Adar et al\., [2017](https://arxiv.org/html/2606.06788#bib.bib58); August et al\., [2022](https://arxiv.org/html/2606.06788#bib.bib29); Acharya et al\., [2018](https://arxiv.org/html/2606.06788#bib.bib30)\)\. Past evaluations of complexity adaptation have had an implicit personalization context: the goal was for the model to generate a single correct response, for example by testing whether LLMs could produce a response at a 5th grade reading level when prompted with “explain like I am a 5th grader” \(Beks van Raaij et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib37); Hedlin et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib14); Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7); Färber et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib32)\)\. We instead focus on an interactive context by evaluating how well models generate a range of responses, allowing users to choose between levels of complexity\. We explore this alternative framing because users may need to select their own version depending on their needs in the moment \(Fok et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib8)\) rather than on a static attribute \(e\.g\., a professor wanting a simple definition of a term in their field\)\. While users can prompt models for this information, specifying information needs iteratively can be time consuming and distracting \(Zamfirescu\-Pereira et al\., [2023](https://arxiv.org/html/2606.06788#bib.bib10)\)\.

## 3 Formative User Study

Because our motivation for analyzing language complexity is linked to an interactive context \(i\.e\., users directly manipulating language\), we start by conducting a formative study to structure our subsequent model evaluation \(Sec\.[4](https://arxiv.org/html/2606.06788#S4)\)\. Specifically, we aim to \(i\) validate the utility of interactive complexity, \(ii\) corroborate characteristics of language that participants associate with complexity, and \(iii\) identify criteria for model responses that enable successful interactive complexity\. In this section, we describe our user study design \(Sec\.[3\.1](https://arxiv.org/html/2606.06788#S3.SS1)\) and findings \(Sec\.[3\.2](https://arxiv.org/html/2606.06788#S3.SS2)\)\.

### 3\.1 Study Procedure

We conducted a within\-subjects study with 16 participants primarily from research and STEM backgrounds\. To test the utility of direct manipulation of response complexity, the study was counterbalanced between an interactive chat condition where participants used interactive complexity \(described below\) and a conventional chat condition\. To emulate the experience of exploring new topics with the interfaces, we asked each participant to provide two topics that they were interested in but had little to no knowledge about before the study\. In line with the motivation of information seeking in knowledge\-intensive domains, we required that the topics were in STEM\. During the study, participants had 15 minutes to interact with each interface and complete two simple information\-seeking tasks; details on instructions and tasks are provided in Appendix [A\.9](https://arxiv.org/html/2606.06788#A1.SS9)\. While interacting with each interface, participants were asked to actively verbalize their thought process, reasoning, and impressions\. After each condition, we asked a few questions about participants’ experience with and perceptions of the interface and its chat responses; the interview guide is provided in Appendix [A\.8](https://arxiv.org/html/2606.06788#A1.SS8)\. This study was approved by our institution’s IRB\.

#### Conditions

The interactive complexity condition provided users with a slider mechanism to adjust the language complexity of chat responses, choosing from 5 levels labeled 1 through 5, with 1 being the least complex \(Fig\. [1](https://arxiv.org/html/2606.06788#S3.F1)\)\. The conventional interface looked and functioned similarly, without the sliders\. All responses were generated using GPT\-5 mini, chosen as a recent model with low latency appropriate for an interactive context\.

![Refer to caption](https://arxiv.org/html/2606.06788v1/x1.png)

Figure 1: Interactive language complexity interface\. Users can manipulate textual complexity by moving the response slider \(A\) to a different notch\. Sentences that are significantly different from those in the previously displayed version are highlighted \(B\); significant differences are determined by comparing sentence\-level BERTScores \(Zhang et al\., [2020](https://arxiv.org/html/2606.06788#bib.bib41)\) to a threshold\. The preset default is 3, but there is also the option to change the default to apply to all sliders\.
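
The highlighting rule in the Figure 1 caption can be sketched as follows. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` is used as a lightweight stand-in for sentence-level BERTScore, the threshold value of 0.6 is arbitrary, and matching each sentence against its most similar sentence in the previous version is a guess at the comparison strategy, which the caption does not specify. The function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]. This is NOT BERTScore (assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def sentences_to_highlight(previous: str, current: str,
                           threshold: float = 0.6) -> list[str]:
    """Return sentences of `current` whose best match in `previous`
    falls below `threshold` (hypothetical value)."""
    prev_sentences = split_sentences(previous)
    flagged = []
    for sentence in split_sentences(current):
        # Best similarity to any sentence of the previously displayed version.
        best = max((similarity(sentence, p) for p in prev_sentences),
                   default=0.0)
        if best < threshold:
            flagged.append(sentence)
    return flagged
```

Swapping `similarity` for a real BERTScore call would leave the thresholding logic unchanged, which is why the similarity function is kept separate here.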
#### Response Generation

We used topics provided by participants to pregenerate scientific reports using a recent RAG\-based pipeline \(ScholarQA, Singh et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib24)\) with Claude Sonnet 4\.5\. Using these reports as a single ground\-truth document, we prompted GPT\-5 mini to generate the slider responses with audiences defined as College student, Junior Ph\.D\. student, Senior Ph\.D\. student, Postdoctoral researcher, and Senior researcher\. We based these levels on previous work that has stratified model responses by level of education \(Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7); [Science Journal for Kids,](https://arxiv.org/html/2606.06788#bib.bib49); August et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib17)\), adjusting the levels to be appropriate for the envisioned prompts \(i\.e\., scientific literature queries\)\. We use a single prompt to generate all 5 levels\. Appendix [A\.1](https://arxiv.org/html/2606.06788#A1.SS1) includes the prompt and describes alternative prompts we tested\.

#### Participants

We recruited 16 participants, primarily within academic institutions\. Due to our snowball sampling process, most participants held an academic affiliation and had a STEM background\. More participant details are provided in Appendix [A\.6](https://arxiv.org/html/2606.06788#A1.SS6)\. Studies were conducted over Zoom video calls and lasted 1 hour, after which all participants received a \$25 gift card, which is above the minimum wage in the area\.

#### Evaluation

To analyze the qualitative data from the study transcripts, we employ open coding \(Saldaña, [2021](https://arxiv.org/html/2606.06788#bib.bib73)\): one author created an initial codebook from 4 studies randomly sampled evenly between conditions, which all authors then iterated on until a final codebook was agreed upon\. Using the final codebook, the same author coded the remaining studies\. Since all authors consulted on the study coding, we did not calculate inter\-rater reliability \(McDonald et al\., [2019](https://arxiv.org/html/2606.06788#bib.bib74)\)\. The codes describe participants’ sentiments \(e\.g\., “Appreciates agency of interactive chat”\) and interactions \(e\.g\., “Prompted with desired complexity”\)\.

| Discipline | % of Data | Sample Query |
| --- | --- | --- |
| Biology | 20\.4 | What are the biochemical analytical tools to assess the integrity and stability of LNPs? |
| Biophysics | 9\.2 | Find some papers that discuss the methods to suppress multiple light scattering effect\. |
| Computer Science | 33\.7 | Could you please provide some references to work on multi\-document summarization? |
| Photonics | 30\.6 | What progress has been made in trapping and controlling multiple nanoparticles? |
| Physics | 6\.1 | What are the ways to perform optomechanical cooling? |

Table 1: Distribution of data across disciplines\. ScholarQA\-Multi consists of 98 queries distributed across Biology, Biophysics, Computer Science, Photonics, and Physics; a sample question from each domain is shown\.

### 3\.2 Results

#### Interactive complexity is valuable over a conventional chat interface

The majority of participants \(13/16\) appreciated the flexibility that interactive complexity provided\. Participants found the responses from the conventional chat interface inconsistent in terms of complexity, sometimes providing too much or too little information or jargon \(13/16\)\. In fact, 7 participants \(4 of whom had not yet seen the interactive condition\) ended up describing their preferred level of complexity in follow\-up prompts, indicating a strong desire for control over perceived complexity of responses\. As P16 describes: “with the \[conventional\] chatbot, it felt like there was a misalignment between how I was interpreting my level of understanding and how \[the model\] interpreted it\. So…I would manually adjust the prompt…The complexity slider was nice because it was just a very quick and easy way”\.

#### Jargon, information, and length influenced perceived complexity

As participants decreased complexity, they prioritized three primary, desired trends: decreases in jargon \(16/16\), amount of information \(13/16\), and length \(12/16\)\. This finding reinforces expectations that prior work has assumed \(August et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib17); Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7); Guo et al\., [2024a](https://arxiv.org/html/2606.06788#bib.bib51)\) and foregrounds the measures that we focus on in our model evaluation \(Sec\. [4](https://arxiv.org/html/2606.06788#S4)\)\.

#### Small changes in complexity are hard to perceive

While 5 levels of complexity seemed generally appropriate \(10/16\), 10 participants noted the levels needed to be more distinct from each other\. In particular, 6 participants noted that “one and two and four and five \[didn’t\] feel all that much different” \(P9\)\. This observation suggests that some responses did not strictly increase in complexity, conflicting with what users expect and need\.

## 4 Evaluating Interactive Complexity

Motivated by our formative findings, we test the potential for models to provide responses that enable interactive complexity control\. Specifically, we evaluate model performance based on the relationship between multiple responses, rather than on a single response\. Below we describe the data \(Sec\.[4\.1](https://arxiv.org/html/2606.06788#S4.SS1)\) and models \(Sec\.[4\.2](https://arxiv.org/html/2606.06788#S4.SS2)\) used in our evaluation, as well as our measures of complexity \(Sec\.[4\.3](https://arxiv.org/html/2606.06788#S4.SS3)\) and our criterion to facilitate effective user control of model responses \(Sec\.[4\.4](https://arxiv.org/html/2606.06788#S4.SS4)\)\.

### 4\.1 Scientific Query Data

We use the queries from ScholarQA\-Multi, a subset of ScholarQABench \(Asai et al\., [2026](https://arxiv.org/html/2606.06788#bib.bib2)\) that contains 98 expert\-written scientific queries and answers across multiple STEM fields \(Tab\. [1](https://arxiv.org/html/2606.06788#S3.T1)\)\. We restrict our evaluation to this subset so that the expert\-written answer reports can ground model responses in a single ground\-truth report \(similar to our user study\)\. This choice aligns with past work on the utility of human\-written context \(Tan et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib83); Zhang et al\., [2023](https://arxiv.org/html/2606.06788#bib.bib84)\) and avoids potential issues with reusing models for both initial ground truth generation and subsequent response generation \(Xu et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib85); Panickssery et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib86)\)\. Since this restriction leads to a relatively small sample, we also evaluate a larger sample of 459 queries with reports generated by the ScholarQA RAG pipeline \(Singh et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib24)\) using Claude Sonnet 4\.5 \(Sec\. [5\.4](https://arxiv.org/html/2606.06788#S5.SS4)\)\.

### 4\.2 Models

We prompt models the same way as for the formative user study \(Sec\. [3\.1](https://arxiv.org/html/2606.06788#S3.SS1)\): to generate 5 versions of the response, stratified for 5 envisioned audiences going from least to most complex: a College student, Junior Ph\.D\. student, Senior Ph\.D\. student, Postdoctoral researcher, and Senior researcher\. We explore alternative anchors in Sec\. [5\.5](https://arxiv.org/html/2606.06788#S5.SS5)\. We evaluate 5 recent models across different sizes, model families, and reasoning abilities: GPT\-5\.1, GPT\-5 mini, Claude Sonnet 4\.5, Claude Sonnet 4\.5 \+ Thinking, and DeepSeek\-V3\.1\. Details about model configurations are in Appendix [A\.3](https://arxiv.org/html/2606.06788#A1.SS3)\.

### 4\.3 Complexity Measures

Past work has quantified complexity through several linguistic measures\. These include readability formulas \(e\.g\., Flesch\-Kincaid score, Flesch, [1948](https://arxiv.org/html/2606.06788#bib.bib52)\), lexical features, and perplexity measures \(Guo et al\., [2024a](https://arxiv.org/html/2606.06788#bib.bib51); August et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib17); Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7)\)\. We take inspiration from these works and our formative study results \(Sec\. [3\.2](https://arxiv.org/html/2606.06788#S3.SS2)\) to curate a suite of three measures, each representing a different dimension of perceived language complexity:

- Jargon \(Guo et al\., [2024a](https://arxiv.org/html/2606.06788#bib.bib51); Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7)\): This measure denotes the proportion of text that consists of less familiar words\. We calculate the percentage of words in the text that are not on the Dale\-Chall Word List, a list of 3,000 familiar English words \(Chall and Dale, [1995](https://arxiv.org/html/2606.06788#bib.bib56)\)\.
- Information \(Trienes et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib82); Guo et al\., [2024a](https://arxiv.org/html/2606.06788#bib.bib51)\): More complex language often includes more information \(e\.g\., technical details\)\. We query GPT\-4\.1 to identify independent facts, using the pipeline for generating “atomic facts” from Min et al\. \([2023](https://arxiv.org/html/2606.06788#bib.bib55)\); we validate model performance by manually inspecting a subset of 25 examples\.
- Length \(Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7); Guo et al\., [2024a](https://arxiv.org/html/2606.06788#bib.bib51)\): More complex language is often longer, though simpler language can also be longer when the same information is elaborated upon \(Wu et al\., [2023b](https://arxiv.org/html/2606.06788#bib.bib81)\)\. We quantify length by the total number of response characters\.
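
The Jargon and Length measures listed above can be sketched in a few lines. This is a minimal sketch, not the evaluation code: `FAMILIAR_WORDS` is a tiny hypothetical stand-in for the Dale-Chall list, the regex tokenizer is an assumption (the text does not say how words are split), and the Information measure is omitted because it depends on an LLM-based atomic-fact pipeline.

```python
import re

# Hypothetical stand-in for the Dale-Chall list of 3,000 familiar words;
# a real evaluation would load the full list instead.
FAMILIAR_WORDS = {
    "the", "a", "of", "to", "and", "is", "in",
    "we", "use", "light", "cool", "small",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract alphabetic word tokens (assumed scheme)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def jargon_percent(text: str, familiar: set[str] = FAMILIAR_WORDS) -> float:
    """Percentage of words in `text` that are not on the familiar-word list."""
    words = tokenize(text)
    if not words:
        return 0.0
    unfamiliar = sum(1 for word in words if word not in familiar)
    return 100.0 * unfamiliar / len(words)


def length_chars(text: str) -> int:
    """Length measured as the total number of characters in the response."""
    return len(text)
```

For instance, with the stand-in list, "We use light to cool the nanoparticle" has seven words, one of which ("nanoparticle") is unfamiliar, so `jargon_percent` returns 100/7, roughly 14.3.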

The Flesch\-Kincaid Reading Ease Score is a commonly used scale for rating complexity \(Flesch, [1948](https://arxiv.org/html/2606.06788#bib.bib52); Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7); Ágústsdóttir et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib53); Färber et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib32)\)\. However, past work has shown that Flesch\-Kincaid scores do not provide a reliable measure of complexity \(Cachola et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib43); Tanprasert and Kauchak, [2021](https://arxiv.org/html/2606.06788#bib.bib44); Imperial and Tayyar Madabushi, [2023](https://arxiv.org/html/2606.06788#bib.bib42)\), and we found that our findings remain the same, as the scores exhibit similar trends to Jargon and Information\. Thus, we focus on the measures that we confirmed in our user study and provide the Flesch\-Kincaid data in Appendix [A\.4](https://arxiv.org/html/2606.06788#A1.SS4)\.

### 4\.4 Criterion for effective user control

Based on prior work \(Guo et al\., [2024a](https://arxiv.org/html/2606.06788#bib.bib51); Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7); August et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib17)\) and our formative user study, we define model performance based on a model’s ability to increase the measures as the intended complexity levels increase\. We operationalize this by evaluating the direction of changes in the measures between the 5 levels of text that models generate for each query\. Models that are better at generating distinct levels of complexity will produce positive changes in all measures\. Negative changes indicate that the model decreased complexity when it should have increased it\.
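
The criterion can be made concrete with a short sketch. It assumes each query yields a list of 5 values of one measure, ordered from the lowest to the highest intended complexity level, and it counts only strictly positive changes as correct; treating unchanged values as not correct is an assumption of this sketch. The function names and example numbers are illustrative, not taken from the paper.

```python
def transition_deltas(values: list[float]) -> list[float]:
    """Change in a measure between each pair of consecutive levels."""
    return [after - before for before, after in zip(values, values[1:])]


def percent_correct_per_transition(per_query: list[list[float]]) -> list[float]:
    """For each transition (1 to 2, 2 to 3, ...), the percent of queries
    where the measure strictly increases."""
    n_transitions = len(per_query[0]) - 1
    counts = [0] * n_transitions
    for values in per_query:
        for i, delta in enumerate(transition_deltas(values)):
            if delta > 0:
                counts[i] += 1
    return [100.0 * count / len(per_query) for count in counts]


def percent_increasing_across_all(per_query: list[list[float]]) -> float:
    """Percent of queries where the measure increases at every transition."""
    ok = sum(
        1 for values in per_query
        if all(delta > 0 for delta in transition_deltas(values))
    )
    return 100.0 * ok / len(per_query)


# Made-up jargon percentages for two queries across 5 levels.
jargon = [
    [10.0, 12.0, 15.0, 14.0, 18.0],  # drops between levels 3 and 4
    [8.0, 9.0, 11.0, 13.0, 16.0],    # increases at every transition
]
print(percent_correct_per_transition(jargon))  # [100.0, 100.0, 50.0, 100.0]
print(percent_increasing_across_all(jargon))   # 50.0
```

The per-transition view and the all-transitions view answer different questions: a model can score well at every individual transition and still rarely produce a full set of 5 levels that increases throughout.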

## 5 Results

![Refer to caption](https://arxiv.org/html/2606.06788v1/x2.png)

Figure 2: Model performance shown as changes in complexity measures\. Between consecutive levels of complexity, models produce changes in Jargon, Information, and Length that vary between increasing and decreasing, particularly for Jargon and Information\. Each point in the scatter overlay represents an input\. The three subsets \(e\.g\., “College → Sr\. Res\. \(n=98\)”\) correspond to the evaluation that used the listed audience range and sample size; “Senior researcher” is abbreviated to “Sr\. Res\.”\. That is, the sample size increases from “College → Sr\. Res\. \(n=98\)” to “College → Sr\. Res\. \(n=459\)”, while the audience range increases from “College → Sr\. Res\. \(n=98\)” to “Child → Expert \(n=98\)”\. Extreme outliers removed for visualization\.

Fig\. [2](https://arxiv.org/html/2606.06788#S5.F2) plots the change in each measure between consecutive levels of intended complexity that each model generates\. To quantify model performance, we report the percent of changes that move in the correct direction \(i\.e\., increasing Jargon, Information, or Length\) in Tab\. [3](https://arxiv.org/html/2606.06788#S5.T3)\. Because Fig\. [2](https://arxiv.org/html/2606.06788#S5.F2) and Tab\. [3](https://arxiv.org/html/2606.06788#S5.T3) display results separated by each transition between levels, we also report the overall performance across all 5 response levels in Tab\. [2](https://arxiv.org/html/2606.06788#S5.T2) \(i\.e\., the percent of inputs for which the generated 5 levels move in the correct direction across all transitions\)\. We confirm in Appendix [A\.5](https://arxiv.org/html/2606.06788#A1.SS5) that the distribution of the responses we report here reflects the response distribution from our user study\.

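| Audience range | Model | Jargon | Info\. | Length | All |
| --- | --- | --- | --- | --- | --- |
| College → Sr\. Res\. \(n=98\) | GPT\-5\.1 | 5\.10 | 31\.63 | 97\.96 | 3\.06 |
| College → Sr\. Res\. \(n=98\) | GPT\-5 mini | 13\.27 | 14\.29 | 76\.53 | 0\.00 |
| College → Sr\. Res\. \(n=98\) | Claude Sonnet 4\.5 | 32\.65 | 42\.86 | 100\.0 | 14\.29 |
| College → Sr\. Res\. \(n=98\) | \+ Thinking | 45\.92 | 32\.65 | 86\.73 | 17\.35 |
| College → Sr\. Res\. \(n=98\) | DeepSeek\-V3\.1 | 18\.37 | 24\.49 | 98\.98 | 6\.12 |
| College → Sr\. Res\. \(n=459\) | GPT\-5\.1 | 1\.96 | 26\.36 | 99\.56 | 0\.87 |
| College → Sr\. Res\. \(n=459\) | GPT\-5 mini | 3\.92 | 5\.88 | 69\.93 | 0\.22 |
| College → Sr\. Res\. \(n=459\) | Claude Sonnet 4\.5 | 15\.03 | 32\.90 | 100\.0 | 5\.45 |
| College → Sr\. Res\. \(n=459\) | \+ Thinking | 31\.59 | 28\.76 | 83\.66 | 4\.79 |
| College → Sr\. Res\. \(n=459\) | DeepSeek\-V3\.1 | 7\.19 | 7\.41 | 98\.47 | 0\.44 |
| Child → Expert \(n=98\) | GPT\-5\.1 | 18\.37 | 51\.02 | 100\.0 | 11\.22 |
| Child → Expert \(n=98\) | GPT\-5 mini | 26\.53 | 35\.71 | 95\.92 | 11\.22 |
| Child → Expert \(n=98\) | Claude Sonnet 4\.5 | 60\.20 | 45\.92 | 100\.0 | 33\.67 |
| Child → Expert \(n=98\) | \+ Thinking | 79\.59 | 40\.82 | 93\.88 | 31\.63 |
| Child → Expert \(n=98\) | DeepSeek\-V3\.1 | 46\.94 | 36\.73 | 97\.96 | 16\.33 |

Table 2: Model performance shown as percent of inputs where measures increase across all 5 levels\. These percentages represent how often models generate sets of levels that adhere to the desired increase in the complexity measures, so higher is better\. The “All” column shows the percent of inputs for which the model increases all three measures\.

| Transition | Model | Jargon | Info\. | Length |
| --- | --- | --- | --- | --- |
| 1 to 2 | GPT\-5\.1 | 94\.90 | 81\.63 | 100\.0 |
| 1 to 2 | GPT\-5 mini | 87\.76 | 72\.45 | 100\.0 |
| 1 to 2 | Claude Sonnet 4\.5 | 100\.0 | 80\.61 | 100\.0 |
| 1 to 2 | \+ Thinking | 98\.98 | 77\.55 | 100\.0 |
| 1 to 2 | DeepSeek\-V3\.1 | 98\.98 | 77\.55 | 100\.0 |
| 2 to 3 | GPT\-5\.1 | 54\.08 | 83\.67 | 100\.0 |
| 2 to 3 | GPT\-5 mini | 57\.14 | 74\.49 | 97\.96 |
| 2 to 3 | Claude Sonnet 4\.5 | 83\.67 | 80\.61 | 100\.0 |
| 2 to 3 | \+ Thinking | 88\.78 | 79\.59 | 100\.0 |
| 2 to 3 | DeepSeek\-V3\.1 | 69\.39 | 74\.49 | 100\.0 |
| 3 to 4 | GPT\-5\.1 | 55\.10 | 71\.43 | 100\.0 |
| 3 to 4 | GPT\-5 mini | 65\.31 | 57\.14 | 80\.61 |
| 3 to 4 | Claude Sonnet 4\.5 | 71\.42 | 78\.57 | 100\.0 |
| 3 to 4 | \+ Thinking | 77\.55 | 75\.51 | 98\.98 |
| 3 to 4 | DeepSeek\-V3\.1 | 51\.02 | 69\.39 | 98\.98 |
| 4 to 5 | GPT\-5\.1 | 27\.55 | 77\.55 | 97\.96 |
| 4 to 5 | GPT\-5 mini | 60\.20 | 69\.39 | 95\.92 |
| 4 to 5 | Claude Sonnet 4\.5 | 55\.10 | 83\.67 | 100\.0 |
| 4 to 5 | \+ Thinking | 76\.53 | 76\.53 | 86\.73 |
| 4 to 5 | DeepSeek\-V3\.1 | 60\.20 | 75\.51 | 100\.0 |

Table 3: Model performance per transition for College → Sr\. Res\. \(n=98\)\. Each model’s performance is shown as the percent of inputs where the measure goes in the correct direction at each transition\. A higher percentage means that the model performed better at that transition, by more often increasing complexity\.

### 5\.1 Models inconsistently increase complexity measures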

All models vary between increasing and decreasing complexity measures when generating multiple levels\. In Fig\. [2](https://arxiv.org/html/2606.06788#S5.F2), this can be seen where the distributions cover positive and negative changes for the same model and transition between levels, particularly in Jargon and Information\. For example, when transitioning from Level 4 to 5, Claude Sonnet 4\.5 increases Jargon for 55\.10% of the inputs \(Tab\. [3](https://arxiv.org/html/2606.06788#S5.T3)\)\. In other words, for the same transition, models can increase the complexity for some inputs while decreasing the complexity for others\. Fig\. [4](https://arxiv.org/html/2606.06788#S5.F4) shows an example where Jargon and Information both decrease when the complexity is supposed to increase\. The proportion of stagnant changes is at most 1\.02% for Jargon, 7\.14% for Information, and 0% for Length; thus we focus on the proportion of changes that decrease as a stronger indication of complexity going in the wrong direction\.
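
The split into increasing, stagnant, and decreasing changes described above can be tallied for one transition as follows. This is a sketch under assumptions: a change counts as stagnant only when the two values are exactly equal (the `tol` parameter is a hypothetical knob for loosening that), and the numbers are made up for illustration.

```python
from collections import Counter


def classify_change(delta: float, tol: float = 0.0) -> str:
    """Label one change in a measure between two consecutive levels."""
    if delta > tol:
        return "increase"
    if delta < -tol:
        return "decrease"
    return "stagnant"


def breakdown(before: list[float], after: list[float]) -> dict[str, float]:
    """Percent of inputs in each category for a single transition.
    `before[i]` and `after[i]` are the measure for input i at the
    lower and higher level, respectively."""
    labels = Counter(classify_change(b - a) for a, b in zip(before, after))
    total = len(before)
    return {
        key: 100.0 * labels[key] / total
        for key in ("increase", "stagnant", "decrease")
    }


# Made-up jargon percentages for four inputs at Levels 4 and 5.
level_4 = [10.0, 12.0, 9.0, 11.0]
level_5 = [12.0, 12.0, 8.0, 15.0]
print(breakdown(level_4, level_5))
# {'increase': 50.0, 'stagnant': 25.0, 'decrease': 25.0}
```

Because stagnant changes are rare in the reported results, the "increase" and "decrease" shares are close to complementary, which is why the decreasing share is the more informative signal of failure.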

### 5\.2 Models increase length with complexity

Unlike Jargon and Information, Length generally increases with increasing complexity, as is the intended trend\. This can be seen in Tab\. [2](https://arxiv.org/html/2606.06788#S5.T2), where all models increase length across the 5 levels for the majority of their inputs \(e\.g\., DeepSeek\-V3\.1 increases length across all levels for 98\.98% of the inputs\)\. However, an increase in length may not always indicate an increase in complexity, especially when Jargon and Information decrease\. When directly examining responses, we notice that cases where length increases on its own \(i\.e\., without increasing the other complexity measures\) can be indicative of elaborative simplification \(Wu et al\., [2023b](https://arxiv.org/html/2606.06788#bib.bib81)\); an example is shown in Fig\. [3](https://arxiv.org/html/2606.06788#S5.F3)\.
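The Table 2 criterion (a measure increases across all 5 levels, plus an “All” column requiring every measure to do so) can be sketched as follows. This is a minimal sketch under stated assumptions: “increase” is read as strictly increasing at every step, the measure keys and function names are hypothetical, and the toy scores are invented.

```python
def increases_across_levels(scores):
    """True if the measure strictly increases at every step between levels."""
    return all(lower < higher for lower, higher in zip(scores, scores[1:]))


def all_levels_row(inputs, measures=("jargon", "information", "length")):
    """Percent of inputs whose scores rise across all levels, per measure and jointly."""
    n = len(inputs)
    ok = {m: [increases_across_levels(x[m]) for x in inputs] for m in measures}
    row = {m: round(100 * sum(ok[m]) / n, 2) for m in measures}
    # "all": every measure rises across all levels for the same input.
    joint = sum(all(ok[m][i] for m in measures) for i in range(n))
    row["all"] = round(100 * joint / n, 2)
    return row


# Toy data (invented): two inputs, five levels each.
toy_inputs = [
    {"jargon": [1, 2, 3, 4, 5], "information": [2, 3, 4, 5, 6], "length": [10, 20, 30, 40, 50]},
    {"jargon": [1, 2, 2, 3, 4], "information": [2, 1, 3, 4, 5], "length": [10, 20, 30, 40, 50]},
]
print(all_levels_row(toy_inputs))
# {'jargon': 50.0, 'information': 50.0, 'length': 100.0, 'all': 50.0}
```

In the toy data the second input grows in length at every level while its jargon stalls and its information dips once, which mirrors the pattern discussed above where length alone rises.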

![Refer to caption](https://arxiv.org/html/2606.06788v1/x3.png)

Figure 3: Example of elaborative simplification\. These are snippets of responses generated by Claude Sonnet 4\.5 that are supposed to increase in complexity\. Between the two, Length increases, while Jargon and Information decrease\. We notice that the additional text in the Level 4 snippet explains in simple language what “meta\-reason” in the Level 3 snippet entails\.

![Refer to caption](https://arxiv.org/html/2606.06788v1/x4.png)

Figure 4: Example of text incorrectly decreasing in complexity\. Shown are two snippets of text generated by GPT\-5\.1 that are meant to increase in complexity from Level 2 to Level 3\. However, we qualitatively observe that the complexity decreases between analogous phrases \(e\.g\., “platform selection” in the Level 3 snippet reads simpler than “achieving robust, high\-titer expression in suitable production platforms” from Level 2\)\. Decreases in Jargon and Information reflect this observation; note that the measures represent the texts that the snippets are from, not the snippets in isolation, which explains why Length increases\.

### 5\.3 Models struggle to differentiate responses in later audience levels

In addition to the overall performance discussed in the previous two findings, we consider how the performance varies by transition\. We find that models increase complexity measures more often when transitioning from Level 1 \(College student\) to Level 2 \(Junior Ph\.D\.\) than for the later transitions\. This can be seen in Tab\. [3](https://arxiv.org/html/2606.06788#S5.T3), where all models have the highest proportion of inputs going in the correct direction of complexity in the “1 to 2” row compared to the later rows\. For example, in the Jargon column, GPT\-5 mini correctly increases complexity for 87\.76% of the inputs when going from Level 1 to 2, which is higher than 57\.14%, 65\.31%, and 60\.20% for the later transitions\.
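The per-transition view can be sketched as a small function that, for each adjacent pair of levels, reports the percent of inputs whose measure moves in the intended (upward) direction. The function name and the toy scores are hypothetical; the sketch assumes one numeric score per level for a single measure.

```python
def per_transition_percent(score_lists):
    """Percent of inputs whose score rises at each adjacent-level transition.

    `score_lists` holds one list of per-level scores per input, all the same length.
    """
    n_inputs = len(score_lists)
    n_levels = len(score_lists[0])
    result = {}
    for i in range(n_levels - 1):
        correct = sum(scores[i + 1] > scores[i] for scores in score_lists)
        result[f"{i + 1} to {i + 2}"] = round(100 * correct / n_inputs, 2)
    return result


# Toy scores (invented): four inputs, five levels each.
toy_scores = [
    [1, 2, 3, 4, 5],
    [1, 2, 1, 3, 2],
    [1, 3, 4, 2, 5],
    [2, 3, 5, 6, 4],
]
print(per_transition_percent(toy_scores))
# {'1 to 2': 100.0, '2 to 3': 75.0, '3 to 4': 75.0, '4 to 5': 50.0}
```

The toy output is shaped to resemble the reported trend, with the first transition scoring highest; it is not a reproduction of any value in Tab\. 3.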

### 5\.4 Increasing the sample size does not change these findings

The findings discussed thus far came from providing 98 queries and their expert\-written reports as input\. To test the effect of sample size, we evaluate a larger set of queries but with model\-generated reports as input\. We randomly sample 500 scientific queries from ScholarQABench \(excluding the 98 original queries\) and generate a report for each using the ScholarQA pipeline \(Singh et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib24)\) with a Claude Sonnet 4\.5 backend\. We use these reports as input to each model and use the same framework as our original evaluation\. We report results on 459 queries due to model refusal on some of the queries\. Our results show the same mix of increasing and decreasing complexity measures \(Fig\. [2](https://arxiv.org/html/2606.06788#S5.F2) and Tab\. [2](https://arxiv.org/html/2606.06788#S5.T2)\) along with the tendency to perform better when shifting from Level 1 to 2 than on the later transitions \(per\-transition performance is provided in Appendix [A\.2](https://arxiv.org/html/2606.06788#A1.SS2)\)\.
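The sampling step above could look roughly like the sketch below: draw 500 query ids that exclude the original 98, then drop queries that were refused. Everything here is illustrative: the query ids, the benchmark size of 2000, the seed, and the way refusals are marked are all assumptions, since the text does not specify them.

```python
import random


def sample_new_queries(all_ids, original_ids, k, seed=0):
    """Randomly pick k query ids that are not in the original evaluation set."""
    excluded = set(original_ids)
    pool = [q for q in all_ids if q not in excluded]
    return random.Random(seed).sample(pool, k)


def keep_answered(query_ids, refused_ids):
    """Drop refused queries; how a refusal is detected is left unspecified here."""
    refused = set(refused_ids)
    return [q for q in query_ids if q not in refused]


# Hypothetical ids; only the counts 98, 500, and 459 come from the text.
all_ids = [f"q{i}" for i in range(2000)]
original = all_ids[:98]
sampled = sample_new_queries(all_ids, original, k=500)
refused = sampled[:41]  # placeholder set: 500 - 459 = 41 queries dropped
kept = keep_answered(sampled, refused)
print(len(sampled), len(kept))  # 500 459
```

Fixing the seed keeps the sampled set reproducible across models, so that every model is evaluated on the same inputs.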

### 5\.5 Expanding audience levels can improve performance but exhibits the same trends

We chose the audiences “College student” through “Senior researcher” because the wording of the questions in ScholarQABench implied that the inquirer had at least a college education\. We investigate whether this choice of audience labels affected the similarity between levels by running an identical evaluation using a different set of audience labels\. Specifically, we use labels from a popular video series called “5 Levels” by WIRED \(https://www\.wired\.com/video/series/5\-levels/\) that focuses on communicating specialized concepts to 5 different audiences: Child, Teen, College student, Grad student, and Expert\. Results for this evaluation are included in Fig\. [2](https://arxiv.org/html/2606.06788#S5.F2) and Tab\. [2](https://arxiv.org/html/2606.06788#S5.T2)\. While complexity measures increase more often \(i\.e\., the percentages of model responses with complexity measures in the correct direction are higher, Tab\. [2](https://arxiv.org/html/2606.06788#S5.T2)\), the same trend of measures going in the wrong direction holds\. For example, for Information, the performance of GPT\-5\.1 increased from 31\.63% to 51\.02% with the change in audience labels \(Tab\. [2](https://arxiv.org/html/2606.06788#S5.T2)\); however, 51\.02% is still close to chance level\. Additionally, we observe the same trend where models differentiate between Levels 1 and 2 better than the later levels \(Tab\. [4](https://arxiv.org/html/2606.06788#S5.T4)\)\. This can be seen for Jargon when the performance of GPT\-5 mini decreases from 98\.98% in the first transition to 69\.39% in the last\.
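Because the evaluation only depends on the order of the audience labels, swapping label sets leaves the transition structure unchanged. The small sketch below shows how the five WIRED labels listed above map onto the four transitions named in the per-transition tables; the constant and function names are hypothetical.

```python
WIRED_LEVELS = ["Child", "Teen", "College student", "Grad student", "Expert"]


def adjacent_transitions(labels):
    """Pair each audience level with the next one, named like the per-transition tables."""
    return [
        (f"{i + 1} to {i + 2}", lower, higher)
        for i, (lower, higher) in enumerate(zip(labels, labels[1:]))
    ]


for name, lower, higher in adjacent_transitions(WIRED_LEVELS):
    print(f"{name}: {lower} -> {higher}")
# 1 to 2: Child -> Teen
# 2 to 3: Teen -> College student
# 3 to 4: College student -> Grad student
# 4 to 5: Grad student -> Expert
```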

| Transition | Model | Jargon | Info\. | Length |
| --- | --- | --- | --- | --- |
| 1 to 2 | GPT\-5\.1 | 100\.0 | 81\.63 | 100\.0 |
| | GPT\-5 mini | 98\.98 | 81\.63 | 98\.98 |
| | Claude Sonnet 4\.5 | 100\.0 | 87\.76 | 100\.0 |
| | Claude Sonnet 4\.5 \+ Thinking | 100\.0 | 91\.84 | 100\.0 |
| | DeepSeek\-V3\.1 | 97\.96 | 87\.76 | 97\.96 |
| 2 to 3 | GPT\-5\.1 | 95\.92 | 92\.86 | 100\.0 |
| | GPT\-5 mini | 75\.51 | 80\.61 | 100\.0 |
| | Claude Sonnet 4\.5 | 97\.96 | 80\.61 | 100\.0 |
| | Claude Sonnet 4\.5 \+ Thinking | 97\.96 | 81\.63 | 100\.0 |
| | DeepSeek\-V3\.1 | 100\.0 | 81\.63 | 100\.0 |
| 3 to 4 | GPT\-5\.1 | 63\.27 | 78\.57 | 100\.0 |
| | GPT\-5 mini | 65\.31 | 73\.47 | 98\.98 |
| | Claude Sonnet 4\.5 | 79\.59 | 83\.67 | 100\.0 |
| | Claude Sonnet 4\.5 \+ Thinking | 88\.78 | 84\.69 | 100\.0 |
| | DeepSeek\-V3\.1 | 78\.57 | 73\.47 | 100\.0 |
| 4 to 5 | GPT\-5\.1 | 36\.73 | 85\.71 | 100\.0 |
| | GPT\-5 mini | 69\.39 | 81\.63 | 97\.96 |
| | Claude Sonnet 4\.5 | 78\.57 | 83\.67 | 100\.0 |
| | Claude Sonnet 4\.5 \+ Thinking | 92\.86 | 71\.43 | 93\.88 |
| | DeepSeek\-V3\.1 | 64\.29 | 82\.65 | 100\.0 |

Table 4: Model performance per transition for Child → Expert \(n=98\)\. Each model’s performance is shown as the percent of inputs where the measure goes in the correct direction at each transition\. A higher percentage means that the model performed better at that transition, by more often increasing complexity according to these measures\.

## 6 Discussion & Conclusion

Enabling users to directly adjust the language of model responses allows LLM\-powered systems to better accommodate user needs\. While prompting remains the default for adjusting model responses, new malleable interfaces promise more direct user control beyond articulating information needs through language \(Min et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib35); Zhang et al\., [2026](https://arxiv.org/html/2606.06788#bib.bib46)\)\. However, while interfaces are empowering users to interact with models beyond prompting, evaluations of model responses remain fixed to the traditional chat interface\. In this paper, we propose the idea of evaluating models’ potential for powering interfaces beyond chat\. We do this by testing multiple responses relative to each other rather than single responses\. This approach has parallels with other recent trends in model evaluations, such as evaluating multi\-turn conversations \(Laban et al\., [2026](https://arxiv.org/html/2606.06788#bib.bib40)\), integrated or live use settings \(Mehta et al\., [2026](https://arxiv.org/html/2606.06788#bib.bib39); Bragg et al\., [2026a](https://arxiv.org/html/2606.06788#bib.bib38)\), and performance measures based on different intended audiences \(Joshi et al\., [2025](https://arxiv.org/html/2606.06788#bib.bib7)\)\. Through a formative user study, we establish that these interactive use cases can be desirable, specifically for controlling language complexity, and identify measures and a criterion for distinguishing between levels of complexity\.

To investigate how current models adhere to or stray from users’ expectations, we evaluate model\-generated levels of complexity for a dataset of scientific questions\. We show that evaluating models across levels of complexity reveals model weaknesses that would be difficult to identify in single response evaluations\. While models tend to increase length with complexity levels, they frequently decrease jargon and information, suggesting that models often neglect other attributes that are important for perceived complexity and end\-user control\. This finding holds even when increasing the sample size and extending complexity response anchors to encompass a wider range of intended audiences\.

## Limitations

We evaluated models using linguistic measures that were strongly supported by prior work and our formative study\. However, there are other potential measures for complexity that could go beyond linguistic characteristics towards the content of the text, such as elaborative simplification and analogies\. Thus, one avenue for future work would be to quantify and evaluate the impact of these additional measures\.

Additionally, the questions in ScholarQABench are expert\-written, scientific questions \(Tab\.[1](https://arxiv.org/html/2606.06788#S3.T1)\)\. As a result, the content of the question may not always match the audience, particularly when we tested the WIRED audiences \(e\.g\., a child is unlikely to ask about “ways to perform optomechanical cooling”\)\. We did test a range of audiences that would be more likely to pose such questions \(College student to Senior researcher\); however, another option for future work could be to try queries that vary by audience\.

Lastly, we establish that models do not consistently increase complexity, causing the differences between generated levels of complexity to fluctuate\. However, even if models did follow the correct direction of complexity, we do not know exactly how much of a difference between levels there needs to be for a user to notice\. Thus, an interesting follow\-up work would be to quantify that difference by performing a human\-centered evaluation on multiple versions of text that vary systematically in one or more complexity measures\.

## Ethical Considerations

This work uses LLMs to generate and evaluate responses\. In addition to their impact on the environment \(Ren et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib76); Desislavov et al\., [2023](https://arxiv.org/html/2606.06788#bib.bib77)\), LLMs can hallucinate and exhibit biases that affect the information they generate \(Venkit et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib78); Sharma et al\., [2024](https://arxiv.org/html/2606.06788#bib.bib79); Zhou and Di Eugenio, [2025](https://arxiv.org/html/2606.06788#bib.bib80)\)\. At the same time, we believe that the contribution of this work towards making information from knowledge\-intensive domains accessible to people of varying expertise is valuable\.

Additionally, we focus on English text; however, this does not necessarily account for contexts with ESL learners who may have experiences that impact scientific information seeking\.

## References

- S\. Acharya, B\. Di Eugenio, A\. Boyd, R\. Cameron, K\. Dunn Lopez, P\. Martyn\-Nemeth, C\. Dickens, and A\. Ardati \(2018\)Towards generating personalized hospitalization summaries\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop,S\. R\. Cordeiro, S\. Oraby, U\. Pavalanathan, and K\. Rim \(Eds\.\),New Orleans, Louisiana, USA,pp\. 74–82\.External Links:[Link](https://aclanthology.org/N18-4011/),[Document](https://dx.doi.org/10.18653/v1/N18-4011)Cited by:[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1)\.
- PersaLog: personalization of news article content\.InProceedings of the 2017 CHI Conference on Human Factors in Computing Systems,CHI ’17,New York, NY, USA,pp\. 3188–3200\.External Links:ISBN 9781450346559,[Link](https://doi.org/10.1145/3025453.3025631),[Document](https://dx.doi.org/10.1145/3025453.3025631)Cited by:[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1)\.
- D\. H\. Ágústsdóttir, J\. Rosenberg, and J\. J\. Baker \(2025\)ChatGPT‐4o compared with human researchers in writing plain‐language summaries for cochrane reviews: a blinded, randomized non‐inferiority controlled trial\.Cochrane Evidence Synthesis and Methods3\(4\)\.External Links:[Document](https://dx.doi.org/10.1002/cesm.70037)Cited by:[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p2.1)\.
- A\. Asai, J\. He, R\. Shao, W\. Shi, A\. Singh, J\. C\. Chang, K\. Lo, L\. Soldaini, S\. Feldman, D\. Mike, D\. Wadden, M\. Latzke, M\. Tian, P\. Ji, S\. Liu, H\. Tong, B\. Wu, Y\. Xiong, L\. Zettlemoyer, D\. Weld, G\. Neubig, D\. Downey, W\. Yih, P\. W\. Koh, and H\. Hajishirzi \(2026\)Synthesizing scientific literature with retrieval\-augmented language models\.Nature,pp\. 857–863\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-10072-4)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p2.1),[§1](https://arxiv.org/html/2606.06788#S1.p4.4),[§4\.1](https://arxiv.org/html/2606.06788#S4.SS1.p1.2)\.
- T\. August, K\. Lo, N\. A\. Smith, and K\. Reinecke \(2024\)Know your audience: the benefits and pitfalls of generating plain language summaries beyond the "general" audience\.InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems,CHI ’24,New York, NY, USA\.External Links:ISBN 9798400703300,[Link](https://doi.org/10.1145/3613904.3642289),[Document](https://dx.doi.org/10.1145/3613904.3642289)Cited by:[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.06788#S3.SS1.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.06788#S3.SS2.SSS0.Px2.p1.1),[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p1.1),[§4\.4](https://arxiv.org/html/2606.06788#S4.SS4.p1.1)\.
- T\. August, K\. Reinecke, and N\. A\. Smith \(2022\)Generating scientific definitions with controllable complexity\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 8298–8317\.External Links:[Link](https://aclanthology.org/2022.acl-long.569/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.569)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1),[footnote 1](https://arxiv.org/html/2606.06788#footnote1)\.
- T\. August, L\. L\. Wang, J\. Bragg, M\. A\. Hearst, A\. Head, and K\. Lo \(2023\)Paper plain: making medical research papers approachable to healthcare consumers with natural language processing\.ACM Trans\. Comput\.\-Hum\. Interact\.30\(5\)\.External Links:ISSN 1073\-0516,[Link](https://doi.org/10.1145/3589955),[Document](https://dx.doi.org/10.1145/3589955)Cited by:[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1)\.
- N\. Beks van Raaij, D\. Kolkman, and K\. Podoynitsyna \(2024\)Clearer governmental communication: text simplification with ChatGPT evaluated by quantitative and qualitative research\.InProceedings of the Workshop on DeTermIt\! Evaluating Text Difficulty in a Multilingual Context @ LREC\-COLING 2024,G\. M\. D\. Nunzio, F\. Vezzani, L\. Ermakova, H\. Azarbonyad, and J\. Kamps \(Eds\.\),Torino, Italia,pp\. 152–178\.External Links:[Link](https://aclanthology.org/2024.determit-1.15/)Cited by:[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1)\.
- J\. Bragg, M\. D’Arcy, N\. Balepur, D\. Bareket, B\. Dalvi, S\. Feldman, D\. Haddad, J\. D\. Hwang, P\. Jansen, V\. Kishore, B\. P\. Majumder, A\. Naik, S\. Rahamimov, K\. Richardson, A\. Singh, H\. Surana, A\. Tiktinsky, R\. Vasu, G\. Wiener, C\. Anastasiades, S\. Candra, J\. Dunkelberger, D\. Emery, R\. Evans, M\. Hamada, R\. Huff, R\. Kinney, M\. Latzke, J\. Lochner, R\. Lozano\-Aguilera, C\. Nguyen, S\. Rao, A\. Tanaka, B\. Vlahos, P\. Clark, D\. Downey, Y\. Goldberg, A\. Sabharwal, and D\. S\. Weld \(2026a\)AstaBench: rigorous benchmarking of ai agents with a scientific research suite\.External Links:2510\.21652,[Link](https://arxiv.org/abs/2510.21652)Cited by:[§6](https://arxiv.org/html/2606.06788#S6.p1.1)\.
- J\. Bragg, M\. D’Arcy, N\. Balepur, D\. Bareket, B\. D\. Mishra, S\. Feldman, D\. Haddad, J\. D\. Hwang, P\. Jansen, V\. Kishore, B\. P\. Majumder, A\. Naik, S\. Rahamimov, K\. Richardson, A\. Singh, H\. Surana, A\. Tiktinsky, R\. Vasu, G\. Wiener, C\. Anastasiades, S\. Candra, J\. Dunkelberger, D\. Emery, R\. Evans, M\. Hamada, R\. Huff, R\. Kinney, M\. Latzke, J\. Lochner, R\. Lozano\-Aguilera, N\. Nguyen, S\. Rao, A\. Tanaka, B\. Vlahos, P\. Clark, D\. Downey, Y\. Goldberg, A\. Sabharwal, and D\. S\. Weld \(2026b\)AstaBench: rigorous benchmarking of AI agents with a scientific research suite\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=M7TNf5J26u)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p1.1)\.
- I\. Cachola, D\. Khashabi, and M\. Dredze \(2025\)Evaluating the evaluators: are readability metrics good measures of readability?\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 24011–24027\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1225/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1225),ISBN 979\-8\-89176\-332\-6Cited by:[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p2.1)\.
- J\. S\. Chall and E\. Dale \(1995\)Readability revisited : the new dale\-chall readability formula\.External Links:[Link](https://api.semanticscholar.org/CorpusID:61078711)Cited by:[1st item](https://arxiv.org/html/2606.06788#S4.I1.i1.p1.1)\.
- R\. Desislavov, F\. Martínez\-Plumed, and J\. Hernández\-Orallo \(2023\)Trends in ai inference energy consumption: beyond the performance\-vs\-parameter laws of deep learning\.Sustainable Computing: Informatics and Systems38,pp\. 100857\.External Links:ISSN 2210\-5379,[Document](https://dx.doi.org/10.1016/j.suscom.2023.100857)Cited by:[Ethical Considerations](https://arxiv.org/html/2606.06788#Sx2.p1.1)\.
- M\. Färber, P\. Aghdam, K\. Im, M\. Tawfelis, and H\. Ghoshal \(2025\)SimplifyMyText: an llm\-based system for inclusive plain language text simplification\.InAdvances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025, Proceedings, Part IV,Berlin, Heidelberg,pp\. 418–424\.External Links:ISBN 978\-3\-031\-88716\-1Cited by:[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1),[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p2.1)\.
- R\. Flesch \(1948\)A new readability yardstick\.Journal of Applied Psychology32\(3\),pp\. 221–233\.External Links:[Document](https://dx.doi.org/10.1037/h0057532)Cited by:[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p1.1),[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p2.1)\.
- R\. Fok, J\. C\. Chang, T\. August, A\. X\. Zhang, and D\. S\. Weld \(2024\)Qlarify: recursively expandable abstracts for dynamic information retrieval over scientific papers\.InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technology,UIST ’24,New York, NY, USA\.External Links:ISBN 9798400706288,[Link](https://doi.org/10.1145/3654777.3676397),[Document](https://dx.doi.org/10.1145/3654777.3676397)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1)\.
- R\. Fok, H\. Kambhamettu, L\. Soldaini, J\. Bragg, K\. Lo, M\. Hearst, A\. Head, and D\. S\. Weld \(2023\)Scim: intelligent skimming support for scientific papers\.InProceedings of the 28th International Conference on Intelligent User Interfaces,IUI ’23,New York, NY, USA,pp\. 476–490\.External Links:ISBN 9798400701061,[Link](https://doi.org/10.1145/3581641.3584034),[Document](https://dx.doi.org/10.1145/3581641.3584034)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1)\.
- Y\. Guo, T\. August, G\. Leroy, T\. Cohen, and L\. L\. Wang \(2024a\)APPLS: evaluating evaluation metrics for plain language summarization\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 9194–9211\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.519/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.519)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p4.4),[§3\.2](https://arxiv.org/html/2606.06788#S3.SS2.SSS0.Px2.p1.1),[1st item](https://arxiv.org/html/2606.06788#S4.I1.i1.p1.1),[2nd item](https://arxiv.org/html/2606.06788#S4.I1.i2.p1.1),[3rd item](https://arxiv.org/html/2606.06788#S4.I1.i3.p1.1),[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p1.1),[§4\.4](https://arxiv.org/html/2606.06788#S4.SS4.p1.1)\.
- Y\. Guo, J\. C\. Chang, M\. Antoniak, E\. Bransom, T\. Cohen, L\. Wang, and T\. August \(2024b\)Personalized jargon identification for enhanced interdisciplinary communication\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 4535–4550\.External Links:[Link](https://aclanthology.org/2024.naacl-long.255/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.255)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p3.1)\.
- Y\. Guo, W\. Qiu, Y\. Wang, and T\. Cohen \(2021\)Automated lay language summarization of biomedical scientific reviews\.Proceedings of the AAAI Conference on Artificial Intelligence35\(1\),pp\. 160–168\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/16089),[Document](https://dx.doi.org/10.1609/aaai.v35i1.16089)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p2.1)\.
- A\. Head, K\. Lo, D\. Kang, R\. Fok, S\. Skjonsberg, D\. S\. Weld, and M\. A\. Hearst \(2021\)Augmenting scientific papers with just\-in\-time, position\-sensitive definitions of terms and symbols\.InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems,CHI ’21,New York, NY, USA\.External Links:ISBN 9781450380966,[Link](https://doi.org/10.1145/3411764.3445648),[Document](https://dx.doi.org/10.1145/3411764.3445648)Cited by:[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1)\.
- E\. Hedlin, L\. Estling, J\. Wong, C\. Demmans Epp, and O\. Viberg \(2025\)Got it\! prompting readability using chatgpt to enhance academic texts for diverse learning needs\.InProceedings of the 15th International Learning Analytics and Knowledge Conference,LAK ’25,New York, NY, USA,pp\. 115–125\.External Links:ISBN 9798400707018,[Link](https://doi.org/10.1145/3706468.3706483),[Document](https://dx.doi.org/10.1145/3706468.3706483)Cited by:[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1)\.
- J\. M\. Imperial and H\. Tayyar Madabushi \(2023\)Flesch or fumble? evaluating readability standard alignment of instruction\-tuned language models\.InProceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics \(GEM\),S\. Gehrmann, A\. Wang, J\. Sedoc, E\. Clark, K\. Dhole, K\. R\. Chandu, E\. Santus, and H\. Sedghamiz \(Eds\.\),Singapore,pp\. 205–223\.External Links:[Link](https://aclanthology.org/2023.gem-1.18/)Cited by:[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p2.1)\.
- B\. Joshi, K\. He, S\. Ramnath, S\. Sabouri, K\. Zhou, S\. Chattopadhyay, S\. Swayamdipta, and X\. Ren \(2025\)ELI\-why: evaluating the pedagogical utility of language model explanations\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 25466–25499\.External Links:[Link](https://aclanthology.org/2025.findings-acl.1306/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1306),ISBN 979\-8\-89176\-256\-5Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p2.1),[§1](https://arxiv.org/html/2606.06788#S1.p3.1),[§1](https://arxiv.org/html/2606.06788#S1.p4.4),[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2606.06788#S3.SS1.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.06788#S3.SS2.SSS0.Px2.p1.1),[1st item](https://arxiv.org/html/2606.06788#S4.I1.i1.p1.1),[3rd item](https://arxiv.org/html/2606.06788#S4.I1.i3.p1.1),[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p1.1),[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p2.1),[§4\.4](https://arxiv.org/html/2606.06788#S4.SS4.p1.1),[§6](https://arxiv.org/html/2606.06788#S6.p1.1),[footnote 1](https://arxiv.org/html/2606.06788#footnote1)\.
- N\. Joshi and D\. Vogel \(2026\)Designing and evaluating ai margin notes in document reader software\.InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems,CHI ’26,New York, NY, USA\.External Links:ISBN 9798400722783,[Link](https://doi.org/10.1145/3772318.3790786),[Document](https://dx.doi.org/10.1145/3772318.3790786)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p1.1)\.
- S\. S\. Y\. Kim, J\. W\. Vaughan, Q\. V\. Liao, T\. Lombrozo, and O\. Russakovsky \(2025a\)Fostering appropriate reliance on large language models: the role of explanations, sources, and inconsistencies\.InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems,CHI ’25,New York, NY, USA\.External Links:ISBN 9798400713941,[Link](https://doi.org/10.1145/3706598.3714020),[Document](https://dx.doi.org/10.1145/3706598.3714020)Cited by:[§A\.6](https://arxiv.org/html/2606.06788#A1.SS6.p1.2)\.
- T\. Kim, D\. Agarwal, J\. Ackerman, and M\. Saha \(2025b\)Steering ai\-driven personalization of scientific text for general audiences\.Proc\. ACM Hum\.\-Comput\. Interact\.9\(7\)\.External Links:[Link](https://doi.org/10.1145/3757660),[Document](https://dx.doi.org/10.1145/3757660)Cited by:[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1)\.
- H\. R\. Kirk, B\. Vidgen, P\. Röttger, and S\. A\. Hale \(2024\)The benefits, risks and bounds of personalizing the alignment of large language models to individuals\.Nature Machine Intelligence6\(4\),pp\. 383–392\.Cited by:[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1)\.
- P\. Laban, H\. Hayashi, Y\. Zhou, and J\. Neville \(2026\)LLMs get lost in multi\-turn conversation\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=VKGTGGcwl6)Cited by:[§6](https://arxiv.org/html/2606.06788#S6.p1.1)\.
- M\. Lee, K\. I\. Gero, J\. J\. Y\. Chung, S\. B\. Shum, V\. Raheja, H\. Shen, S\. Venugopalan, T\. Wambsganss, D\. Zhou, E\. A\. Alghamdi, T\. August, A\. Bhat, M\. Z\. Choksi, S\. Dutta, J\. L\.C\. Guo, M\. N\. Hoque, Y\. Kim, S\. Knight, S\. P\. Neshaei, A\. Shibani, D\. Shrivastava, L\. Shroff, A\. Sergeyuk, J\. Stark, S\. Sterman, S\. Wang, A\. Bosselut, D\. Buschek, J\. C\. Chang, S\. Chen, M\. Kreminski, J\. Park, R\. Pea, E\. H\. R\. Rho, Z\. Shen, and P\. Siangliulue \(2024\)A design space for intelligent and interactive writing assistants\.InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems,CHI ’24,New York, NY, USA\.External Links:ISBN 9798400703300,[Link](https://doi.org/10.1145/3613904.3642697),[Document](https://dx.doi.org/10.1145/3613904.3642697)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p1.1)\.
- Z\. Liao, M\. Antoniak, I\. Cheong, E\. Y\. Cheng, A\. Lee, K\. Lo, J\. C\. Chang, and A\. X\. Zhang \(2025\)LLMs as research tools: a large scale survey of researchers’ usage and perceptions\.InProceedings of the 2nd Conference on Language Modeling \(COLM\),External Links:[Link](https://arxiv.org/abs/2411.05025)Cited by:[§A\.6](https://arxiv.org/html/2606.06788#A1.SS6.p2.2)\.
- K\. Lo, J\. C\. Chang, A\. Head, J\. Bragg, A\. X\. Zhang, C\. Trier, C\. Anastasiades, T\. August, R\. Authur, D\. Bragg, E\. Bransom, I\. Cachola, S\. Candra, Y\. Chandrasekhar, Y\. Chen, E\. \(\. Cheng, Y\. Chou, D\. Downey, R\. Evans, R\. Fok, F\.Q\. Hu, R\. Huff, D\. Kang, T\. S\. Kim, R\. M\. Kinney, A\. Kittur, H\. B\. Kang, E\. Klevak, B\. Kuehl, M\. Langan, M\. Latzke, J\. Lochner, K\. MacMillan, E\. Marsh, T\. Murray, A\. Naik, N\. Nguyen, S\. Palani, S\. Park, C\. Paulic, N\. Rachatasumrit, S\. Rao, P\. L\. Sayre, Z\. Shen, P\. Siangliulue, L\. Soldaini, H\. Tran, M\. van Zuylen, L\. L\. Wang, C\. Wilhelm, C\. M\. Wu, J\. Yang, A\. Zamarron, M\. A\. Hearst, and D\. S\. Weld \(2023\)The semantic reader project: augmenting scholarly documents through ai\-powered interactive reading interfaces\.InCommunications of the ACM \(CACM\),Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p2.1)\.
- N\. McDonald, S\. Schoenebeck, and A\. Forte \(2019\)Reliability and inter\-rater reliability in qualitative research: norms and guidelines for cscw and hci practice\.Proc\. ACM Hum\.\-Comput\. Interact\.3\(CSCW\)\.External Links:[Link](https://doi.org/10.1145/3359174),[Document](https://dx.doi.org/10.1145/3359174)Cited by:[§3\.1](https://arxiv.org/html/2606.06788#S3.SS1.SSS0.Px4.p1.1)\.
- S\. Mehta, L\. Ritchie, S\. Garre, I\. Niebres, N\. Heiner, and E\. Chen \(2026\)EnterpriseBench corecraft: training generalizable agents on high\-fidelity rl environments\.External Links:2602\.16179,[Link](https://arxiv.org/abs/2602.16179)Cited by:[§6](https://arxiv.org/html/2606.06788#S6.p1.1)\.
- B\. Min, A\. Chen, Y\. Cao, and H\. Xia \(2025\)Malleable overview\-detail interfaces\.InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems,CHI ’25,New York, NY, USA\.External Links:ISBN 9798400713941,[Link](https://doi.org/10.1145/3706598.3714164),[Document](https://dx.doi.org/10.1145/3706598.3714164)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1),[§6](https://arxiv.org/html/2606.06788#S6.p1.1)\.
- S\. Min, K\. Krishna, X\. Lyu, M\. Lewis, W\. Yih, P\. Koh, M\. Iyyer, L\. Zettlemoyer, and H\. Hajishirzi \(2023\)FActScore: fine\-grained atomic evaluation of factual precision in long form text generation\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 12076–12100\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.741/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.741)Cited by:[2nd item](https://arxiv.org/html/2606.06788#S4.I1.i2.p1.1)\.
- A\. Mudd, T\. Conroy, S\. L\. Voldbjerg, A\. Goldschmied, R\. Feo, and L\. Schuwirth \(2025\)Developing and evaluating the use of chatgpt as a screening tool for nurses conducting structured literature reviews: proof of concept study results\.Journal of Clinical Nursing,pp\. 1–13\.External Links:[Document](https://dx.doi.org/10.1111/jocn.17818),[Link](https://onlinelibrary.wiley.com/doi/abs/10.1111/jocn.17818)Cited by:[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1)\.
- S\. K\. Murthy, K\. Lo, D\. King, C\. Bhagavatula, B\. Kuehl, S\. Johnson, J\. Borchardt, D\. S\. Weld, T\. Hope, and D\. Downey \(2022\)ACCoRD: a multi\-document approach to generating diverse descriptions of scientific concepts\.InEMNLP: System Demonstrations,pp\. 200–213\.Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p3.1)\.
- OpenAI \(2025\)ChatGPT with Deep Research\.External Links:[Link](https://chat.openai.com/)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1)\.
- A\. Panickssery, S\. R\. Bowman, and S\. Feng \(2024\)LLM evaluators recognize and favor their own generations\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 68772–68802\.External Links:[Document](https://dx.doi.org/10.52202/079017-2197),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/7f1f0218e45f5414c79c0679633e47bc-Paper-Conference.pdf)Cited by:[§4\.1](https://arxiv.org/html/2606.06788#S4.SS1.p1.2)\.
- S\. Ren, B\. Tomlinson, R\. W\. Black, and A\. W\. Torrance \(2024\)Reconciling the contrasting narratives on the environmental impact of large language models\.Scientific Reports14\(1\),pp\. 26310\(en\)\.External Links:ISSN 2045\-2322,[Document](https://dx.doi.org/10.1038/s41598-024-76682-6)Cited by:[Ethical Considerations](https://arxiv.org/html/2606.06788#Sx2.p1.1)\.
- P\. Rust, J\. Frings, S\. Meister, and L\. Fehring \(2025a\)Evaluation of a large language model to simplify discharge summaries and provide cardiological lifestyle recommendations\.Communications Medicine5\(1\),pp\. 208\.External Links:ISSN 2730\-664X,[Link](https://doi.org/10.1038/s43856-025-00927-2),[Document](https://dx.doi.org/10.1038/s43856-025-00927-2)Cited by:[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1)\.
- P\. Rust, J\. Frings, S\. Meister, and L\. Fehring \(2025b\)Evaluation of a large language model to simplify discharge summaries and provide cardiological lifestyle recommendations\.Communications Medicine5\(1\),pp\. 208\.External Links:ISSN 2730\-664X,[Document](https://dx.doi.org/10.1038/s43856-025-00927-2),[Link](https://doi.org/10.1038/s43856-025-00927-2)Cited by:[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1)\.
- J\. Saldaña \(2021\)The coding manual for qualitative researchers\.Fourth Edition edition,Sage Publications\.External Links:ISBN 978\-1\-5297\-3174\-3Cited by:[§3\.1](https://arxiv.org/html/2606.06788#S3.SS1.SSS0.Px4.p1.1)\.
- \[45\]Science Journal for KidsScience journal for kids and teens\.External Links:[Link](https://www.sciencejournalforkids.org/)Cited by:[§3\.1](https://arxiv.org/html/2606.06788#S3.SS1.SSS0.Px2.p1.1)\.
- N\. Sharma, Q\. V\. Liao, and Z\. Xiao \(2024\)Generative echo chamber? effect of llm\-powered search systems on diverse information seeking\.InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems,CHI ’24,New York, NY, USA\.External Links:ISBN 9798400703300,[Link](https://doi.org/10.1145/3613904.3642459),[Document](https://dx.doi.org/10.1145/3613904.3642459)Cited by:[Ethical Considerations](https://arxiv.org/html/2606.06788#Sx2.p1.1)\.
- A\. Singh, J\. C\. Chang, D\. Haddad, A\. Naik, J\. D\. Hwang, R\. Kinney, D\. S\. Weld, D\. Downey, and S\. Feldman \(2025\)Ai2 scholar QA: organized literature synthesis with attribution\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),P\. Mishra, S\. Muresan, and T\. Yu \(Eds\.\),Vienna, Austria,pp\. 513–523\.External Links:[Link](https://aclanthology.org/2025.acl-demo.49/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-demo.49),ISBN 979\-8\-89176\-253\-4Cited by:[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.06788#S3.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.06788#S4.SS1.p1.2),[§5\.4](https://arxiv.org/html/2606.06788#S5.SS4.p1.4)\.
- S\. S\. Sundar, Q\. Xu, and S\. Bellur \(2010\)Designing interactivity in media interfaces: a communications perspective\.InProceedings of the SIGCHI Conference on Human Factors in Computing Systems,CHI ’10,New York, NY, USA,pp\. 2247–2256\.External Links:ISBN 9781605589299,[Link](https://doi.org/10.1145/1753326.1753666),[Document](https://dx.doi.org/10.1145/1753326.1753666)Cited by:[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1)\.
- H\. Tan, F\. Sun, W\. Yang, Y\. Wang, Q\. Cao, and X\. Cheng \(2024\)Blinded by generated contexts: how language models merge generated and retrieved contexts when knowledge conflicts?\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 6207–6227\.External Links:[Link](https://aclanthology.org/2024.acl-long.337/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.337)Cited by:[§4\.1](https://arxiv.org/html/2606.06788#S4.SS1.p1.2)\.
- T\. Tanprasert and D\. Kauchak \(2021\)Flesch\-kincaid is not a text simplification evaluation metric\.InProceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics \(GEM\),A\. Bosselut, E\. Durmus, V\. P\. Gangal, S\. Gehrmann, Y\. Jernite, L\. Perez\-Beltrachini, S\. Shaikh, and W\. Xu \(Eds\.\),Online,pp\. 1–14\.External Links:[Link](https://aclanthology.org/2021.gem-1.1/),[Document](https://dx.doi.org/10.18653/v1/2021.gem-1.1)Cited by:[§4\.3](https://arxiv.org/html/2606.06788#S4.SS3.p2.1)\.
- J\. Trienes, S\. Joseph, J\. Schlötterer, C\. Seifert, K\. Lo, W\. Xu, B\. Wallace, and J\. J\. Li \(2024\)InfoLossQA: characterizing and recovering information loss in text simplification\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 4263–4294\.External Links:[Link](https://aclanthology.org/2024.acl-long.234/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.234)Cited by:[2nd item](https://arxiv.org/html/2606.06788#S4.I1.i2.p1.1)\.
- P\. N\. Venkit, T\. Chakravorti, V\. Gupta, H\. Biggs, M\. Srinath, K\. Goswami, S\. Rajtmajer, and S\. Wilson \(2024\)An audit on the perspectives and challenges of hallucinations in NLP\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 6528–6548\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.375/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.375)Cited by:[Ethical Considerations](https://arxiv.org/html/2606.06788#Sx2.p1.1)\.
- S\. Whitfield and M\. A\. Hofmann \(2023\)Elicit: ai literature review research assistant\.Public Services Quarterly19\(3\),pp\. 201–207\.External Links:[Document](https://dx.doi.org/10.1080/15228959.2023.2224125),[Link](https://doi.org/10.1080/15228959.2023.2224125),https://doi\.org/10\.1080/15228959\.2023\.2224125Cited by:[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1)\.
- N\. Wu, M\. Gong, L\. Shou, S\. Liang, and D\. Jiang \(2023a\)Large language models are diverse role\-players for summarization evaluation\.InNatural Language Processing and Chinese Computing: 12th National CCF Conference, NLPCC 2023, Foshan, China, October 12–15, 2023, Proceedings, Part I,Berlin, Heidelberg,pp\. 695–707\.External Links:ISBN 978\-3\-031\-44692\-4,[Link](https://doi.org/10.1007/978-3-031-44693-1_54),[Document](https://dx.doi.org/10.1007/978-3-031-44693-1%5F54)Cited by:[§2\.1](https://arxiv.org/html/2606.06788#S2.SS1.p1.1)\.
- Y\. Wu, W\. Sheffield, K\. Mahowald, and J\. J\. Li \(2023b\)Elaborative simplification as implicit questions under discussion\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 5525–5537\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.336/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.336)Cited by:[3rd item](https://arxiv.org/html/2606.06788#S4.I1.i3.p1.1),[§5\.2](https://arxiv.org/html/2606.06788#S5.SS2.p1.2)\.
- W\. Xu, G\. Zhu, X\. Zhao, L\. Pan, L\. Li, and W\. Wang \(2024\)Pride and prejudice: LLM amplifies self\-bias in self\-refinement\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15474–15492\.External Links:[Link](https://aclanthology.org/2024.acl-long.826/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.826)Cited by:[§4\.1](https://arxiv.org/html/2606.06788#S4.SS1.p1.2)\.
- J\.D\. Zamfirescu\-Pereira, R\. Y\. Wong, B\. Hartmann, and Q\. Yang \(2023\)Why johnny can’t prompt: how non\-ai experts try \(and fail\) to design llm prompts\.InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems,CHI ’23,New York, NY, USA\.External Links:ISBN 9781450394215,[Link](https://doi-org.proxy2.library.illinois.edu/10.1145/3544548.3581388),[Document](https://dx.doi.org/10.1145/3544548.3581388)Cited by:[§2\.2](https://arxiv.org/html/2606.06788#S2.SS2.p1.1)\.
- C\. Zhang, Y\. Liu, L\. Nie, J\. M\. Rzeszotarski, Y\. Huang, and T\. August \(2026\)From words to widgets for controllable llm generation\.External Links:2604\.10925,[Link](https://arxiv.org/abs/2604.10925)Cited by:[§1](https://arxiv.org/html/2606.06788#S1.p3.1),[§6](https://arxiv.org/html/2606.06788#S6.p1.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2020\)BERTScore: evaluating text generation with bert\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SkeHuCVFDr)Cited by:[Figure 1](https://arxiv.org/html/2606.06788#S3.F1)\.
- Y\. Zhang, M\. Khalifa, L\. Logeswaran, M\. Lee, H\. Lee, and L\. Wang \(2023\)Merging generated and retrieved knowledge for open\-domain QA\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 4710–4728\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.286/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.286)Cited by:[§4\.1](https://arxiv.org/html/2606.06788#S4.SS1.p1.2)\.
- Y\. Zhou and B\. Di Eugenio \(2025\)Veracity bias and beyond: uncovering LLMs’ hidden beliefs in problem\-solving reasoning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 21298–21310\.External Links:[Link](https://aclanthology.org/2025.acl-long.1034/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1034),ISBN 979\-8\-89176\-251\-0Cited by:[Ethical Considerations](https://arxiv.org/html/2606.06788#Sx2.p1.1)\.

## Appendix A Appendix

### A\.1 Prompt for Interactive Complexity

We use 5 levels because this allows for some nuance between the versions of text, and because a popular video series by WIRED called “5 Levels” \(https://www\.wired\.com/video/series/5\-levels/\) focuses on communicating specialized concepts to 5 different audiences, which aligns well with the motivation for this work\. To populate the slider with 5 responses for the user study and to run the model evaluation, we experimented with different prompt characteristics \(the option that worked better in each pair is shown in bold\): **specifying audience** vs\. generic levels, **single** vs\. multi\-prompt, and defining endpoints of complexity vs\. **defining all 5 complexity levels**\. Our final prompt for the model evaluation is shown in Fig\.[5](https://arxiv.org/html/2606.06788#A1.F5); we allowed the model to include a “References” section in its responses during the user study\. We also enforced a JSON schema\.

We chose the audience labels to range from College student to Senior Researcher because the wording of the questions in ScholarQABench implied that the inquirer had at least a college education\. However, as mentioned in Sec\.[5\.5](https://arxiv.org/html/2606.06788#S5.SS5), we tested using the audiences from the WIRED 5 levels series \(Child, Teen, College student, Grad student, Expert\) to check the effect of the audience labels, using the same prompt as a template\.

College student to Senior researcher Prompt

You are given a user query and a report responding to that query as input\.
Using information only from the report and query, rewrite the chatbot response into 5 versions where each version responds with a level of complexity appropriate for a College student, Junior Ph\.D\. student, Senior Ph\.D\. student, Postdoctoral researcher, or Senior researcher respectively\. That is, Version 1 should be written with a level of complexity that a college student would understand, Version 2 should be written with a level of complexity that a junior Ph\.D\. student would understand, and so on\. Assume that the user understands the words in their query\.
Preserve mentions of papers and citations by preserving abbreviated, in\-text citations\. To be clear, do not write out a References section, just use in\-text citations like this: \(Vaswani, 2017\)\.
Do not add any additional text like greetings or ornamental words\.

Figure 5: Prompt for Interactive Complexity
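
The prompt asks for 5 versions and the responses were constrained by a JSON schema, but the schema itself is not reproduced here. The following is a minimal sketch of what such a schema and a matching parser could look like. The key names (`version_1` to `version_5`), `RESPONSE_SCHEMA`, `build_user_message`, and `parse_versions` are all assumptions made for illustration, not the authors' implementation.

```python
import json

# Audience labels taken from the prompt in Fig. 5.
AUDIENCES = [
    "College student",
    "Junior Ph.D. student",
    "Senior Ph.D. student",
    "Postdoctoral researcher",
    "Senior researcher",
]

# Hypothetical key names: the paper does not publish the schema it enforced.
VERSION_KEYS = [f"version_{i}" for i in range(1, len(AUDIENCES) + 1)]

# Hypothetical JSON schema with one required string field per complexity level.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {key: {"type": "string"} for key in VERSION_KEYS},
    "required": VERSION_KEYS,
    "additionalProperties": False,
}


def build_user_message(query: str, report: str) -> str:
    """Attach the query and report that the Fig. 5 instructions operate on."""
    return f"Query:\n{query}\n\nReport:\n{report}"


def parse_versions(raw: str) -> list[str]:
    """Parse a model reply and return the versions from least to most complex."""
    data = json.loads(raw)
    missing = [key for key in VERSION_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing versions: {missing}")
    return [data[key] for key in VERSION_KEYS]
```

Returning the versions as an ordered list keeps the level index (1 to 5) implicit in the position, which is convenient when a slider selects among them.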

### A\.2 Additional Data for Increased Sample Size

Tab\.[5](https://arxiv.org/html/2606.06788#A1.T5) shows how often models moved complexity in the correct direction per transition when evaluating the larger sample of 459 ScholarQABench queries \(Sec\.[5\.4](https://arxiv.org/html/2606.06788#S5.SS4)\)\.

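| Transition | Model | Jargon | Info\. | Length |
| --- | --- | --- | --- | --- |
| 1 to 2 | GPT\-5\.1 | 93\.03 | 84\.75 | 100\.0 |
| 1 to 2 | GPT\-5 mini | 86\.06 | 59\.48 | 92\.37 |
| 1 to 2 | Claude Sonnet 4\.5 | 100\.0 | 76\.03 | 100\.0 |
| 1 to 2 | Claude Sonnet 4\.5 \+ Thinking | 100\.0 | 81\.05 | 99\.78 |
| 1 to 2 | DeepSeek\-V3\.1 | 99\.56 | 69\.50 | 99\.56 |
| 2 to 3 | GPT\-5\.1 | 44\.01 | 79\.30 | 100\.0 |
| 2 to 3 | GPT\-5 mini | 51\.20 | 66\.01 | 99\.35 |
| 2 to 3 | Claude Sonnet 4\.5 | 74\.29 | 76\.69 | 100\.0 |
| 2 to 3 | Claude Sonnet 4\.5 \+ Thinking | 79\.74 | 76\.47 | 99\.78 |
| 2 to 3 | DeepSeek\-V3\.1 | 62\.96 | 61\.22 | 100\.0 |
| 3 to 4 | GPT\-5\.1 | 35\.51 | 67\.76 | 100\.0 |
| 3 to 4 | GPT\-5 mini | 50\.76 | 48\.58 | 77\.34 |
| 3 to 4 | Claude Sonnet 4\.5 | 56\.43 | 80\.17 | 100\.0 |
| 3 to 4 | Claude Sonnet 4\.5 \+ Thinking | 63\.83 | 76\.03 | 98\.04 |
| 3 to 4 | DeepSeek\-V3\.1 | 48\.37 | 59\.04 | 99\.35 |
| 4 to 5 | GPT\-5\.1 | 25\.05 | 70\.59 | 99\.56 |
| 4 to 5 | GPT\-5 mini | 47\.93 | 69\.28 | 98\.04 |
| 4 to 5 | Claude Sonnet 4\.5 | 45\.32 | 79\.96 | 100\.0 |
| 4 to 5 | Claude Sonnet 4\.5 \+ Thinking | 67\.32 | 66\.45 | 85\.19 |
| 4 to 5 | DeepSeek\-V3\.1 | 45\.97 | 64\.92 | 99\.56 |

Table 5: Model performance per transition for College → Sr\. Res\. \(n=459\)\. Each model’s performance is shown as the percent of inputs where the measure goes in the correct direction at each transition\. A higher percentage means that the model performed better at that transition, by more often increasing complexity according to these measures\.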
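
The percentages in Tab. 5 can be read as a per-transition tally over queries. A minimal sketch of that calculation follows, assuming one score per query per level and assuming that only a strict increase counts as the correct direction. The function name and input layout are illustrative and are not taken from the authors' code.

```python
from typing import Sequence


def percent_correct_direction(scores: Sequence[Sequence[float]]) -> list[float]:
    """Percent of queries whose measure strictly increases at each transition.

    scores[i][k] is the value of one complexity measure (e.g. Jargon)
    for query i at level k + 1. The result has one entry per transition
    (1 to 2, 2 to 3, ...), rounded to two decimals.
    """
    if not scores:
        return []
    n_levels = len(scores[0])
    results = []
    for k in range(n_levels - 1):
        # Count queries where the next level scores higher than the current one.
        increased = sum(1 for row in scores if row[k + 1] > row[k])
        results.append(round(100.0 * increased / len(scores), 2))
    return results


# Two toy queries with five levels each (made-up numbers, not paper data).
toy_scores = [
    [1, 2, 3, 3, 5],
    [2, 1, 4, 5, 4],
]
print(percent_correct_direction(toy_scores))
```

Ties are treated as failures here. That choice is an assumption, and counting ties as correct would raise the percentages.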

### A\.3 Model Configuration Details

Tab\.[6](https://arxiv.org/html/2606.06788#A1.T6) shows the model configurations for the model evaluation\. We prompted all but 1 input only once; the 1 input that we had to regenerate was initially refused by Sonnet 4\.5, possibly due to the topic\. We ran the evaluation during November 2025 through January 2026\.

| Model | Hyperparameters |
| --- | --- |
| gpt\-5\.1\-2025\-11\-13 | temperature = 0 |
| gpt\-5\-mini\-2025\-08\-07 | none specified |
| Claude Sonnet 4\.5 | temperature = 0; max\_tokens = 9000 |
| Claude Sonnet 4\.5 \+ Thinking | max\_tokens = 11192 \(includes 3000 for thinking\) |
| deepseek\.v3\-v1:0 | temperature = 0; maxTokens = 9000 |

Table 6: Model configurations

### A\.4 Flesch\-Kincaid Reading Ease Score Data

Fig\.[6](https://arxiv.org/html/2606.06788#A1.F6) shows the changes in Flesch\-Kincaid scores across levels labeled with our initial audiences and the WIRED audiences\. Flesch\-Kincaid scores follow the opposite trend to the measures in the main paper; a decrease in the score indicates higher complexity\. Similar to Jargon and Information \(Fig\.[2](https://arxiv.org/html/2606.06788#S5.F2)\), we observe inconsistent increases in complexity \(i\.e\., decreasing Flesch\-Kincaid scores\)\.

![Refer to caption](https://arxiv.org/html/2606.06788v1/x5.png)

Figure 6: Flesch\-Kincaid Reading Ease score data\. These plots show the distributions of changes in the Flesch\-Kincaid scores between consecutive levels for the two sets of audience labels that we prompted with\. Since a higher Flesch\-Kincaid score means that the text is more readable, decreasing the scores as complexity increases is desirable\.
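
For reference, the sketch below computes a reading ease score and its change between consecutive levels. It assumes the appendix uses the standard Flesch Reading Ease formula, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The counts are passed in directly because the paper does not say which tokenizer or syllable counter was used, and both function names are illustrative.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease; higher values mean more readable text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def score_changes(scores: list[float]) -> list[float]:
    """Differences between consecutive levels (level k+1 minus level k).

    With reading ease, a negative change means the text got harder to read,
    which is the desired direction as the audience level rises.
    """
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


# Toy example: 100 words, 5 sentences, 150 syllables (made-up counts).
print(flesch_reading_ease(words=100, sentences=5, syllables=150))
```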

### A\.5 Complexity of Responses in User Study

In this section, we report the changes in complexity for the aggregated 45 responses across the 16 participants from the interactive condition of the user study\. We use GPT\-5 mini to supply the responses as a low\-latency option\. The distributions generally match those of GPT\-5 mini in the model evaluation \(Fig\.[2](https://arxiv.org/html/2606.06788#S5.F2)\), confirming that the user study was a fair instantiation of the model evaluation\.

![Refer to caption](https://arxiv.org/html/2606.06788v1/x6.png)

Figure 7: Model performance during user study based on complexity measures\. Between consecutive levels of complexity, the interactive condition had comparable changes in Jargon, Information, and Length to the model evaluation \(Fig\.[2](https://arxiv.org/html/2606.06788#S5.F2)\), varying between increasing and decreasing\. We use GPT\-5 mini to supply the responses as a low\-latency option\. Each point in the scatter overlay represents one of the 45 participant queries\. Ideally, all measures should increase between levels \(i\.e\., all points should be strictly above the zero line\)\. Extreme outliers removed for visualization\.

### A\.6 Participant Details

Tab\.[7](https://arxiv.org/html/2606.06788#A1.T7) lists participants’ background and LLM usage\. Most participants held an academic affiliation and had a STEM background\. To ascertain general LLM usage, we asked “How often do you use LLMs and LLM\-infused applications?”, and the choices were:

- Never
- Rarely, about 1–2 times a month
- Sometimes, about 3–4 times a month
- Often, about twice a week
- Always, about once or more a day

Some wording came from Kim et al\. \([2025a](https://arxiv.org/html/2606.06788#bib.bib87)\)\.

For LLM usage in research, we asked “Do you use LLMs and LLM\-infused applications for any parts of the research process? Select all that apply\.”, and the choices were:

- Yes, for information seeking \(e\.g\., discovering papers, generating summaries, discovering topics\)
- Yes, for editing writing \(e\.g\., fixing grammar or rephrasing, looking up synonyms, formatting papers\)
- Yes, for direct writing \(e\.g\., rewriting to another style, shortening, summarizing\)
- Yes, for data cleaning & analysis \(e\.g\., cleaning and reformatting data, statistical reporting, qualitative analysis\)
- Yes, for ideation & framing \(e\.g\., brainstorming research questions, coming up with ways to frame a paper, getting inspiration for methods\)
- Yes, for data generation \(e\.g\., generating synthetic data, producing examples and labels\)
- Yes, for other purposes \(Select this box and Other and specify below in Other\)
- No
- Other:

These answer choices mostly came from Liao et al\. \([2025](https://arxiv.org/html/2606.06788#bib.bib1)\)\.

| Background | Field of Study | Research Experience | LLM Usage | LLM Research Usage |
| --- | --- | --- | --- | --- |
| Engineer | Computer Science | 7 years | Always | Information seeking, Ideation & framing, Other: Coding, Indexing |
| Postdoctoral researcher | Comparative Literature | 8 years | Rarely | Information seeking, Ideation & framing |
| 1st year Ph\.D\. student | Computer Science | 1 year | Rarely | None |
| Master’s student | Computer Science | 2 years | Always | Information seeking, Editing writing, Direct writing, Data cleaning & analysis, Ideation & framing |
| 2nd year Ph\.D\. student | Computer Science | 4 years | Always | Information seeking, Editing writing, Direct writing, Data cleaning & analysis, Ideation & framing, Data generation, Other: Getting feedback on paper drafts |
| Undergraduate student | Statistics, Actuarial Science | None | Often | Information seeking, Editing writing, Direct writing, Data cleaning & analysis, Ideation & framing |
| 5th year Ph\.D\. student | Applied Mathematics | 4 years | Always | Information seeking, Editing writing, Direct writing |
| Master’s student | Computer Science | 2 years | Always | Information seeking, Editing writing |
| Master’s student | Computer Science | 1 year | Always | Information seeking, Editing writing, Ideation & framing |
| 4th year Ph\.D\. student | Computer Science | 5 years | Often | Information seeking, Other: Coding |
| 4th year Ph\.D\. student | Computer Science | 4 years | Always | Information seeking, Editing writing, Direct writing, Data cleaning & analysis, Ideation & framing |
| 2nd year Ph\.D\. student | Computational Biology | 4 years | Often | Information seeking, Data cleaning & analysis |
| 2nd year Ph\.D\. student | Computer Science | 2 years | Always | Editing writing, Other: Coding |
| College Graduate | Statistics, Computer Science | None | Always | Information seeking, Editing writing, Data cleaning & analysis |
| Master’s student | Electrical & Computer Engineering | 2 years | Always | Information seeking, Editing writing, Ideation & framing |
| 5th year Ph\.D\. student | Computer Science | 6 years | Always | Editing writing, Ideation & framing |

Table 7: Participants’ background and LLM usage

### A\.7 User Study Interface

We extended a chat interface from the Streamlit framework \(https://docs\.streamlit\.io/develop/tutorials/chat\-and\-llm\-apps/build\-conversational\-apps\) to build our conventional and interactive interfaces\. More details are provided in Sec\.[3\.1](https://arxiv.org/html/2606.06788#S3.SS1)\.
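
The interface code is not included in this appendix. As a rough sketch of how a complexity slider could sit on top of a Streamlit chat message, the example below switches between 5 pre-generated versions of one response. `LEVELS`, `pick_version`, and `render` are hypothetical names, and the layout is an assumption rather than the study's actual implementation.

```python
# Audience labels from the prompt in Fig. 5, ordered from least to most complex.
LEVELS = [
    "College student",
    "Junior Ph.D. student",
    "Senior Ph.D. student",
    "Postdoctoral researcher",
    "Senior researcher",
]


def pick_version(versions: list[str], level: str) -> str:
    """Return the response text that corresponds to the chosen audience level."""
    return versions[LEVELS.index(level)]


def render(versions: list[str]) -> None:
    """Draw one assistant message whose text follows a complexity slider."""
    # Imported lazily so pick_version can be used without Streamlit installed.
    import streamlit as st

    level = st.select_slider("Complexity", options=LEVELS, value=LEVELS[0])
    with st.chat_message("assistant"):
        st.markdown(pick_version(versions, level))
```

Because all 5 versions are generated up front, moving the slider only changes which stored text is displayed and does not trigger another model call.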

### A\.8 User Study Interview Guide

We include the interview guide that we used to structure questions during the user study\. The questions aim to probe participants’ perceptions of and interactions with both interfaces\.

The following questions were asked for both conditions:

- How did you feel about the level of complexity in the chat responses? Were there times that you felt that you had too much or too little information? Was the amount of complexity appropriate or overwhelming?
- How did you use the system generally? What worked well and what didn’t?
- What about the text in the responses would you change if anything?
- How did you feel about not reading the papers? \(if the participant expressed something about this\)

The following questions were asked only for the interactive condition:

- How did you feel about having a choice of 5 responses with varying complexity as opposed to one response with fixed complexity? \[after participants had experienced both conditions\]
- When did you find the slider helpful or not helpful?
- When did you ask follow\-up questions versus use the slider versus use the response as is?
- Is there anything about the progression of the text between the 5 levels that you would want to control?
- Did the variations in complexity match what you expected from the 5 levels? If not, what would you have wanted?
- Was there anything hard about going through the different levels?
- Did you read all 5 levels, why or why not?
- Could you tell the difference between the 5 levels \[or the levels you did read\] and how?
- Does 5 levels feel appropriate?
- Do you want the slider for every response?

### A\.9 Study Materials

Participants were guided through a 1\-hour Zoom session where they interacted with two interfaces, one conventional chat interface and one with interactive complexity as described in Sec\.[3\.1](https://arxiv.org/html/2606.06788#S3.SS1)\. For the conventional chat interface, participants were told “This is a simple chatbot where you ask a question and it provides a response”\. Since the interactive complexity version had slider features \(Fig\.[1](https://arxiv.org/html/2606.06788#S3.F1)\), participants watched a 1\-minute video tutorial\. After learning how to use the interfaces, participants were then given task instructions as shown in Fig\.[9](https://arxiv.org/html/2606.06788#A1.F9)\. The task questions are displayed in Fig\.[8](https://arxiv.org/html/2606.06788#A1.F8); participants were also provided with a note\-taking box in the same document\. After the study, participants were provided with the disclaimer shown in Fig\.[10](https://arxiv.org/html/2606.06788#A1.F10)\.

![Refer to caption](https://arxiv.org/html/2606.06788v1/x7.png)

Figure 8: Task Questions

Task Instructions Given to Participants

Before this session, you provided a topic: \[topic provided by participant\]\. This document contains a few tasks about this topic for you to complete using the chatbot\. It also contains an area for you to take notes if you would like to\.
You will have 15 minutes to complete the tasks using only the chatbot; I will provide you with time warnings\.
Please focus on using the chatbot rather than reading through links or papers\. You can open links to verify content from the system, but do not read the entire paper\.
While doing the tasks, you will be “thinking aloud”\. Basically, we want you to tell us everything that goes through your mind from the start to the end as you complete the tasks\. When you are thinking aloud:

- There are no right or wrong answers\. You are not being tested\.
- Be honest\. If you feel confused, frustrated, or surprised, please say so\.
- Keep talking\. Demonstrate your entire process, what you’re doing, why you’re doing it, what you’re thinking, and what you’re expecting
- Focus on the task\. Complete the tasks to the best of your ability, while simultaneously describing your process\.

For example, if you click something, you could say “I am clicking this because” followed by your reason\. If you’re confused, you could say “I’m not sure what to do here because” again followed by a reason\.

Figure 9: Task Instructions Given to Participants

Post\-Study Disclaimer

To create a realistic setting, we showed AI answers that are directly from responses from an actual AI system\. As known, AI systems can make up information\. Please note that the AI answers you saw in this study may have been inaccurate, incomplete, or inconsistent, even when they sounded convincing\.

Figure 10: Post\-Study Disclaimer