Incentives Of EdTech: A Systematic Review Of EduNLP Research

arXiv cs.CL Papers

Summary

This systematic review of 204 EduNLP papers reveals that teachers are underrepresented as beneficiaries despite being most affected, real-world deployment is rare, and ethical engagement tends toward acknowledgement rather than action, highlighting a tension between private-sector incentives and foundational educational needs.

arXiv:2606.13691v1 Announce Type: cross Abstract: While the Natural Language Processing community has dedicated significant resources in developing educational technologies (EdTech) that support this shift, it remains unclear whose interests are being best served among the stakeholders of education. In this paper, we present a systematic literature review of 204 papers published in venues of the Association for Computational Linguistics' Special Interest Group on Building Educational Applications in 2024 and 2025, and validate these against EdTech papers from the wider ACL Anthology. By examining stakeholder inclusion and the prioritisation of research tasks, our findings reveal a critical tension: a push and pull between private-sector incentives and the foundational needs of educational infrastructure. Our analysis reveals that teachers are systematically under-represented as beneficiaries of research (33.3%) despite being the most affected, that real-world deployment remains rare (9.8%), and that ethical engagement tends toward acknowledgement rather than action. Drawing on exemplary papers in our corpus, we offer concrete recommendations for more responsible EduNLP research practices.

Cached at: 06/15/26, 08:59 AM

# Incentives Of EdTech: A Systematic Review Of EduNLP Research
Source: [https://arxiv.org/html/2606.13691](https://arxiv.org/html/2606.13691)
Gabrielle Gaudeau1, Aoife O’Driscoll1, Jasper Degraeuwe2, Andrew Caines1, Donya Rooein3, Zeerak Talat4

1ALTA Institute, Computer Laboratory, University of Cambridge \(UK\), 2Ghent University \(Belgium\), 3Bocconi University \(Italy\), 4University of Edinburgh \(UK\)\. Correspondence: gjg34@cam\.ac\.uk, ao514@cam\.ac\.uk, jasper\.degraeuwe@ugent\.be, apc38@cam\.ac\.uk, donya\.rooein@unibocconi\.it, z@zeerak\.org

###### Abstract

The global teacher shortage is pushing schools and institutions towards an ever\-greater reliance on artificial intelligence\. While the Natural Language Processing community has dedicated significant resources to developing educational technologies \(EdTech\) that support this shift, it remains unclear whose interests are being best served among the stakeholders of education\.

In this paper, we present a systematic literature review of 204 papers published in venues of the Association for Computational Linguistics’ Special Interest Group on Building Educational Applications in 2024 and 2025, and validate these against EdTech papers from the wider ACL Anthology\. By examining stakeholder inclusion and the prioritisation of research tasks, our findings reveal a critical tension: a push and pull between private\-sector incentives and the foundational needs of educational infrastructure\. Our analysis reveals that teachers are systematically under\-represented as beneficiaries of research \(33\.3%\) despite being the most affected, that real\-world deployment remains rare \(9\.8%\), and that ethical engagement tends toward acknowledgement rather than action\. Drawing on exemplary papers in our corpus, we offer concrete recommendations for more responsible EduNLP research practices\.

## 1 Introduction

Education has long been a domain of inspiration for Artificial Intelligence \(AI\) and Natural Language Processing \(NLP\)\. From early feature\-based auto\-markers \(e\.g\., e\-rater®; ,[2006](https://arxiv.org/html/2606.13691#bib.bib246)\) to large language model \(LLM\)\-powered intelligent tutoring systems \(ITS\) \(e\.g\., Khanmigo, [https://www\.khanmigo\.ai](https://www.khanmigo.ai/), by Khan Academy\), the goals have remained constant: for technology to extend the reach of good teaching and to support learners who might otherwise go without\. These are meaningful goals – socially urgent, technically challenging, and worthy of scientific investment – and their urgency has only grown in recent years with global teacher shortages \(UNESCO, [2026](https://arxiv.org/html/2606.13691#bib.bib245)\), widening equity gaps \(World Inequality Lab, [2026](https://arxiv.org/html/2606.13691#bib.bib243)\), and the rapid uptake of commercial AI products for education \(Gomes, [2026](https://arxiv.org/html/2606.13691#bib.bib244)\)\. Taken together, these pressures have made the question of the role of technology in supporting education more pressing than ever\.

There is a particular risk that comes with being deeply embedded in a fast\-moving research area: the closer we are to the technical problems in front of us, the easier it is to lose sight of the overarching goal\. As researchers, we are drawn towards the datasets we know, the metrics we trust, the tasks where progress is legible\. Specialisation is necessary, but it can quietly narrow the frame of reference until the question, “Does this system work?”, crowds out the most important question: “Does this actually serve the people we said we were building it for?” This paper is, in part, an attempt to step back from that narrowing and ask plainly: as a field, are we meeting our own aspirations?

To answer this question, we conduct a systematic literature review of EduNLP research\. We survey 204 papers published in 2024 and 2025 at ACL SIGEDU venues \(the BEA \(Workshop on Innovative Use of NLP for Building Educational Applications\) and NLP4CALL \(Workshop on NLP for Computer\-Assisted Language Learning\) workshops\) and the main \*ACL conferences\. To the best of our knowledge, this is the first systematic review of EduNLP research that focuses on publications in the ACL Anthology\. For each paper, we examine its tasks, motivations, stakeholder inclusion, incentives, and engagement with ethical risks to answer three research questions:

1. RQ 1: Which tasks are prioritised in EduNLP research, what motivates them, and in which contexts are the resulting systems deployed?
2. RQ 2: Who are the stakeholders of EduNLP research, how are they included, and whose interests does the research serve?
3. RQ 3: What risks, concerns, and limitations are raised, and to what extent does the research mitigate them?

Our findings show that teachers are systematically under\-represented as beneficiaries in EduNLP research, real\-world deployment is rare, and ethical engagement tends toward acknowledgement rather than action\. We identify exemplary counter\-examples and derive from them a set of concrete recommendations for the field\.

## 2 Related Work

Education has been a domain of innovation for millennia\. Digital technology is a modern feature of this long history: much of the early pioneering work on AI in the twentieth century was directed towards educational aims and applications in AI in Education \(AIED\) \(Newell et al\., [1958](https://arxiv.org/html/2606.13691#bib.bib300); Minsky, [1974](https://arxiv.org/html/2606.13691#bib.bib299); Papert, [1980](https://arxiv.org/html/2606.13691#bib.bib301); Doroudi, [2023](https://arxiv.org/html/2606.13691#bib.bib302)\)\. In recent years, growing interest in LLMs has brought increasing application to education \(Caines et al\., [2023](https://arxiv.org/html/2606.13691#bib.bib303); Davis et al\., [2024](https://arxiv.org/html/2606.13691#bib.bib309); Pack et al\., [2024](https://arxiv.org/html/2606.13691#bib.bib308)\), further evidenced by the growing popularity of the annual Workshop on Innovative Use of NLP for Building Educational Applications \(BEA\), the founding of the ACL SIGEDU in 2017 \([https://sig\-edu\.org/](https://sig-edu.org/)\), and investment by large technology firms into products such as Google’s LearnLM \([https://cloud\.google\.com/solutions/learnlm](https://cloud.google.com/solutions/learnlm)\) and OpenAI’s ChatGPT Edu \([https://openai\.com/chatgpt/education/](https://openai.com/chatgpt/education/)\)\.

EdTech covers a wide range of applications for educational purposes, often involving AI or NLP\. There have been several surveys on EdTech and its use in various domains \(Ahmad et al\., [2024](https://arxiv.org/html/2606.13691#bib.bib289); Benedetto et al\., [2023](https://arxiv.org/html/2606.13691#bib.bib305); Hidayat and Firmanti, [2024](https://arxiv.org/html/2606.13691#bib.bib304)\) spanning classroom support, virtual learning environments, websites, and tutoring chatbots\. In this paper, we focus on ethical matters, which have received growing attention in AI and NLP more broadly, including the identification of different bias types throughout the “machine learning life cycle” \(Suresh and Guttag, [2021](https://arxiv.org/html/2606.13691#bib.bib306)\)\.

Within EdTech, several surveys and position papers have addressed ethical issues\. Yan et al\. \([2025](https://arxiv.org/html/2606.13691#bib.bib1)\) present a systematic review of 34 publications involving EdTech with AI in schools or higher education from 2020 to 2024, reporting a “constellation of recurring ethical tensions” relating to algorithmic bias, data privacy, transparency, accountability, and academic integrity\. They observe that these are known issues with AI applications, and recommend co\-design with stakeholders, an emphasis on explainability, regulatory improvements, and AI literacy training for teachers\. Alfredo et al\. \([2024](https://arxiv.org/html/2606.13691#bib.bib296)\) arrive at similar conclusions from a review of 108 papers relating to human\-centred or participatory design and learning analytics\.

Fu and Weng \([2024](https://arxiv.org/html/2606.13691#bib.bib284)\) conduct a systematic review of empirical studies focused on EdTech and responsible AI, reaching similar conclusions to Yan et al\. \([2025](https://arxiv.org/html/2606.13691#bib.bib1)\) based on 40 selected papers\. They present a vision for “responsible human\-centered AIED” which includes core principles of Fairness and Equity, Transparency and Intelligibility, Agency and Autonomy, Privacy and Security, and Beneficence and Non\-maleficence\. Holmes et al\. \([2022](https://arxiv.org/html/2606.13691#bib.bib286)\) surveyed EdTech researchers, reporting high interest in but low confidence about ethical issues, attributed to a lack of ethics training in AI\-related courses\. They propose a framework for ethics in AIED aimed at ensuring “ethical by design” research, and emphasise the importance of cross\-disciplinary engagement\. Taken together, these reviews converge on a shared diagnosis: ethical considerations are widely recognised in principle but inconsistently integrated in practice\.

This review extends this prior work by including research published throughout 2025, and by considering tasks, contexts, stakeholders, incentives, and risks across 204 EduNLP papers from \*ACL main conference and workshop proceedings\.

## 3 Methodology

### Search Protocol\.

We collected all papers from the BEA and NLP4CALL workshops published in 2024 and 2025\. We also conducted a search of the ACL Anthology using the Anthology API \([https://acl\-anthology\.readthedocs\.io/py\-v0\.5\.3/api/](https://acl-anthology.readthedocs.io/py-v0.5.3/api/)\) for papers published in main \*ACL and associated conferences whose title or abstract contained at least one of 38 EduNLP\-relevant search terms \(e\.g\., “student modeling”; see Appendix [B](https://arxiv.org/html/2606.13691#A2) for a complete list of venues and search terms\)\. The search was conducted on January 21, 2026\. This sampling approach affords an in\-depth view into contemporary trends at the expense of longitudinal analyses\. This search resulted in 191 papers from the two workshops, and 316 papers from \*ACL conferences\.

For BEA and NLP4CALL, we randomly sampled 25% of contributions for each shared task, with a minimum sampling threshold of 5 papers for each task\. We further included all shared task overview papers, as these represent a qualitatively distinct type of contribution\. For the \*ACL main conference papers, we reviewed all abstracts for relevance to educational applications, excluding 214 papers as non\-relevant\. The remaining 102 papers were stratified by publication year, venue, and search term, yielding a sample of 44 papers\. This resulted in a final sample of 160 papers from BEA and NLP4CALL workshops, and 44 papers from \*ACL conferences, for a total of 204 papers \(see Table [8](https://arxiv.org/html/2606.13691#A5.T8) for paper details\)\. Figure [1](https://arxiv.org/html/2606.13691#S3.F1) shows the distribution of papers across venues and years\.
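
The per-task sampling rule for shared task contributions (25% of contributions, with a floor of 5 papers) can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the toy paper identifiers, the fixed seed, the choice to round the 25% quota up, and the decision to take every paper when a task has fewer than 5 contributions are all assumptions that the text does not specify.

```python
import math
import random


def sample_shared_task_papers(papers_by_task, fraction=0.25, minimum=5, seed=0):
    """Sample a fraction of each task's papers, subject to a minimum count.

    Assumptions (not stated in the paper): the quota is rounded up, and a task
    with fewer papers than the minimum contributes all of its papers.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = {}
    for task, papers in papers_by_task.items():
        target = max(minimum, math.ceil(fraction * len(papers)))
        k = min(target, len(papers))  # cannot sample more papers than exist
        sampled[task] = rng.sample(papers, k)
    return sampled


# Hypothetical shared tasks with 40, 12, and 3 system papers.
papers_by_task = {
    "task_a": [f"a{i:02d}" for i in range(40)],
    "task_b": [f"b{i:02d}" for i in range(12)],
    "task_c": [f"c{i:02d}" for i in range(3)],
}
sampled = sample_shared_task_papers(papers_by_task)
sizes = {task: len(ids) for task, ids in sampled.items()}
print(sizes)
```

Under these assumptions the floor only matters for small tasks: a task with 12 contributions yields 5 papers instead of 3, which keeps small shared tasks from being represented by a handful of papers.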

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/papers_by_venue_v2.png) Figure 1: Number of papers per venue and year\. We reviewed a total of 204 papers \(160 BEA\+NLP4CALL papers and 44 ACL Anthology main conference papers\)\.
### Data extraction\.

Data extraction was conducted manually by three of the authors using a shared extraction schema \(see Appendix[C](https://arxiv.org/html/2606.13691#A3)\)\. The schema captures: the specific task addressed; datasets used and their availability; the explicit motivation for the research; stakeholders mentioned and included \(with associated quotes\); the level of stakeholder inclusion; the deployment context of any system; incentives \(both explicit and implicit\) that the research serves; ethical risks and concerns raised; measures taken to address those risks; and future directions pertaining to risk, ethics, or aspiration\.

Extraction proceeded in three phases\. In the first phase \(1\), a single paper was annotated collaboratively to develop and validate the schema\. In the second phase \(2\), annotators independently reviewed a shared batch of 25 papers, meeting to discuss schema revisions and resolve ambiguities\. The shared batch was a stratified sample from our corpus of 204 papers \(12\.3%\) based on venue and year of publication; it included 6 BEA 2024, 10 BEA 2025, 2 NLP4CALL 2024, 1 NLP4CALL 2025, 1 EACL 2024, 1 LREC\-COLING, 1 NAACL 2025, 1 ACL 2025, and 2 Findings 2025 papers\. Note that phase \(2\) was conducted in an iterative manner: following phase \(1\), each time the schema was modified or extended, all annotators updated their previous phase \(2\) annotation to reflect the revised guidelines\. In the third and final phase \(3\), the remaining papers were reviewed independently by three authors\. Extracting data took an annotator on average 45 minutes per paper \(ranging between 30–60 minutes\); we estimate that the review took a combined total of about 190 hours to complete\.
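
The 190-hour estimate is not broken down in the text. The short calculation below shows one reading that reproduces it, under the assumption (ours, not the authors') that each of the 25 shared-batch papers was annotated by all three annotators and every remaining paper by exactly one annotator, at the reported average of 45 minutes per paper. All variable names are illustrative.

```python
TOTAL_PAPERS = 204
SHARED_BATCH = 25        # papers annotated independently by every annotator
ANNOTATORS = 3
MINUTES_PER_PAPER = 45   # reported average; the reported range is 30 to 60

# Assumption: shared-batch papers are annotated three times, the rest once.
shared_passes = SHARED_BATCH * ANNOTATORS
single_passes = TOTAL_PAPERS - SHARED_BATCH
total_passes = shared_passes + single_passes

total_hours = total_passes * MINUTES_PER_PAPER / 60
low_hours = total_passes * 30 / 60    # if every paper took the minimum time
high_hours = total_passes * 60 / 60   # if every paper took the maximum time

print(total_passes, total_hours, low_hours, high_hours)
```

Under this reading there are 254 annotation passes, and the 30 to 60 minute range brackets the total between 127 and 254 hours. Time spent re-annotating the shared batch after schema revisions is not counted here.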

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/papers_by_task_v2.png) Figure 2: Number of papers per high\-level task\. See Table [8](https://arxiv.org/html/2606.13691#A5.T8) for the detailed mapping\.
### Agreement\.

Inter\-annotator agreement \(IAA\) was measured on phase \(2\)’s independently reviewed shared batch\. Table [1](https://arxiv.org/html/2606.13691#A4.T1) in Appendix [D](https://arxiv.org/html/2606.13691#A4) shows the agreement for the free\-text dimensions of our schema based on the Percentage Agreement \(PA; Roaché, [2017](https://arxiv.org/html/2606.13691#bib.bib50)\) measure \(see Tables [6](https://arxiv.org/html/2606.13691#A4.T6) and [7](https://arxiv.org/html/2606.13691#A4.T7) for illustrations of how free\-text agreement was computed\)\. For the four multi\-label dimensions, we report both Krippendorff’s $\alpha$ \(Krippendorff, [2011](https://arxiv.org/html/2606.13691#bib.bib51)\) and PA in Tables [2](https://arxiv.org/html/2606.13691#A4.T2), [3](https://arxiv.org/html/2606.13691#A4.T3), [4](https://arxiv.org/html/2606.13691#A4.T4) and [5](https://arxiv.org/html/2606.13691#A4.T5)\.

For the free\-text fields, PA ranges between 0\.52 \(for implicit incentives\) and 1 \(for deployment\)\. For the multi\-label dimensions, PA was consistently high \(0\.84–0\.94 overall\), while $\alpha$ was more variable\. Agreement on the presence of stakeholders was generally moderate to strong \($\alpha$ = 0\.49–0\.7 overall, with agreement on teachers being particularly high at $\alpha$ = 0\.79–0\.84\)\. Agreement on stakeholder inclusion level and risk engagement level was lower \($\alpha$ = 0\.52–0\.61 overall\)\. Taking into account the qualitative and inherently interpretative nature of the annotation task \(especially for dimensions such as risks/concerns\), we consider these agreement values to be sufficiently high to justify the independent reviewing in phase \(3\)\.
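
To make the gap between PA and Krippendorff's α concrete, the sketch below computes both on a toy binary label (for example, "teachers mentioned: yes/no") assigned by three annotators to four papers. It is an illustration under stated assumptions rather than the authors' procedure: PA is taken here as the share of items on which all annotators agree, α is the standard nominal-data formulation built from a coincidence matrix, and the toy data and function names are invented.

```python
from collections import Counter
from itertools import permutations


def percentage_agreement(units):
    """Share of units on which every annotator chose the same label."""
    agreed = sum(1 for labels in units if len(set(labels)) == 1)
    return agreed / len(units)


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` holds one list of labels per annotated item. Units with fewer
    than two labels carry no agreement information and are skipped.
    """
    # Coincidence matrix: each ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in the unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for i, j in permutations(range(m), 2):
            coincidences[(labels[i], labels[j])] += 1 / (m - 1)

    totals = Counter()
    for (label, _), weight in coincidences.items():
        totals[label] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("need at least one unit with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1 - observed / expected


# Hypothetical annotations: 4 papers, 3 annotators, binary label.
toy_units = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
pa = percentage_agreement(toy_units)
alpha = krippendorff_alpha_nominal(toy_units)
print(round(pa, 3), round(alpha, 3))
```

Because α corrects for the agreement expected by chance given the label distribution, it comes out lower than PA on the same toy data, which is consistent with PA being uniformly high while α varies more.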

## 4 Tasks, Motivations, Deployment

### Tasks\.

Figure [2](https://arxiv.org/html/2606.13691#S3.F2) shows the distribution of high\-level tasks across our corpus of papers\. Automated assessment – i\.e\., automated essay scoring \(AES\) and automated short\-answer scoring \(ASAG\) – is by far the most common task \(56 papers\), followed by grammatical error correction \(GEC, 30 papers\) and text simplification and complexity prediction \(28 papers\)\. Content generation \(22 papers\), intelligent tutoring systems \(ITS, 22 papers\), dataset creation and knowledge extraction \(19 papers\), and knowledge tracing and learner modelling \(8 papers\) are also represented\. The “Other” task type includes a variety of research, most often relating to the novel capabilities of LLMs \(e\.g\., multimodal assessment, alignment with human eye\-tracking data, and discourse evaluation\) and detecting LLM\-generated texts\.

The dominance of language assessment and feedback tasks is striking: taken together, AES/ASAG and GEC account for almost half of the corpus\. This reflects a longstanding priority in EduNLP: indeed, automated assessment has been an active area of research for decades, benefitting from well\-established datasets \(e\.g\., ASAP; Hamner et al\., [2012](https://arxiv.org/html/2606.13691#bib.bib53)\)\. However, this prevalence also raises questions about whose priorities are being served: automated assessment and feedback tools are of direct commercial value to large\-scale testing organisations and EdTech companies\.

### Shared tasks\.

The NLP4CALL 2025 shared task introduced multilingual GEC \(Masciolini et al\., [2025](https://arxiv.org/html/2606.13691#bib.bib20)\), a direction of particular importance given that GEC, while already the second most represented task in our corpus, has historically been dominated by English\-language systems\. Broadening GEC to multilingual settings introduces non\-trivial challenges around low\-resource languages, cross\-lingual transfer, and the availability of annotated learner corpora, and a shared task framing is well\-suited to mobilising community effort around these barriers\. On the other hand, the BEA 2024 shared tasks addressed automated prediction of item difficulty and response time \(Yaneva et al\., [2024a](https://arxiv.org/html/2606.13691#bib.bib129)\), and multilingual lexical simplification \(Shardlow et al\., [2024](https://arxiv.org/html/2606.13691#bib.bib124)\); the 2025 shared task addressed pedagogical ability assessment of AI\-powered tutors \(Kochmar et al\., [2025](https://arxiv.org/html/2606.13691#bib.bib312)\)\. We note that all three of these problems receive less attention in the non\-shared\-task literature\.

This suggests that shared tasks are playing a valuable role in broadening the community’s agenda, including towards less commercially obvious but educationally important problems such as pedagogical quality assessment, and towards underserved languages in otherwise established tasks\. Beyond their immediate proceedings, shared tasks also exert a longer\-lasting influence through the datasets they produce; resources like the W&I\+LOCNESS dataset, which was introduced for the BEA 2019 Shared Task on GEC \(Bryant et al\., [2019](https://arxiv.org/html/2606.13691#bib.bib52)\), tend to attract sustained reuse by the community \(as illustrated by Figure [3](https://arxiv.org/html/2606.13691#S4.F3)\), and thus continue to shape which problems remain visible and tractable long after the shared task itself has concluded\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/datasets_used_v2_wide.png) Figure 3: Dataset popularity \(i\.e\., the number of times a dataset was used, and not only mentioned\)\. We do not report private datasets given their absence of references\.
### Datasets\.

Papers in the corpus reported using 284 distinct datasets, which were used a combined total of 460 times \(373 for public datasets, 33 for those available upon request and 54 for private datasets\)\. Figure [11](https://arxiv.org/html/2606.13691#A6.F11) shows that 73\.9% of datasets used are publicly available, 7\.4% are only available upon request or through paid licences, and 18\.7% are private\. While the high proportion of public datasets is a positive indicator for reproducibility, Figure [3](https://arxiv.org/html/2606.13691#S4.F3) reveals a high concentration of usage around a small number of datasets: the top three – W&I\+LOCNESS \(Bryant et al\., [2019](https://arxiv.org/html/2606.13691#bib.bib52)\), ASAP \(Hamner et al\., [2012](https://arxiv.org/html/2606.13691#bib.bib53)\), and CoNLL\-2014 \(Ng et al\., [2014](https://arxiv.org/html/2606.13691#bib.bib49)\) – together account for 12\.9% of total public dataset usage \(373 uses\), with a long tail of datasets used only once\.

This concentration partially reflects the task distribution noted previously: namely that AES and GEC are both well\-established\. However, this also raises questions about whether research findings generalise beyond the narrow slice of learner populations, languages, and educational contexts that these datasets represent\. We return to this concern in Section [7](https://arxiv.org/html/2606.13691#S7)\.

### Motivations\.

During extraction, we took note of the explicit motivation each paper gave for its research, and later classified each into one or more of seven high\-level categories\. Figure [12](https://arxiv.org/html/2606.13691#A6.F12) shows that the most common motivation type across our corpus is to “help a stakeholder” \(110 papers\), followed by addressing a pedagogical or ethical concern \(82\), and assuming the role of a stakeholder \(53\)\. Technical motivation alone, with no stated stakeholder benefit, accounts for 43 papers, which is a non\-trivial proportion \(21\.1%\)\. Figure [4](https://arxiv.org/html/2606.13691#S4.F4) reveals the stakeholder composition underlying papers’ motivations: learners and students are invoked in 61\.1% of papers with stakeholder\-based motivation \(96 of 157 papers\), making them by far the most frequently cited intended beneficiary\. Teachers appear in 40\.1% of such papers \(63 of 157 papers\), though they are most commonly invoked as a pressure point, referenced in terms of the cost, time, or burden associated with their labour, and implicitly positioned as a bottleneck that automation should relieve\. This framing matters\. A motivation to reduce teacher burden through automation is meaningfully different from one that seeks to augment teacher capability or support teacher agency\. In a number of papers in our corpus, teachers appear in the motivation but then disappear from the research design entirely: they are not consulted, included in evaluation, or named as beneficiaries of the results\. We discuss this pattern and its implications further in Section [6](https://arxiv.org/html/2606.13691#S6)\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/motivations_who_v2.png) Figure 4: Distribution of different stakeholders for the three stakeholder\-based motivations in Figure [12](https://arxiv.org/html/2606.13691#A6.F12)\.
### Context deployment\.

Figure [5](https://arxiv.org/html/2606.13691#S4.F5) shows that 79\.4% of papers \(162 papers\) present systems or models that are never deployed to real\-world users\. Only 9\.8% of papers report genuine deployment\. We label resource and survey papers as “Not a system paper\.” Non\-deployment is not itself a failing: fundamental research that develops methods, datasets, or evaluation frameworks may legitimately precede any deployment\. More concerning is that papers describing non\-deployed systems rarely discuss the pathway to deployment: the educational contexts in which the system might operate, the stakeholders who would need to be involved, or the risks real\-world deployment would introduce\. This creates a body of research that is optimised for benchmark performance in conditions that may bear little resemblance to the classrooms, tutoring sessions, and assessment environments it nominally serves\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/deployment_v2_bar.png) Figure 5: Papers that deployed their method to real\-world users or tested it on pre\-existing real\-world data\.

## 5 The Roles of Stakeholders

### Author affiliations and acknowledged entities\.

Figure [15](https://arxiv.org/html/2606.13691#A9.F15) shows that the paper author affiliations in our corpus are geographically concentrated: the United States accounts for the largest single\-country share of author affiliations \(58 papers\), followed by Germany \(29 papers\), China \(23 papers\), and other European countries \(similar observations can be made on the origin of the acknowledged entities in Figure [14](https://arxiv.org/html/2606.13691#A9.F14)\)\. Figure [6](https://arxiv.org/html/2606.13691#S5.F6) shows that universities dominate author affiliations \(188 papers\), followed by research institutes \(75\) and companies \(30\)\. Funding acknowledgements are concentrated within governmental bodies \(80\), with national science foundations of China and the US appearing the most frequently \(Figure [16](https://arxiv.org/html/2606.13691#A9.F16)\)\. Industry acknowledgements \(e\.g\., Microsoft\) appear in a small but non\-trivial number of papers \(20\)\. While industry involvement in research funding is not inherently problematic, it creates potential conflicts of interest that deserve explicit discussion, particularly in a field where commercial EdTech products are directly shaped by research agendas\. Notably, few papers in our corpus explicitly disclose or discuss potential conflicts of interest arising from their funding sources, a gap that mirrors findings in adjacent fields \(Garrett et al\., [2020](https://arxiv.org/html/2606.13691#bib.bib285)\)\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/affiliations_types_v2.png) Figure 6: Number of papers per type of author affiliation and type of entity mentioned in acknowledgements\.
### Stakeholders mentioned or included\.

Figure [7](https://arxiv.org/html/2606.13691#S5.F7) shows that learners and students are mentioned in the most papers overall \(170 papers\), followed by teachers \(97\), and domain experts \(88\)\. However, mention does not equate to inclusion: the proportion of mentioned stakeholders who are also actively included in the research is substantially lower across all groups\. Among teachers, 26\.8% of papers that mention them also include them in the research \(26 papers\)\. For learners, 22\.4% of mentioning papers include them \(38 papers\)\. Domain experts show a much higher inclusion rate \(56\.8%\), in part because they are frequently recruited as annotators or raters\. Most strikingly, parents were only mentioned in two papers, despite their having such an important role in children’s education \(Kostov, [2026](https://arxiv.org/html/2606.13691#bib.bib86)\)\.
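
The inclusion rates quoted above follow directly from the reported paper counts. The snippet below recomputes them as a quick check, taking the rate to be included papers divided by mentioning papers (our reading of the text); the dictionary and function names are illustrative, and only the two groups for which both counts are stated are covered.

```python
# Counts reported in the text: papers mentioning vs. actively including a group.
mentioned = {"teachers": 97, "learners and students": 170}
included = {"teachers": 26, "learners and students": 38}


def inclusion_rate(included_count, mentioned_count):
    """Percentage of mentioning papers that also include the stakeholder."""
    if mentioned_count == 0:
        raise ValueError("stakeholder is never mentioned")
    return 100 * included_count / mentioned_count


rates = {
    group: round(inclusion_rate(included[group], mentioned[group]), 1)
    for group in mentioned
}
print(rates)
```

Read this way, the denominators matter: learners are included in more papers than teachers in absolute terms (38 against 26) but at a lower rate, because they are mentioned far more often.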

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/stakeholders_v2.png) Figure 7: Number of papers per type of stakeholder included or mentioned only in the research\.

Figure [13](https://arxiv.org/html/2606.13691#A8.F13) reveals the overall distribution of inclusion levels across all included stakeholders: 47\.0% of inclusions are classified as Middling \(involved in data evaluation or annotation, but with no input on research design\), 32\.1% as High \(integral to research design and completion\), and 20\.9% as Low \(test subjects in data collection only\)\. Figure [8](https://arxiv.org/html/2606.13691#S5.F8) shows that this breakdown varies substantially by stakeholder type\. Other than paper authors themselves, schools and universities are most likely to be included at a High level \(76\.9%\), while teachers, when included at all, are predominantly included at a Middling level \(65\.5%\), most often as annotators\. Learners are most often included as test subjects \(59\.5%\)\. The implication is that even when stakeholders are formally included, they are rarely positioned as agents who shape the research; they are more often positioned as instruments of it\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/inclusion_level_specific_v2.png) Figure 8: Level of inclusion of included stakeholders by stakeholder type; we distinguish 3 levels: High \(integral to research design & completion\), Middling \(involved in data evaluation or annotation, without input on research design\), and Low \(test subjects in data collection only\)\.
### Incentives\.

Figure [9](https://arxiv.org/html/2606.13691#S5.F9) shows the distribution of stakeholders explicitly mentioned as benefiting from the research alongside those we identified as implicit beneficiaries\. We note that the identification of implicit beneficiaries is the most subjective dimension of our annotation: it required annotators to infer who stands to gain from a piece of research beyond what authors themselves state, based on the nature of the task, the deployment context, and the funding sources involved\. For instance, a paper developing an AES system for standardised testing, funded by a testing organisation, was coded as implicitly benefiting industry, even if no such benefit was named\. Due to the subjective nature of this dimension, inter\-annotator agreement was accordingly lower \(0\.53; Table [1](https://arxiv.org/html/2606.13691#A4.T1)\), and these findings should be read as indicative rather than definitive\.

Learners and students are the most frequently named explicit beneficiary \(125 papers\)\. Teachers stand out starkly here: 80\.9% of their appearances are explicit \(55 papers\)\. Stated differently, teachers are almost never the unstated but evident beneficiary of research; when they benefit, papers say so\. However, the vast majority of papers do not position them as benefiting at all\. On the other hand, non\-profit organisations, industry and governmental bodies appear prominently as implicit beneficiaries\. That is, while they are not named in the paper as intended beneficiaries, the research clearly serves their interests\. This is most visible in the task\-level breakdown in Figure [17](https://arxiv.org/html/2606.13691#A10.F17): automated assessment research \(the largest task category in the corpus\) consistently benefits learners and industry, while teachers and examiners are sparsely represented\. The commercial relationship here is direct: automated scoring tools reduce the need for human markers and are of clear value to large\-scale testing organisations\. For ITS, learners dominate, with limited acknowledgement of teachers\. GEC research shows the broadest stakeholder spread, in part because GEC tools serve not only learners and teachers but also the general public who use writing assistance tools in everyday tasks\.
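
The teacher figures above can be tied back to the 33\.3% beneficiary share quoted in the abstract with a short back\-calculation. This is a sketch under assumptions: the total number of teacher appearances is not stated in the text and is inferred here from the 55 explicit papers and the 80\.9% explicit share, and each appearance is assumed to correspond to one paper. Variable names are illustrative.

```python
CORPUS_SIZE = 204
explicit_teacher_papers = 55   # reported in the text
explicit_share = 0.809         # reported: 80.9% of teacher appearances are explicit

# Assumption: back-calculate total appearances, since the total is not stated.
total_teacher_appearances = round(explicit_teacher_papers / explicit_share)
implicit_teacher_appearances = total_teacher_appearances - explicit_teacher_papers

# Assumption: one appearance per paper, so the total is a paper count.
corpus_share = 100 * total_teacher_appearances / CORPUS_SIZE

print(total_teacher_appearances, implicit_teacher_appearances, round(corpus_share, 1))
```

On this reading, roughly 68 papers position teachers as beneficiaries, explicitly or implicitly, which matches the 33\.3% figure given in the abstract and leaves about two thirds of the corpus in which teachers do not benefit at all.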

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/incentives_who_v2.png)Figure 9: Stakeholders explicitly stated as benefiting from the research, as well as those that we judged to benefit without being explicitly mentioned \(Implicit\)\. Note that a stakeholder may be an explicit beneficiary in one respect and an implicit beneficiary in another\.

## 6 Risks, Concerns, Limitations, and Measures Taken

### Risks, concerns and limitations raised\.

Figure [18](https://arxiv.org/html/2606.13691#A11.F18) shows the distribution of risks, concerns, and limitations explicitly raised by paper authors, organised into six high\-level categories\. We note that inter\-annotator agreement was lower for this dimension than for others \(0\.57; Table [1](https://arxiv.org/html/2606.13691#A4.T1)\), owing to the need to assess coverage across a large and varied set of concerns; these results should therefore be read as indicative trends rather than precise counts\. The most commonly noted concerns are methodology limitations \(69 papers\) and dataset limitations \(60\), followed by lack of generalisability and language\-specificity \(56\), risk of bias \(46\), and task/domain\-specific limitations \(44\), reflecting the tendency of research to develop systems for specific languages or educational contexts that may not transfer\. Several important risk categories are raised much less frequently: risk of hallucination appears in only 12 papers, risk of dual\-use in 6, and safety concerns in 26\. Within the contextualising research category, the gap between research and real\-world application is noted in 32 papers and the need for human\-in\-the\-loop in 19\. This suggests some awareness of deployment limitations, but that awareness rarely translates into direct mitigation \(Figure [19](https://arxiv.org/html/2606.13691#A12.F19)\)\. Data protection and anonymisation concerns are raised in 37 papers, while informed consent and fair compensation for included stakeholders, critical ethical requirements for human\-subjects research, appear in only 11 and 10 papers respectively\. That human\-subjects protections remain among the least commonly raised concerns in a corpus that routinely collects learner data and recruits human annotators is itself a notable finding\.

### Engagement with risks\.

Figure [19](https://arxiv.org/html/2606.13691#A12.F19) distinguishes three levels of engagement with stated risks: High \(directly mitigated or discussed in substantial depth\), Middling \(discussed as part of future work\), and Low \(briefly mentioned only\)\. Across most risk categories, the majority of engagement is at a Low or Middling level\. High engagement is most consistently found in the participant and data concern category: fair compensation for included stakeholders \(100\.0%\) and informed consent \(72\.7%\) are the most actively addressed concerns, though both are raised by relatively few papers to begin with\. By contrast, the largest categories show the weakest engagement: methodology limitations are 98\.6% Middling or Low, and dataset limitations 90\.0% Middling or Low\. Risk of bias, one of the most frequently raised concerns at 46 papers, is engaged at a High level in only 15\.2% of cases\. The gap between research and real\-world application and the need for human\-in\-the\-loop, two concerns with clear implications for responsible deployment, are predominantly Middling or Low\. This pattern suggests a community that is aware of the ethical dimensions of its work but has not yet developed consistent norms for acting on them within the scope of individual papers\.
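To make these percentages more concrete, the following sketch converts them back into approximate paper counts, using the number of papers reported as raising each concern \(69, 60, 46, 11 and 10\)\. It assumes that each percentage is computed over the papers raising that concern, which is our reading rather than something stated explicitly; the recovered counts are therefore back\-of\-the\-envelope estimates, not figures from the annotation data, and the function names are illustrative\.

```python
# Papers raising each concern and the engagement percentage reported for it.
# Treating the percentage as "share of papers raising the concern" is an assumption.
REPORTED = {
    # concern: (papers raising it, reported %, engagement level the % refers to)
    "fair compensation": (10, 100.0, "High"),
    "informed consent": (11, 72.7, "High"),
    "risk of bias": (46, 15.2, "High"),
    "methodology limitations": (69, 98.6, "Middling or Low"),
    "dataset limitations": (60, 90.0, "Middling or Low"),
}


def implied_count(total: int, percent: float) -> int:
    """Nearest whole number of papers that `percent` of `total` corresponds to."""
    return round(total * percent / 100)


def is_consistent(total: int, percent: float) -> bool:
    """True if the implied count reproduces the reported percentage to one decimal."""
    count = implied_count(total, percent)
    return round(100 * count / total, 1) == percent


for concern, (total, percent, level) in REPORTED.items():
    count = implied_count(total, percent)
    print(
        f"{concern}: about {count} of {total} papers at {level} engagement "
        f"(consistent with reported figure: {is_consistent(total, percent)})"
    )
```

Read this way, the 98\.6% figure for methodology limitations would leave roughly one paper out of 69 engaging at a High level, while the 15\.2% figure for risk of bias would correspond to roughly 7 of 46\.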

### Future work\.

Figure [20](https://arxiv.org/html/2606.13691#A12.F20) shows the distribution of the areas of future work explicitly mentioned in the papers\. We report future work specifically related to risks, concerns or higher aspirations rather than purely technical work; 21 papers in our sample do not discuss any such future work\. Four high\-level themes emerge within the future work discussed: stakeholder inclusion, technical development, expanding the scope of the research, and engaging with issues emerging from the research\. Of these, the most frequently mentioned category is expanding the scope of the research, with expanding the data \(42\), language selection \(36\), and subject domain \(35\) the most common fine\-grained directions\. EduNLP research is often performed at language\- or task\-specific levels, resulting in common limitations which translate to clear future directions\. The least common high\-level category is engaging with issues emerging from research, with fine\-grained categories including interpretability and bias mitigation \(16 and 14 papers respectively\), exploring performance\-cost trade\-offs \(12\), and initiating broader discussions in the EduNLP space\. Within the fine\-grained categories overall, the most frequently referenced future direction is general technical improvements related to the paper’s risks and concerns \(91 papers\)\. Even though we excluded purely technical work from this analysis, the primary focus of EduNLP researchers remains technical\. Within the stakeholder inclusion category, user study and user inclusion \(37\) and integration into real\-world systems \(32\) are the most common directions, suggesting some awareness that current work falls short\. Analysis of the future directions in our sample therefore reveals a tendency towards prioritising empirically\-motivated fine\-grained technical work rather than ethically\-driven broader work\. In part, this may be due to an imbalance in available resources for conducting such research\.

## 7 Discussion: Opportunities, Recommendations, Aspirations

### Opportunities\.

Our findings reveal some structural gaps in EduNLP research that constitute genuine opportunities for the field\. First, teachers are under\-represented both as beneficiaries and as active participants, despite their central role in education\. This represents a significant misalignment between stated purpose and actual design\. Research that nominally aims to support education but systematically excludes the professional educators who mediate it risks building tools that are technically sophisticated but pedagogically ill\-fitting, or that automate away precisely the human judgement that makes good teaching effective\. Second, the gap between research development and real\-world deployment is striking: only 9\.8% of system papers report deployment in live educational settings\. This reflects a missing discourse about what responsible deployment looks like: which stakeholders need to be involved, what evaluation is appropriate for real students and teachers, and what accountability mechanisms should be in place\. This gap is further sharpened by the concentration of datasets around high\-stakes standardised testing contexts, and by the dominance of assessment tasks which, given their direct commercial value to the testing industry, risk pulling the research agenda toward institutional efficiency and away from the needs of the full range of educational stakeholders\.

### Recommendations\.

Drawing on exemplary papers in our corpus, we offer three concrete recommendations for the EduNLP community:

1. Co\-design with teachers and learners from the outset\. Research that positions stakeholders as genuine co\-designers, rather than test subjects or future\-work items, produces better\-grounded systems and more honest evaluation\. Galletti and Cesaroni \([2025](https://arxiv.org/html/2606.13691#bib.bib88)\) offer a replicable model for this: conducting focus groups and questionnaires with teachers at different stages of system development surfaced concerns around transparency, autonomy, and pedagogical alignment that would not have emerged from technical evaluation alone\. See also Huovinen and Hämäläinen \([2025](https://arxiv.org/html/2606.13691#bib.bib14)\)\. Their work echoes principles of design justice \(Costanza\-Chock, [2020](https://arxiv.org/html/2606.13691#bib.bib125)\), which seek to decentre technical expertise in favour of lived experience and domain expertise – in all regards save technical implementation – as a mechanism for ensuring that those affected by a system retain meaningful agency in shaping it\. As it stands, we found no true expression of design justice in any of the reviewed papers; however, Wang et al\. \([2025c](https://arxiv.org/html/2606.13691#bib.bib164)\) embody some aspects of it\. Though their paper does not present a system, it demonstrates that design justice principles can be embedded even at the resource creation stage: their math word problem benchmark was developed through structured interviews with primary school math teachers, whose pedagogical expertise directly shaped what counts as a meaningful visual, ensuring that future systems trained or evaluated on this benchmark will be held to a standard defined by teachers\.
2. Make deployment contexts and costs explicit\. Authors should describe the educational context in which their system could be or has been deployed and the stakeholder roles involved, and should provide an honest account of computational, financial, and human costs alongside claimed benefits \(Akter et al\., [2025](https://arxiv.org/html/2606.13691#bib.bib85); Gupta et al\., [2025](https://arxiv.org/html/2606.13691#bib.bib77); Li and Ng, [2024](https://arxiv.org/html/2606.13691#bib.bib54)\)\.
3. Adopt structured ethical reflection and act on it\. Our data show that named concerns rarely translate into mitigation within the same paper\. Venues should normalise the expectation that ethical risks raised are addressed in the current work, not deferred to future work\. Checklists like the ARR Responsible NLP Checklist already support this: they prompt authors to interrogate their own design choices \(e\.g\., [Goto et al\.](https://arxiv.org/html/2606.13691#bib.bib157), [2025a](https://arxiv.org/html/2606.13691#bib.bib147), who voluntarily detail a number of the checklist items\) and give reviewers a structured basis for evaluating ethical engagement\.

### Aspirations\.

The potential for AI in education is genuine: it could improve access to education, helping reduce inequalities related to geography, language, resources, and infrastructure, for learners who might otherwise go without\. It could also help free educators from repetitive and time\-consuming tasks so they can concentrate on the relational aspect of education that systems cannot and should not replace\. Automated tools can also mitigate some human weaknesses that threaten fairness in assessment: fatigue, inconsistency, and unconscious biases\. The question is not whether AI belongs in education, but whether we are stewarding its development responsibly\. The exemplary papers in our corpus demonstrate that responsible stewardship is possible\. Our aspiration for the field is a research community that treats educational infrastructure as a site of social responsibility, not merely technical opportunity\. The trajectory the field takes will depend on choices that are made now about which tasks to prioritise, whose voices to include, and what counts as success\. Harding \([2025](https://arxiv.org/html/2606.13691#bib.bib39)\), reflecting on AI in language assessment, frames this as a choice between utopian and dystopian futures: one in which assessment technology is context\-sensitive, transparent, connected with learning, and deeply oriented toward justice; as opposed to one driven by expediency, opacity, and the logic of scale\.

## 8 Conclusion

This paper has presented a systematic review of 204 EduNLP papers published at ACL SIGEDU venues and main \*ACL conferences in 2024 and 2025, examining tasks, motivations, stakeholder inclusion, incentive structures, and ethical engagement\. Our analysis reveals a field that is technically productive but structurally misaligned with key educational stakeholders, particularly teachers who are rarely included in research and almost never positioned as implicit beneficiaries\. At the same time, our corpus contains exemplary work that demonstrates what responsible, stakeholder\-grounded EduNLP research looks like in practice\. The norms and practices embedded in these papers are neither technically burdensome nor novel in principle\. What is needed is for the community to adopt them consistently, and for publication venues to create the conditions in which doing so is expected rather than exceptional\. We hope this review serves as both a diagnostic and a resource: a map of where the field currently stands, and a set of orientations for where it should go\.

## 9 Limitations

This review has several limitations that should be noted\. First, while our corpus of 204 papers is broad in scope, it is not exhaustive, meaning that some relevant papers will have been missed\. Our focus on ACL Anthology venues also means that work published in AIED journals, learning analytics conferences, and EdTech\-specific venues falls outside our scope: the picture we paint is of the NLP community specifically, not the broader field\. Second, annotation of inherently interpretive dimensions, particularly stakeholder inclusion level and risk engagement level, carries subjectivity that agreement scores can only partially represent\. We report these as indicative trends rather than precise counts, and readers should bear this in mind when interpreting figures\. Third, our corpus covers 2024–2025 only; while this captures the most recent work, it is a short window and trends may not generalise to earlier or future periods\. An interesting direction for future work would be to extend the temporal scope to work published before the release of ChatGPT \(OpenAI, [2026](https://arxiv.org/html/2606.13691#bib.bib126)\) in November 2022, which would allow for a direct comparison of research priorities, stakeholder inclusion, and ethical engagement before and after the widespread availability of generative AI\. Finally, as researchers embedded in the EduNLP community ourselves, we are not neutral observers: our framing of what constitutes meaningful stakeholder inclusion or adequate ethical engagement reflects our own values, which we have tried to make explicit throughout\.

### On financial disclosures\.

While complete disclosure of which entities have funded a piece of research is desirable for the transparency it affords, disclosure can be structurally limited\. For example, some grant funders – particularly military funders – may require non\-disclosure; nation\-wide regulation may limit disclosure; research can be funded across multiple grants; and work may be conducted on an entirely voluntary basis, among many other reasons\. We therefore see financial disclosure as a spectrum between complete opacity and complete transparency\. We advocate that researchers take a maximalist approach to financial disclosure, i\.e\., that they share as much information as they can in a given situation\. A transparency\-maximalist approach will afford greater insight into how research into educational technologies is being shifted, both forcefully and subtly, by the interests of different entities\.

## 10 Ethical considerations

All papers surveyed in this review are publicly available through the ACL Anthology; no private or unpublished materials were used\. No human subjects were involved in the review itself\. The annotation process involved researchers reading and characterising the work of others, which carries a risk of misrepresentation; we have sought to mitigate this through iterative schema development, inter\-annotator agreement measurement, and the use of direct quotes to ground our characterisations\. Our normative claims – that teachers are under\-served, that ethical engagement is insufficient, that commercial incentives distort research agendas – are recommendations and observations, not accusations about individual papers or authors\. We acknowledge that we are ourselves part of the community we critique, and that future reviews may find similar gaps in our own work\. This paper has been pre\-registered on OSF \([https://osf\.io/nhb2q/overview?view\_only=533ca23658d644a4abaf0bbd7e63087c](https://osf.io/nhb2q/overview?view_only=533ca23658d644a4abaf0bbd7e63087c)\)\.

## Acknowledgements

Gabrielle Gaudeau, Aoife O’Driscoll and Andrew Caines are supported by Cambridge University Press & Assessment\. Donya Rooein is a member of the MilaNLP group and the Data & Marketing Insights Unit of the Bocconi Institute for Data Science and Analysis\. Her research is supported through the European Research Council \(ERC\) under the European Union’s Horizon 2020 research and innovation program \(No\. 949944, INTEGRATOR\)\. We thank the anonymous reviewers for their time and valuable feedback\. Finally, we note that Claude Sonnet 4\.6 \([https://www\.anthropic\.com/claude/sonnet](https://www.anthropic.com/claude/sonnet)\) was used to improve the language and readability of the manuscript\.

## References

- K\. Ahmad, W\. Iqbal, A\. El\-Hassan, J\. Qadir, D\. Benhaddou, M\. Ayyash, and A\. Al\-Fuqaha \(2024\)Data\-driven artificial intelligence in education: a comprehensive review\.IEEE Transactions on Learning Technologies17,pp\. 12–31\.External Links:[Document](https://dx.doi.org/10.1109/TLT.2023.3314610)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p2.1)\.
- S\. Akef, D\. Meurers, A\. Mendes, and P\. Rebuschat \(2025\)Interpretable machine learning for societal language identification: modeling English and German influences on Portuguese heritage language\.InProceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning,R\. Muñoz Sánchez, D\. Alfter, E\. Volodina, and J\. Kallas \(Eds\.\),Tallinn, Estonia,pp\. 50–62\.External Links:[Link](https://aclanthology.org/2025.nlp4call-1.4/),ISBN 978\-9908\-53\-112\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- S\. S\. Akter, S\. Hunter, D\. Woo, and A\. Anastasopoulos \(2025\)Costs and benefits of AI\-enabled topic modeling in P\-20 research: the case of school improvement plans\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 460–476\.External Links:[Link](https://aclanthology.org/2025.bea-1.34/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.34),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1),[item 2](https://arxiv.org/html/2606.13691#S7.I1.i2.p1.1)\.
- R\. Alfredo, V\. Echeverria, Y\. Jin, L\. Yan, Z\. Swiecki, D\. Gašević, and R\. Martinez\-Maldonado \(2024\)Human\-centred learning analytics and AI in education: a systematic literature review\.Computers and Education: Artificial Intelligence6,pp\. 100215\.External Links:[Document](https://dx.doi.org/10.1016/j.caeai.2024.100215)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p3.1)\.
- D\. Alfter \(2024\)Out\-of\-the\-box graded vocabulary lists with generative language models: fact or fiction?\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 1–19\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.1/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- D\. Alfter \(2025\)The need for truly graded lexical complexity prediction\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 326–333\.External Links:[Link](https://aclanthology.org/2025.bea-1.25/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.25),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- B\. Alhafni and N\. Habash \(2025\)Enhancing text editing for grammatical error correction: Arabic as a case study\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 17892–17914\.External Links:[Link](https://aclanthology.org/2025.acl-long.875/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.875),ISBN 979\-8\-89176\-251\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- M\. Almasi and R\. D\. Kristensen\-McLachlan \(2025\)Alignment drift in CEFR\-prompted LLMs for interactive Spanish tutoring\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 70–88\.External Links:[Link](https://aclanthology.org/2025.bea-1.6/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.6),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- J\. An, X\. Fu, B\. Liu, X\. Zong, C\. Kong, S\. Liu, S\. Wang, Z\. Liu, L\. Yang, H\. Fan, and E\. Yang \(2025\)BLCU\-ICALL at BEA 2025 shared task: multi\-strategy evaluation of AI tutors\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 1084–1097\.External Links:[Link](https://aclanthology.org/2025.bea-1.84/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.84),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- A\. Arronte Alvarez and N\. Xie Fincham \(2025\)Automated L2 proficiency scoring: weak supervision, large language models, and statistical guarantees\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 384–397\.External Links:[Link](https://aclanthology.org/2025.bea-1.30/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.30),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- Y\. Asano, B\. Beigman Klebanov, and J\. Mikeska \(2025\)Exploring task formulation strategies to evaluate the coherence of classroom discussions with GPT\-4o\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 716–736\.External Links:[Link](https://aclanthology.org/2025.bea-1.52/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.52),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- N\. Ashok Kumar and A\. Lan \(2024\)Improving socratic question generation using data augmentation and preference optimization\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 108–118\.External Links:[Link](https://aclanthology.org/2024.bea-1.10/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- Y\. Attali and J\. Burstein \(2006\)Automated essay scoring with e\-rater® v\.2\.The Journal of Technology, Learning and Assessment4\(3\)\.External Links:[Link](https://ejournals.bc.edu/index.php/jtla/article/view/1650)Cited by:[§1](https://arxiv.org/html/2606.13691#S1.p1.1)\.
- S\. E\. Ayari and Z\. Li \(2024\)Potential of ASR for the study of L2 learner corpora\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 49–58\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.4/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- N\. Ballier and A\. Méli \(2024\)Investigating acoustic correlates of whisper scoring for L2 speech using forced alignment with the Italian component of the ISLE corpus\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 20–32\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.2/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- S\. Bannò, K\. M\. Knill, and M\. J\. F\. Gales \(2025\)Exploiting the English vocabulary profile for L2 word\-level vocabulary assessment with LLMs\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 632–646\.External Links:[Link](https://aclanthology.org/2025.bea-1.45/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.45),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- S\. Bannò, H\. K\. Vydana, K\. M\. Knill, and M\. J\. F\. Gales \(2024\)Can GPT\-4 do L2 analytic assessment?\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 149–164\.External Links:[Link](https://aclanthology.org/2024.bea-1.14/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- B\. Beigman Klebanov, M\. Suhan, T\. O’Reilly, and Z\. Wang \(2024\)From miscue to evidence of difficulty: analysis of automatically detected miscues in oral reading for feedback potential\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 459–469\.External Links:[Link](https://aclanthology.org/2024.bea-1.38/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- L\. Benedetto, P\. Cremonesi, A\. Caines, P\. Buttery, A\. Cappelli, A\. Giussani, and R\. Turrin \(2023\)A survey on recent approaches to question difficulty estimation from text\.ACM Computing Surveys55\(9\),pp\. 1–37\.External Links:[Link](https://doi.org/10.1145/3556538),[Document](https://dx.doi.org/10.1145/3556538)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p2.1)\.
- L\. Benedetto, S\. Taslimipoor, and P\. Buttery \(2025\)A survey on automated distractor evaluation in multiple\-choice tasks\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 55–69\.External Links:[Link](https://aclanthology.org/2025.bea-1.5/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.5),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- S\. Berruti, A\. Collazo, D\. Sellanes, A\. Rosá, and L\. Chiruzzo \(2024\)Automatic crossword clues extraction for language learning\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 381–390\.External Links:[Link](https://aclanthology.org/2024.bea-1.31/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- M\. Bexte, Y\. Ding, and A\. Horbach \(2025\)Increasing the generalizability of similarity\-based essay scoring through cross\-prompt training\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 225–236\.External Links:[Link](https://aclanthology.org/2025.bea-1.17/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.17),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Bexte, A\. Horbach, L\. Schützler, O\. Christ, and T\. Zesch \(2024\)Scoring with confidence? – exploring high\-confidence scoring for saving manual grading effort\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 119–124\.External Links:[Link](https://aclanthology.org/2024.bea-1.11/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Bexte and T\. Zesch \(2025\)Is lunch free yet? overcoming the cold\-start problem in supervised content scoring using zero\-shot LLM\-generated training data\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 144–159\.External Links:[Link](https://aclanthology.org/2025.bea-1.11/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.11),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- P\. Bhattacharyya and A\. Bhattacharya \(2025\)Leveraging LLMs for Bangla grammar error correction: error categorization, synthetic data, and model evaluation\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 8220–8239\.External Links:[Link](https://aclanthology.org/2025.findings-acl.431/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.431),ISBN 979\-8\-89176\-256\-5Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- L\. Bloch, J\. Rückert, and C\. Friedrich \(2025\)Towards automatic formal feedback on scientific documents\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 334–344\.External Links:[Link](https://aclanthology.org/2025.bea-1.26/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.26),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- S\. Bodnar \(2025\)A prototype authoring tool for editing authentic texts using LLMs to increase support for contextualised L2 grammar practice\.InProceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning,R\. Muñoz Sánchez, D\. Alfter, E\. Volodina, and J\. Kallas \(Eds\.\),Tallinn, Estonia,pp\. 63–71\.External Links:[Link](https://aclanthology.org/2025.nlp4call-1.5/),ISBN 978\-9908\-53\-112\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- E\. N\. V\. Boquio and P\. C\. Naval \(2024\)Beyond canonical fine\-tuning: leveraging hybrid multi\-layer pooled representations of BERT for automated essay scoring\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 2285–2295\.External Links:[Link](https://aclanthology.org/2024.lrec-main.204/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- A\. Bradford, K\. Steimel, B\. Riordan, and M\. Linn \(2024\)Building robust content scoring models for student explanations of social justice science issues\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 450–458\.External Links:[Link](https://aclanthology.org/2024.bea-1.37/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- C\. Bryant, M\. Felice, Ø\. E\. Andersen, and T\. Briscoe \(2019\)The BEA\-2019 shared task on grammatical error correction\.InProceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications,H\. Yannakoudakis, E\. Kochmar, C\. Leacock, N\. Madnani, I\. Pilán, and T\. Zesch \(Eds\.\),Florence, Italy,pp\. 52–75\.External Links:[Link](https://aclanthology.org/W19-4406/),[Document](https://dx.doi.org/10.18653/v1/W19-4406)Cited by:[§4](https://arxiv.org/html/2606.13691#S4.SS0.SSS0.Px2.p2.1),[§4](https://arxiv.org/html/2606.13691#S4.SS0.SSS0.Px3.p1.1)\.
- O\. Bulut, G\. Gorgun, and B\. Tan \(2024\)Item difficulty and response time prediction with large language models: an empirical analysis of USMLE items\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 522–527\.External Links:[Link](https://aclanthology.org/2024.bea-1.44/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- A\. Caines, L\. Benedetto, S\. Taslimipoor, C\. Davis, Y\. Gao, Ø\. Andersen, Z\. Yuan, M\. Elliott, R\. Moore, C\. Bryant, M\. Rei, H\. Yannakoudakis, A\. Mullooly, D\. Nicholls, and P\. Buttery \(2023\)On the application of large language models for language teaching and assessment technology\.InProceedings of the Empowering Education with LLMs – the Next\-Gen Interface and Content Generation Workshop at AIED,External Links:[Link](https://arxiv.org/abs/2307.08393)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p1.1)\.
- Y\. Cao, T\. Wang, L\. Xu, Z\. Wang, and M\. Cai \(2025\)CxGGEC: construction\-guided grammatical error correction\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 6143–6156\.External Links:[Link](https://aclanthology.org/2025.acl-long.307/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.307),ISBN 979\-8\-89176\-251\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- D\. Carpenter, W\. Min, S\. Lee, G\. Ozogul, X\. Zheng, and J\. Lester \(2024\)Assessing student explanations with large language models using fine\-tuning and few\-shot learning\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 403–413\.External Links:[Link](https://aclanthology.org/2024.bea-1.33/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- A\. Chakravarty, M\. Brenchley, T\. Breakspear, I\. Lewin, and Y\. Huang \(2025\)Enhancing marker scoring accuracy through ordinal confidence modelling in educational assessments\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 6: Industry Track\),G\. Rehm and Y\. Li \(Eds\.\),Vienna, Austria,pp\. 1498–1507\.External Links:[Link](https://aclanthology.org/2025.acl-industry.106/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-industry.106),ISBN 979\-8\-89176\-288\-6Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- I\. Chamieh, T\. Zesch, and K\. Giebermann \(2024\)LLMs in short answer scoring: limitations and promise of zero\-shot and few\-shot approaches\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 309–315\.External Links:[Link](https://aclanthology.org/2024.bea-1.25/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- P\. Chen, B\. Tsai, S\. K\. Wei, C\. Wang, J\. Wang, and Y\. Huang \(2025\)Mixture of ordered scoring experts for cross\-prompt essay trait scoring\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 18071–18084\.External Links:[Link](https://aclanthology.org/2025.acl-long.884/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.884),ISBN 979\-8\-89176\-251\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- R\. Chen and Y\. Zhao \(2025\)EduCSW: building a Mandarin\-English code\-switched generation pipeline for computer science learning\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 908–919\.External Links:[Link](https://aclanthology.org/2025.bea-1.68/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.68),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- Y\. Chen and X\. Li \(2024\)PLAES: prompt\-generalized and level\-aware learning framework for cross\-prompt automated essay scoring\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 12775–12786\.External Links:[Link](https://aclanthology.org/2024.lrec-main.1118/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Chifligarov, J\. Laâguidi, M\. Schellenberg, A\. Dill, A\. Timukova, A\. Drackert, and R\. Laarmann\-Quante \(2025\)Automated scoring of a German written elicited imitation test\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 237–247\.External Links:[Link](https://aclanthology.org/2025.bea-1.18/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.18),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Chitez, L\. Dinu, M\. Micluta\-Campeanu, A\. Bucur, and R\. Rogobete \(2025\)Assessing critical thinking components in Romanian secondary school textbooks: a data mining approach to the ROTEX corpus\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 780–793\.External Links:[Link](https://aclanthology.org/2025.bea-1.56/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.56),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- S\. Chu, J\. W\. Kim, B\. Wong, and M\. Y\. Yi \(2025a\)Rationale behind essay scores: enhancing S\-LLM’s multi\-trait essay scoring with rationale generated by LLMs\.InFindings of the Association for Computational Linguistics: NAACL 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 5811–5829\.External Links:[Link](https://aclanthology.org/2025.findings-naacl.322/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.322),ISBN 979\-8\-89176\-195\-7Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- Z\. Chu, J\. Xie, S\. Wang, Z\. Wang, and Q\. Wen \(2025b\)UniEDU: toward unified and efficient large multimodal models for educational tasks\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track,S\. Potdar, L\. Rojas\-Barahona, and S\. Montella \(Eds\.\),Suzhou, China,pp\. 1007–1016\.External Links:[Link](https://aclanthology.org/2025.emnlp-industry.68/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-industry.68),ISBN 979\-8\-89176\-333\-3Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.8.7.2.1.1)\.
- S\. Correa Busquets, V\. Córdova Véliz, and J\. Baier \(2025\)IALab UC at BEA 2025 shared task: LLM\-powered expert pedagogical feature extraction\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 1187–1193\.External Links:[Link](https://aclanthology.org/2025.bea-1.94/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.94),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- S\. Costanza\-Chock \(2020\)Design justice\.The MIT Press\.Cited by:[item 1](https://arxiv.org/html/2606.13691#S7.I1.i1.p1.1)\.
- P\. Cristea and S\. Nisioi \(2024\)Archaeology at MLSP 2024: machine translation for lexical complexity prediction and lexical simplification\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 610–617\.External Links:[Link](https://aclanthology.org/2024.bea-1.55/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1),[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.8.7.2.1.1)\.
- S\. Crossley, P\. Baffour, M\. Dascalu, and S\. Ruseti \(2024\)A world CLASSE student summary corpus\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 99–107\.External Links:[Link](https://aclanthology.org/2024.bea-1.9/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- S\. Dascalescu, M\. Dumitran, and M\. A\. Vasiluta \(2025\)Leveraging generative AI for enhancing automated assessment in programming education contests\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 89–99\.External Links:[Link](https://aclanthology.org/2025.bea-1.7/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.7),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- C\. Davis, A\. Caines, Ø\. E\. Andersen, S\. Taslimipoor, H\. Yannakoudakis, Z\. Yuan, C\. Bryant, M\. Rei, and P\. Buttery \(2024\)Prompting open\-source and commercial language models for grammatical error correction of English learner text\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 11952–11967\.External Links:[Link](https://aclanthology.org/2024.findings-acl.711/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.711)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p1.1)\.
- A\. de Chillaz, A\. Sotnikova, P\. Jermann, and A\. Bosselut \(2025\)Challenges for AI in multimodal STEM assessments: a human\-AI comparison\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 279–293\.External Links:[Link](https://aclanthology.org/2025.bea-1.22/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.22),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- K\. De Kuthy, L\. Girrbach, and D\. Meurers \(2025\)Automatic concept extraction for learning domain modeling: a weakly supervised approach using contextualized word embeddings\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 175–185\.External Links:[Link](https://aclanthology.org/2025.bea-1.13/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.13),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- M\. De Vrindt, R\. Bouwer, W\. Van Den Noortgate, M\. Lesterhuis, and A\. Tack \(2025\)Explaining holistic essay scores in comparative judgment assessments by predicting scores on rubrics\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 535–548\.External Links:[Link](https://aclanthology.org/2025.bea-1.39/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.39),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. De Vrindt, A\. Tack, R\. Bouwer, W\. Van Den Noortgate, and M\. Lesterhuis \(2024\)Predicting initial essay quality scores to increase the efficiency of comparative judgment assessments\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 125–136\.External Links:[Link](https://aclanthology.org/2024.bea-1.12/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- J\. Degraeuwe and P\. Goethals \(2024\)Leading by example: the use of generative artificial intelligence to create pedagogically suitable example sentences\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 33–48\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.3/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- J\. Degraeuwe \(2025\)You shall know a word’s difficulty by the family it keeps: word family features in personalised word difficulty classifiers for L2 Spanish\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 312–325\.External Links:[Link](https://aclanthology.org/2025.bea-1.24/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.24),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- Y\. Ding, J\. Lohmann, N\. Schaller, T\. Jansen, and A\. Horbach \(2024\)Transfer learning of argument mining in student essays\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 439–449\.External Links:[Link](https://aclanthology.org/2024.bea-1.36/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- K\. Doi, K\. Sudoh, and S\. Nakamura \(2024\)Automated essay scoring using grammatical variety and errors with multi\-task learning and item response theory\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 316–329\.External Links:[Link](https://aclanthology.org/2024.bea-1.26/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- S\. Doroudi \(2023\)The intertwined histories of artificial intelligence and education\.International Journal of Artificial Intelligence in Education33,pp\. 885–928\.External Links:[Link](https://doi.org/10.1007/s40593-022-00313-2)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p1.1)\.
- M\. Dumitran, M\. Buca, and T\. Moroianu \(2025\)MateInfoUB: a real\-world benchmark for testing LLMs in competitive, multilingual, and multimodal educational tasks\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 24–37\.External Links:[Link](https://aclanthology.org/2025.bea-1.3/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.3),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- M\. Durward and C\. Thomson \(2024\)Evaluating vocabulary usage in LLMs\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 266–282\.External Links:[Link](https://aclanthology.org/2024.bea-1.22/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- B\. Dutilleul, M\. Debaillon, and S\. Mathias \(2024\)ISEP\_Presidency\_University at MLSP 2024 shared task: using GPT\-3\.5 to generate substitutes for lexical simplification\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 605–609\.External Links:[Link](https://aclanthology.org/2024.bea-1.54/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- M\. Elaraby and D\. Litman \(2025\)Lessons learned in assessing student reflections with LLMs\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 672–686\.External Links:[Link](https://aclanthology.org/2025.bea-1.48/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.48),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- S\. Eltanbouly, S\. Albatarni, and T\. Elsayed \(2025\)TRATES: trait\-specific rubric\-assisted cross\-prompt essay scoring\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 20528–20543\.External Links:[Link](https://aclanthology.org/2025.findings-acl.1054/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1054),ISBN 979\-8\-89176\-256\-5Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- T\. Enomoto, H\. Kim, T\. Hirasawa, Y\. Nagai, A\. Sato, K\. Nakajima, and M\. Komachi \(2024\)TMU\-HIT at MLSP 2024: how well can GPT\-4 tackle multilingual lexical simplification?\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 590–598\.External Links:[Link](https://aclanthology.org/2024.bea-1.52/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- A\. Foret, E\. Hupel, and P\. Morvan \(2024\)Enhancing a multi\-faceted verb\-centered resource to help a language learner: the case of Breton\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 59–66\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.5/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- T\. A\. N\. Frederick Eneye, C\. F\. Ijezue, A\. Imam Amjad, M\. Amjad, S\. Butt, and G\. Castañeda\-Garza \(2025\)Advances in auto\-grading with large language models: a cross\-disciplinary survey\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 477–498\.External Links:[Link](https://aclanthology.org/2025.bea-1.35/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.35),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- Y\. Fu and Z\. Weng \(2024\)Navigating the ethical terrain of AI in education: a systematic review on framing responsible human\-centered AI practices\.Computers and Education: Artificial Intelligence7,pp\. 100306\.External Links:[Link](https://doi.org/10.1016/j.caeai.2024.100306)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p4.1)\.
- M\. Galletti and V\. Cesaroni \(2025\)From end\-users to co\-designers: lessons from teachers\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 505–516\.External Links:[Link](https://aclanthology.org/2025.bea-1.37/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.37),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1),[item 1](https://arxiv.org/html/2606.13691#S7.I1.i1.p1.1)\.
- N\. Garrett, N\. Beard, and C\. Fiesler \(2020\)More than "if time allows": the role of ethics in AI Education\.InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society \(AIES\),pp\. 272–278\.External Links:[Link](https://doi.org/10.1145/3375627.3375868),[Document](https://dx.doi.org/10.1145/3375627.3375868)Cited by:[§5](https://arxiv.org/html/2606.13691#S5.SS0.SSS0.Px1.p1.1)\.
- T\. Geng and D\. Alfter \(2025\)Towards a real\-time Swedish speech analyzer for language learning games: a hybrid AI approach to language assessment\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 186–201\.External Links:[Link](https://aclanthology.org/2025.bea-1.14/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.14),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- D\. Glandorf and D\. Meurers \(2024\)Towards fine\-grained pedagogical control over English grammar complexity in educational text generation\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 299–308\.External Links:[Link](https://aclanthology.org/2024.bea-1.24/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- B\. Gomes \(2026\)Learners and educators are AI’s new “super users”\.Google Blog: The Keyword\.Note:\[Accessed 31\-03\-2026\]External Links:[Link](https://blog.google/products-and-platforms/products/education/our-life-with-ai-2025/)Cited by:[§1](https://arxiv.org/html/2606.13691#S1.p1.1)\.
- D\. Goswami, K\. North, and M\. Zampieri \(2024\)GMU at MLSP 2024: multilingual lexical simplification with transformer models\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 627–634\.External Links:[Link](https://aclanthology.org/2024.bea-1.57/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- T\. Goto, Y\. Sakai, and T\. Watanabe \(2025a\)Gec\-metrics: a unified library for grammatical error correction evaluation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),P\. Mishra, S\. Muresan, and T\. Yu \(Eds\.\),Vienna, Austria,pp\. 524–534\.External Links:[Link](https://aclanthology.org/2025.acl-demo.50/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-demo.50),ISBN 979\-8\-89176\-253\-4Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1),[item 3](https://arxiv.org/html/2606.13691#S7.I1.i3.p1.1)\.
- T\. Goto, Y\. Sakai, and T\. Watanabe \(2025b\)Reliability crisis of reference\-free metrics for grammatical error correction\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 24913–24926\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1356/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1356),ISBN 979\-8\-89176\-335\-7Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1),[item 3](https://arxiv.org/html/2606.13691#S7.I1.i3.p1.1)\.
- T\. Goto, Y\. Sakai, and T\. Watanabe \(2025c\)Rethinking evaluation metrics for grammatical error correction: why use a different evaluation process than human?\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 1165–1172\.External Links:[Link](https://aclanthology.org/2025.acl-short.92/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-short.92),ISBN 979\-8\-89176\-252\-7Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- V\. Gupta, S\. Pal Chowdhury, V\. Zouhar, D\. Rooein, and M\. Sachan \(2025\)Are large language models for education reliable across languages?\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 612–631\.External Links:[Link](https://aclanthology.org/2025.bea-1.44/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.44),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1),[item 2](https://arxiv.org/html/2606.13691#S7.I1.i2.p1.1)\.
- A\. Gurin Schleifer, B\. Beigman Klebanov, M\. Ariely, and G\. Alexandron \(2024\)Anna Karenina strikes again: pre\-trained LLM embeddings may favor high\-performing learners\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 391–402\.External Links:[Link](https://aclanthology.org/2024.bea-1.32/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.8.7.2.1.1)\.
- B\. Hamner, J\. Morgan, lynnvandev, M\. Shermis, and T\. V\. Ark \(2012\)The Hewlett Foundation: automated essay scoring\.Kaggle\.Note:[https://kaggle\.com/competitions/asap\-aes](https://kaggle.com/competitions/asap-aes)Cited by:[§4](https://arxiv.org/html/2606.13691#S4.SS0.SSS0.Px1.p2.1),[§4](https://arxiv.org/html/2606.13691#S4.SS0.SSS0.Px3.p1.1)\.
- J\. Han and J\. D\. Choi \(2025\)Beyond linear digital reading: an LLM\-powered concept mapping approach for reducing cognitive load\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 805–817\.External Links:[Link](https://aclanthology.org/2025.bea-1.58/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.58),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- L\. Harding \(2025\)Utopian and dystopian visions: steering a course for the responsible use of artificial intelligence \(AI\) in language testing and assessment\.Language Testing42\(4\),pp\. 561–575\.External Links:[Document](https://dx.doi.org/10.1177/02655322251350717),[Link](https://doi.org/10.1177/02655322251350717)Cited by:[§7](https://arxiv.org/html/2606.13691#S7.SS0.SSS0.Px3.p1.1)\.
- A\. Hayat, B\. Khan, and M\. Hasan \(2024\)Improving transfer learning for early forecasting of academic performance by contextualizing language models\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 137–148\.External Links:[Link](https://aclanthology.org/2024.bea-1.13/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.8.7.2.1.1)\.
- J\. He and X\. Li \(2024\)Zero\-shot cross\-lingual automated essay scoring\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 17819–17832\.External Links:[Link](https://aclanthology.org/2024.lrec-main.1550/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- A\. Hidayat and P\. Firmanti \(2024\)Navigating the tech frontier: a systematic review of technology integration in mathematics education\.Cogent Education11\(1\),pp\. 2373559\.External Links:[Link](https://doi.org/10.1080/2331186X.2024.2373559)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p2.1)\.
- N\. Hjortnaes, D\. Dakota, S\. Kübler, and F\. Tyers \(2024\)Evaluating automatic pronunciation scoring with crowd\-sourced speech corpus annotations\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 67–77\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.6/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- W\. Holmes, K\. Porayska\-Pomsta, K\. Holstein, E\. Sutherland, T\. Baker, S\. Shum, O\. C\. Santos, M\. Rodrigo, M\. Cukurova, I\. Bittencourt, and K\. Koedinger \(2022\)Ethics of AI in Education: towards a community\-wide framework\.International Journal of Artificial Intelligence in Education32,pp\. 504–526\.External Links:[Document](https://dx.doi.org/10.1007/s40593-021-00239-1)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p4.1)\.
- A\. Hülsing and A\. Horbach \(2024\)Opinions are buildings: metaphors in secondary education foreign language learning\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 78–95\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.7/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- L\. Huovinen and M\. Hämäläinen \(2025\)LLM\-assisted, iterative curriculum writing: a human\-centered AI approach in Finnish higher education\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 1002–1010\.External Links:[Link](https://aclanthology.org/2025.bea-1.76/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.76),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1),[item 1](https://arxiv.org/html/2606.13691#S7.I1.i1.p1.1)\.
- F\. Ikram, A\. Scarlatos, and A\. Lan \(2025\)Exploring LLMs for predicting tutor strategy and student outcomes in dialogues\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 765–779\.External Links:[Link](https://aclanthology.org/2025.bea-1.55/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.55),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- M\. Ilagan, B\. Beigman Klebanov, and J\. Mikeska \(2024\)Automated evaluation of teacher encouragement of student\-to\-student interactions in a simulated classroom discussion\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 182–198\.External Links:[Link](https://aclanthology.org/2024.bea-1.16/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- Z\. Jiang, T\. Zhang, P\. Peng, J\. Chen, Y\. Xun, H\. Zhang, L\. Li, Y\. Li, and S\. Zhang \(2025\)Towards generating controllable and solvable geometry problem by leveraging symbolic deduction engine\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 6: Industry Track\),G\. Rehm and Y\. Li \(Eds\.\),Vienna, Austria,pp\. 1378–1398\.External Links:[Link](https://aclanthology.org/2025.acl-industry.97/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-industry.97),ISBN 979\-8\-89176\-288\-6Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- A\. Karim, Q\. Wang, and Z\. Yuan \(2025\)Beyond the score: uncertainty\-calibrated LLMs for automated essay assessment\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 19631–19636\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.992/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.992),ISBN 979\-8\-89176\-332\-6Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- A\. Katinskaia, A\. Vu, J\. Hou, U\. Vanhatalo, Y\. Wu, and R\. Yangarber \(2025\)Estimation of text difficulty in the context of language learning\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 594–611\.External Links:[Link](https://aclanthology.org/2025.bea-1.43/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.43),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- A\. Katinskaia and R\. Yangarber \(2024\)GPT\-3\.5 for grammatical error correction\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 7831–7843\.External Links:[Link](https://aclanthology.org/2024.lrec-main.692/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- A\. Kelious, M\. Constant, and C\. Coeur \(2024a\)Complex word identification: a comparative study between ChatGPT and a dedicated model for this task\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 3645–3653\.External Links:[Link](https://aclanthology.org/2024.lrec-main.323/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- A\. Kelious, M\. Constant, and C\. Coeur \(2024b\)Investigating strategies for lexical complexity prediction in a multilingual setting using generative language models and supervised approaches\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 96–114\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.8/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- D\. Kim, J\. Jo, B\. On, and I\. Lee \(2025a\)Representation\-to\-creativity \(R2C\): automated holistic scoring model for essay creativity\.InFindings of the Association for Computational Linguistics: NAACL 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 5272–5290\.External Links:[Link](https://aclanthology.org/2025.findings-naacl.292/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.292),ISBN 979\-8\-89176\-195\-7Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- E\. Kim, S\. Li, S\. Khalil, and H\. J\. Shin \(2025b\)STAIR\-AIG: optimizing the automated item generation process through human\-AI collaboration for critical thinking assessment\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 920–930\.External Links:[Link](https://aclanthology.org/2025.bea-1.69/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.69),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- M\. Kobayashi, M\. Mita, and M\. Komachi \(2024\)Large language models are state\-of\-the\-art evaluator for grammatical error correction\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 68–77\.External Links:[Link](https://aclanthology.org/2024.bea-1.6/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- E\. Kochmar, K\. Maurya, K\. Petukhova, K\. A\. Srivatsa, A\. Tack, and J\. Vasselli \(2025\)Findings of the BEA 2025 shared task on pedagogical ability assessment of AI\-powered tutors\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 1011–1033\.External Links:[Link](https://aclanthology.org/2025.bea-1.77/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.77),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1),[§4](https://arxiv.org/html/2606.13691#S4.SS0.SSS0.Px2.p1.1)\.
- Z\. Kolagar, F\. Zalkow, and A\. Zarcone \(2025\)Investigating methods for mapping learning objectives to Bloom’s revised taxonomy in course descriptions for higher education\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 415–445\.External Links:[Link](https://aclanthology.org/2025.bea-1.32/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.32),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- S\. Koo, J\. Kim, C\. Park, and H\. Lim \(2024\)Search if you don’t know\! knowledge\-augmented Korean grammatical error correction with large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 96–125\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.6/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.6)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- G\. Kostov \(2026\)The role of the parents in modern education: educational policy and interaction with the family environment\.International Journal of Didactical Studies7\.External Links:[Document](https://dx.doi.org/10.33902/ijods.202635822)Cited by:[§5](https://arxiv.org/html/2606.13691#S5.SS0.SSS0.Px2.p1.1)\.
- C\. Koutcheme, N\. Dainese, and A\. Hellas \(2024\)Using program repair as a proxy for language models’ feedback ability in programming education\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 165–181\.External Links:[Link](https://aclanthology.org/2024.bea-1.15/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- C\. Koutcheme, N\. Dainese, and A\. Hellas \(2025\)Direct repair optimization: training small language models for educational program repair improves feedback\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 564–581\.External Links:[Link](https://aclanthology.org/2025.bea-1.41/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.41),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- A\. Koyama, M\. Mita, S\. Yoon, Y\. Takama, and M\. Komachi \(2025\)Targeted syntactic evaluation for grammatical error correction\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 21108–21125\.External Links:[Link](https://aclanthology.org/2025.acl-long.1026/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1026),ISBN 979\-8\-89176\-251\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- K\. Krippendorff \(2011\)Computing Krippendorff’s alpha\-reliability\.External Links:[Link](https://api.semanticscholar.org/CorpusID:59901023)Cited by:[§3](https://arxiv.org/html/2606.13691#S3.SS0.SSS0.Px3.p1.1)\.
- A\. Kucharavy, C\. Vallez, and D\. Percia David \(2025\)LLMs protégés: tutoring LLMs with knowledge gaps improves student learning outcome\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 248–257\.External Links:[Link](https://aclanthology.org/2025.bea-1.19/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.19),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- A\. Kucheria, N\. Sawhney, and A\. Hellas \(2025\)Comparing behavioral patterns of LLM and human tutors: a population\-level analysis with the CIMA dataset\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 873–881\.External Links:[Link](https://aclanthology.org/2025.bea-1.64/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.64),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- A\. Kwako and C\. Ormerod \(2024\)Can language models guess your identity? analyzing demographic biases in AI essay scoring\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 78–86\.External Links:[Link](https://aclanthology.org/2024.bea-1.7/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Lee, B\. Rudzewitz, and X\. Chen \(2024a\)Developing a pedagogically oriented interactive reading tool with teachers in the loops\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 115–125\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.9/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- S\. Lee, Y\. Cai, D\. Meng, Z\. Wang, and Y\. Wu \(2024b\)Unleashing large language models’ proficiency in zero\-shot essay scoring\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 181–198\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.10/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.10)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- B\. Leite and H\. Lopes Cardoso \(2025\)Advancing question generation with joint narrative and difficulty control\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 647–659\.External Links:[Link](https://aclanthology.org/2025.bea-1.46/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.46),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- S\. Li and V\. Ng \(2024\)Automated essay scoring: a reflection on the state of the art\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 17876–17888\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.991/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.991)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1),[item 2](https://arxiv.org/html/2606.13691#S7.I1.i2.p1.1)\.
- S\. Li and V\. Ng \(2025\)Graph\-based multi\-trait essay scoring\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 33325–33351\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1691/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1691),ISBN 979\-8\-89176\-332\-6Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- T\. Li, Z\. Liu, L\. Matsumura, E\. Wang, D\. Litman, and R\. Correnti \(2024\)Using large language models to assess young students’ writing revisions\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 365–380\.External Links:[Link](https://aclanthology.org/2024.bea-1.30/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- W\. Li, W\. Luo, G\. Peng, and H\. Wang \(2025\)Explanation based in\-context demonstrations retrieval for multilingual grammatical error correction\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 4881–4897\.External Links:[Link](https://aclanthology.org/2025.naacl-long.251/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.251),ISBN 979\-8\-89176\-189\-6Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- H\. H\. Lim and J\. Lee \(2024\)Improving readability assessment with ordinal log\-loss\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 343–350\.External Links:[Link](https://aclanthology.org/2024.bea-1.28/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- Z\. Liu, S\. X\. Yin, D\. H\. Goh, and N\. Chen \(2025\)COGENT: a curriculum\-oriented framework for generating grade\-appropriate educational content\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 129–143\.External Links:[Link](https://aclanthology.org/2025.bea-1.10/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.10),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- S\. Löber, B\. Rudzewitz, D\. V\. Souto, L\. Ribeiro\-Flucht, and X\. Chen \(2024\)Developing a web\-based intelligent language assessment platform powered by natural language processing technologies\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 126–136\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.10/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- A\. Luhtaru, E\. Korotkova, and M\. Fishel \(2024\)No error left behind: multilingual grammatical error correction with pre\-trained translation models\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),Y\. Graham and M\. Purver \(Eds\.\),St\. Julian’s, Malta,pp\. 1209–1222\.External Links:[Link](https://aclanthology.org/2024.eacl-long.73/),[Document](https://dx.doi.org/10.18653/v1/2024.eacl-long.73)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- W\. Ma, M\. Flor, and Z\. Wang \(2025\)Automatic generation of inference making questions for reading comprehension assessments\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 398–414\.External Links:[Link](https://aclanthology.org/2025.bea-1.31/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.31),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- Z\. Mao, A\. Bisliouk, R\. Nama, and I\. Ruchkin \(2025\)Temporalizing confidence: evaluation of chain\-of\-thought reasoning with signal temporal logic\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 882–890\.External Links:[Link](https://aclanthology.org/2025.bea-1.65/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.65),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- J\. Marciniak, M\. Kubis, M\. Gulczyński, A\. Szpilkowski, A\. Wieczarek, and M\. Szczepański \(2025\)Improving AI assistants embedded in short e\-learning courses with limited textual content\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 794–804\.External Links:[Link](https://aclanthology.org/2025.bea-1.57/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.57),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- D\. Martynova, J\. Macina, N\. Daheim, N\. Yalcin, X\. Zhang, and M\. Sachan \(2025\)Can LLMs effectively simulate human learners? teachers’ insights from tutoring LLM students\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 100–117\.External Links:[Link](https://aclanthology.org/2025.bea-1.8/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.8),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.8.7.2.1.1)\.
- A\. Masciolini, A\. Caines, O\. De Clercq, J\. Kruijsbergen, M\. Kurfalı, R\. Muñoz Sánchez, E\. Volodina, and R\. Östling \(2025\)The MultiGEC\-2025 shared task on multilingual grammatical error correction at NLP4CALL\.InProceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning,R\. Muñoz Sánchez, D\. Alfter, E\. Volodina, and J\. Kallas \(Eds\.\),Tallinn, Estonia,pp\. 1–33\.External Links:[Link](https://aclanthology.org/2025.nlp4call-1.1/),ISBN 978\-9908\-53\-112\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1),[§4](https://arxiv.org/html/2606.13691#S4.SS0.SSS0.Px2.p1.1)\.
- N\. Michael and A\. Horbach \(2025\)GermDetect: verb placement error detection datasets for learners of Germanic languages\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 818–829\.External Links:[Link](https://aclanthology.org/2025.bea-1.59/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.59),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- M\. Minsky \(1974\)A framework for representing knowledge\.MIT Artificial Intelligence Laboratory Memo306\.External Links:[Link](http://hdl.handle.net/1721.1/6089)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p1.1)\.
- A\. Mirabella and D\. Brunato \(2025\)Exploring LLM\-based assessment of Italian middle school writing: a pilot study\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 708–715\.External Links:[Link](https://aclanthology.org/2025.bea-1.51/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.51),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Mita, K\. Sakaguchi, M\. Hagiwara, T\. Mizumoto, J\. Suzuki, and K\. Inui \(2024\)Towards automated document revision: grammatical error correction, fluency edits, and beyond\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 251–265\.External Links:[Link](https://aclanthology.org/2024.bea-1.21/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- R\. Miyata, T\. Urakawa, H\. Tamori, and T\. Kajiwara \(2025\)Unsupervised sentence readability estimation based on parallel corpora for text simplification\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 499–504\.External Links:[Link](https://aclanthology.org/2025.bea-1.36/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.36),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- P\. Mulcaire and N\. Madnani \(2025\)Span labeling with large language models: shell vs\. meat\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 850–859\.External Links:[Link](https://aclanthology.org/2025.bea-1.62/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.62),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- R\. Muñoz Sánchez, D\. Alfter, S\. Dobnik, M\. I\. Szawerna, and E\. Volodina \(2024a\)Jingle BERT, frozen all the way: freezing layers to identify CEFR levels of second language learners using BERT\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 137–152\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.11/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- R\. Muñoz Sánchez, S\. Dobnik, and E\. Volodina \(2024b\)Harnessing GPT to study second language learner essays: can we use perplexity to determine linguistic competence?\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 414–427\.External Links:[Link](https://aclanthology.org/2024.bea-1.34/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- K\. N J, K\. Bhatt, G\. Ramakrishnan, and P\. Jyothi \(2025\)LEVOS: leveraging vocabulary overlap with Sanskrit to generate technical lexicons in Indian languages\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 258–265\.External Links:[Link](https://aclanthology.org/2025.bea-1.20/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.20),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- K\. Nebhi, A\. Panesar, and H\. Bantilan \(2025\)End\-to\-end automated item generation and scoring for adaptive English writing assessment with large language models\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 968–977\.External Links:[Link](https://aclanthology.org/2025.bea-1.73/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.73),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- A\. Newell, J\. C\. Shaw, and H\. Simon \(1958\)Elements of a theory of human problem solving\.Psychological Review65,pp\. 151–166\.External Links:[Link](https://doi.org/10.1037/h0048495)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p1.1)\.
- H\. T\. Ng, S\. M\. Wu, T\. Briscoe, C\. Hadiwinoto, R\. H\. Susanto, and C\. Bryant \(2014\)The CoNLL\-2014 shared task on grammatical error correction\.InProceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task,H\. T\. Ng, S\. M\. Wu, T\. Briscoe, C\. Hadiwinoto, R\. H\. Susanto, and C\. Bryant \(Eds\.\),Baltimore, Maryland,pp\. 1–14\.External Links:[Link](https://aclanthology.org/W14-1701/),[Document](https://dx.doi.org/10.3115/v1/W14-1701)Cited by:[§4](https://arxiv.org/html/2606.13691#S4.SS0.SSS0.Px3.p1.1)\.
- I\. Nikolova\-Stoupak, S\. Bibauw, A\. Dumont, F\. Stas, P\. Watrin, and T\. François \(2024\)Generating contexts for ESP vocabulary exercises with LLMs\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 153–175\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.12/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- K\. Omelianchuk, A\. Liubonko, O\. Skurzhanskyi, A\. Chernodub, O\. Korniienko, and I\. Samokhin \(2024\)Pillars of grammatical error correction: comprehensive inspection of contemporary approaches in the era of large language models\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 17–33\.External Links:[Link](https://aclanthology.org/2024.bea-1.3/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- OpenAI \(2026\)ChatGPT\.Large language model\.Note:[https://chat\.openai\.com](https://chat.openai.com/)Accessed: May 13, 2026Cited by:[§9](https://arxiv.org/html/2606.13691#S9.p1.1)\.
- R\. Östling, M\. Kurfali, and A\. Caines \(2025\)LLM\-based post\-editing as reference\-free GEC evaluation\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 213–224\.External Links:[Link](https://aclanthology.org/2025.bea-1.16/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.16),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- A\. Pack, A\. Barrett, and J\. Escalante \(2024\)Large language models and automated essay scoring of English language learner writing: insights into validity and reliability\.Computers and Education: Artificial Intelligence 6,pp\. 100234\.External Links:[Document](https://dx.doi.org/10.1016/j.caeai.2024.100234)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p1.1)\.
- B\. Paddags, D\. Hershcovich, and V\. Savage \(2024\)Automated sentence generation for a spaced repetition software\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 351–364\.External Links:[Link](https://aclanthology.org/2024.bea-1.29/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- F\. Padovani, C\. Marchesi, E\. Pasqua, M\. Galletti, and D\. Nardi \(2024\)Automatic text simplification: a comparative study in Italian for children with language disorders\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 176–186\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.13/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- S\. Pal Chowdhury, N\. Daheim, E\. Kochmar, J\. Macina, D\. Rooein, M\. Sachan, and S\. Sonkar \(2025a\)Large language models for education: understanding the needs of stakeholders, current capabilities and the path forward\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 1–10\.External Links:[Link](https://aclanthology.org/2025.bea-1.1/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.1),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- S\. Pal Chowdhury, T\. J\. Zhang, D\. Rooein, D\. Hovy, T\. Käser, and M\. Sachan \(2025b\)Educators’ perceptions of large language models as tutors: comparing human and AI tutors in a blind text\-only setting\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 356–374\.External Links:[Link](https://aclanthology.org/2025.bea-1.28/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.28),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- S\. Papert \(1980\)Mindstorms: children, computers, and powerful ideas\.Harvester Press\.Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p1.1)\.
- N\. Parikh, A\. Scarlatos, N\. Fernandez, S\. Woodhead, and A\. Lan \(2025\)LookAlike: consistent distractor generation in math MCQs\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 294–311\.External Links:[Link](https://aclanthology.org/2025.bea-1.23/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.23),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- G\. Park, J\. Song, G\. Choi, J\. Sun, and H\. Kim \(2025\)K\-NLPers at BEA 2025 shared task: evaluating the quality of AI tutor responses with GPT\-4\.1\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 1145–1163\.External Links:[Link](https://aclanthology.org/2025.bea-1.90/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.90),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- J\. A\. Pérez\-Ortiz, M\. Esplà\-Gomis, V\. M\. Sánchez\-Cartagena, F\. Sánchez\-Martínez, R\. Chernysh, G\. Mora\-Rodríguez, and L\. Berezhnoy \(2024\)A conversational intelligent tutoring system for improving English proficiency of non\-native speakers via debriefing of online meeting transcriptions\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 187–198\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.14/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- K\. Petukhova and E\. Kochmar \(2025\)Intent matters: enhancing AI tutoring with fine\-grained pedagogical intent annotation\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 860–872\.External Links:[Link](https://aclanthology.org/2025.bea-1.63/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.63),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- Y\. Poon, Q\. Wang, J\. S\. Y\. Lee, Y\. Y\. Lam, and S\. K\. W\. Chu \(2025\)PIRLS category\-specific question generation for reading comprehension\.InProceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning,R\. Muñoz Sánchez, D\. Alfter, E\. Volodina, and J\. Kallas \(Eds\.\),Tallinn, Estonia,pp\. 72–80\.External Links:[Link](https://aclanthology.org/2025.nlp4call-1.6/),ISBN 978\-9908\-53\-112\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- M\. Qiu, T\. M\. Nguyen, Z\. Huang, Z\. Li, Y\. Gu, Q\. Gao, S\. Liu, and J\. Park \(2025\)Multilingual grammatical error annotation: combining language\-agnostic framework with language\-specific flexibility\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 202–212\.External Links:[Link](https://aclanthology.org/2025.bea-1.15/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.15),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- M\. R\. Qorib, A\. F\. Aji, and H\. T\. Ng \(2024\)Efficient and interpretable grammatical error correction with mixture of experts\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 17127–17138\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.997/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.997)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- C\. Qwaider, B\. Alhafni, K\. Chirkunov, N\. Habash, and T\. Briscoe \(2025\)Enhancing Arabic automated essay scoring with synthetic data and error injection\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 549–563\.External Links:[Link](https://aclanthology.org/2025.bea-1.40/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.40),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- S\. Rezayi, L\. A\. Ha, Y\. Zhou, A\. Houriet, A\. D’Addario, P\. Baldwin, P\. Harik, A\. King, and V\. Yaneva \(2025\)Automated scoring of communication skills in physician\-patient interaction: balancing performance and scalability\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 891–897\.External Links:[Link](https://aclanthology.org/2025.bea-1.66/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.66),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- L\. Ribeiro\-Flucht, X\. Chen, and D\. Meurers \(2024\)Explainable AI in language learning: linking empirical evidence and theoretical concepts in proficiency and readability modeling of Portuguese\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 199–209\.External Links:[Link](https://aclanthology.org/2024.bea-1.17/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- L\. Ribeiro\-Flucht, X\. Chen, and D\. Meurers \(2025\)A framework for proficiency\-aligned grammar practice in LLM\-based dialogue systems\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 978–987\.External Links:[Link](https://aclanthology.org/2025.bea-1.74/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.74),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- D\. J\. Roaché \(2017\)Intercoder Reliability Techniques: Percent Agreement\.SAGE Publications, Inc,California\.External Links:ISBN 978\-1\-4833\-8143\-5 978\-1\-4833\-8141\-1,[Link](https://methods.sagepub.com/reference/the-sage-encyclopedia-of-communication-research-methods/i6953.xml),[Document](https://dx.doi.org/10.4135/9781483381411.n260)Cited by:[§3](https://arxiv.org/html/2606.13691#S3.SS0.SSS0.Px3.p1.1)\.
- D\. Rooein, P\. Röttger, A\. Shaitarova, and D\. Hovy \(2024\)Beyond Flesch\-Kincaid: prompt\-based metrics improve difficulty classification of educational texts\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 54–67\.External Links:[Link](https://aclanthology.org/2024.bea-1.5/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- A\. Rozovskaya \(2024\)Universal Dependencies for learner Russian\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 17112–17119\.External Links:[Link](https://aclanthology.org/2024.lrec-main.1486/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- A\. Sakunkoo and J\. Sakunkoo \(2025\)Name of thrones: how do LLMs rank student names in status hierarchies based on race and gender?\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 697–707\.External Links:[Link](https://aclanthology.org/2025.bea-1.50/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.50),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- I\. Sastre, L\. Alfonso, F\. Fleitas, F\. Gil, A\. Lucas, T\. Spoturno, S\. Góngora, A\. Rosá, and L\. Chiruzzo \(2024\)RETUYT\-INCO at MLSP 2024: experiments on language simplification using embeddings, classifiers and large language models\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 618–626\.External Links:[Link](https://aclanthology.org/2024.bea-1.56/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- A\. Säuberli, D\. Frassinelli, and B\. Plank \(2025\)Do LLMs give psychometrically plausible responses in educational assessments?\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 266–278\.External Links:[Link](https://aclanthology.org/2025.bea-1.21/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.21),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- N\. Scaria, S\. D\. Chenna, and D\. Subramani \(2024\)How good are modern LLMs in generating relevant and high\-quality questions at different Bloom’s skill levels for Indian high school social science curriculum?\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 1–10\.External Links:[Link](https://aclanthology.org/2024.bea-1.1/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- A\. Scarlatos, W\. Feng, D\. Smith, S\. Woodhead, and A\. Lan \(2024\)Improving automated distractor generation for math multiple\-choice questions with overgenerate\-and\-rank\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 222–231\.External Links:[Link](https://aclanthology.org/2024.bea-1.19/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- N\. Schaller, Y\. Ding, A\. Horbach, J\. Meyer, and T\. Jansen \(2024\)Fairness in automated essay scoring: a comparative analysis of algorithms on German learner essays from secondary education\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 210–221\.External Links:[Link](https://aclanthology.org/2024.bea-1.18/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- N\. Schaller, Y\. Ding, T\. Jansen, and A\. Horbach \(2025\)Don’t score too early\! evaluating argument mining models on incomplete essays\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 345–355\.External Links:[Link](https://aclanthology.org/2025.bea-1.27/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.27),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- V\. Schmalz and A\. Tack \(2025\)Can GPTZero’s AI vocabulary distinguish between LLM\-generated and student\-written essays?\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 937–952\.External Links:[Link](https://aclanthology.org/2025.bea-1.71/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.71),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- O\. Seminck, Y\. Dupont, M\. Dehouck, Q\. Wang, N\. Durandard, and M\. Novikov \(2025\)Lattice @MultiGEC\-2025: a spitful multilingual language error correction system using LLaMA\.InProceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning,R\. Muñoz Sánchez, D\. Alfter, E\. Volodina, and J\. Kallas \(Eds\.\),Tallinn, Estonia,pp\. 34–41\.External Links:[Link](https://aclanthology.org/2025.nlp4call-1.2/),ISBN 978\-9908\-53\-112\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- M\. Shardlow, F\. Alva\-Manchego, R\. Batista\-Navarro, S\. Bott, S\. Calderon Ramirez, R\. Cardon, T\. François, A\. Hayakawa, A\. Horbach, A\. Hülsing, Y\. Ide, J\. M\. Imperial, A\. Nohejl, K\. North, L\. Occhipinti, N\. P\. Rojas, N\. Raihan, T\. Ranasinghe, M\. S\. Salazar, S\. Štajner, M\. Zampieri, and H\. Saggion \(2024\)The BEA 2024 shared task on the multilingual lexical simplification pipeline\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 571–589\.External Links:[Link](https://aclanthology.org/2024.bea-1.51/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1),[§4](https://arxiv.org/html/2606.13691#S4.SS0.SSS0.Px2.p1.1)\.
- M\. Sharma and J\. Zhang \(2025\)Decoding actionability: a computational analysis of teacher observation feedback\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 898–907\.External Links:[Link](https://aclanthology.org/2025.bea-1.67/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.67),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- K\. Shi and K\. Mangalam \(2025\)UPSC2M: benchmarking adaptive learning from two million MCQ attempts\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 931–936\.External Links:[Link](https://aclanthology.org/2025.bea-1.70/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.70),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.8.7.2.1.1)\.
- T\. Shibata and Y\. Miyamura \(2025\)LCES: zero\-shot automated essay scoring via pairwise comparisons using large language models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 29988–30001\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1523/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1523),ISBN 979\-8\-89176\-332\-6Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Shimabukuro, D\. Panchal, and C\. Collins \(2025\)LangEye: toward ‘anytime’ learner\-driven vocabulary learning from real\-world objects\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 446–459\.External Links:[Link](https://aclanthology.org/2025.bea-1.33/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.33),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- A\. Singh, M\. Torrance, and E\. Chukharev \(2025\)EyeLLM: using lookback fixations to enhance human\-LLM alignment for text completion\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 841–849\.External Links:[Link](https://aclanthology.org/2025.bea-1.61/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.61),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- L\. Siyan, T\. Shao, J\. Hirschberg, and Z\. Yu \(2024\)Using adaptive empathetic responses for teaching English\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 34–53\.External Links:[Link](https://aclanthology.org/2024.bea-1.4/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- L\. Skidmore, M\. Felice, and K\. Dunn \(2025\)Transformer architectures for vocabulary test item difficulty prediction\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 160–174\.External Links:[Link](https://aclanthology.org/2025.bea-1.12/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.12),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- A\. Sorokin and R\. Nasyrova \(2025\)LLMs in alliance with edit\-based models: advancing in\-context learning for grammatical error correction by specific example selection\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 517–534\.External Links:[Link](https://aclanthology.org/2025.bea-1.38/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.38),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- K\. A\. Srivatsa, K\. Maurya, and E\. Kochmar \(2025a\)Can LLMs reliably simulate real students’ abilities in mathematics and reading comprehension?\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 988–1001\.External Links:[Link](https://aclanthology.org/2025.bea-1.75/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.75),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.8.7.2.1.1)\.
- K\. A\. Srivatsa, K\. K\. Maurya, and E\. Kochmar \(2025b\)LLMs cannot spot math errors, even when allowed to peek into the solution\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 10914–10928\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.553/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.553),ISBN 979\-8\-89176\-332\-6Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Stahl, L\. Biermann, A\. Nehring, and H\. Wachsmuth \(2024\)Exploring LLM prompting strategies for joint essay scoring and feedback generation\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 283–298\.External Links:[Link](https://aclanthology.org/2024.bea-1.23/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- F\. Stahlberg and S\. Kumar \(2024\)Synthetic data generation for low\-resource grammatical error correction with tagged corruption models\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 11–16\.External Links:[Link](https://aclanthology.org/2024.bea-1.2/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- R\. Staruch, F\. Gralinski, and D\. Dzienisiewicz \(2025\)Adapting LLMs for minimal\-edit grammatical error correction\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 118–128\.External Links:[Link](https://aclanthology.org/2025.bea-1.9/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.9),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- R\. Staruch \(2025\)UAM\-CSI at MultiGEC\-2025: parameter\-efficient LLM fine\-tuning for multilingual grammatical error correction\.InProceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning,R\. Muñoz Sánchez, D\. Alfter, E\. Volodina, and J\. Kallas \(Eds\.\),Tallinn, Estonia,pp\. 42–49\.External Links:[Link](https://aclanthology.org/2025.nlp4call-1.3/),ISBN 978\-9908\-53\-112\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- B\. Stearns, N\. Ballier, T\. Gaillat, A\. Simpkin, and J\. P\. McCrae \(2024\)Evaluating the generalisation of an artificial learner\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 199–208\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.15/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.8.7.2.1.1)\.
- K\. Stowe, B\. Longwill, A\. Francis, T\. Aoyama, D\. Ghosh, and S\. Somasundaran \(2024\)Identifying fairness issues in automatically generated testing content\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 232–250\.External Links:[Link](https://aclanthology.org/2024.bea-1.20/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- D\. Strohmaier and P\. Buttery \(2024\)Semantic error prediction: estimating word production complexity\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 209–225\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.16/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- J\. Su, Y\. Yan, F\. Fu, Z\. Han, J\. Ye, X\. Liu, J\. Huo, H\. Zhou, and X\. Hu \(2025\)EssayJudge: a multi\-granular benchmark for assessing automated essay scoring capabilities of multimodal large language models\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 6363–6389\.External Links:[Link](https://aclanthology.org/2025.findings-acl.329/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.329),ISBN 979\-8\-89176\-256\-5Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- H\. Sung, K\. Csuros, and M\. Sung \(2025\)Comparing human and LLM proofreading in L2 writing: impact on lexical and syntactic features\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 11–23\.External Links:[Link](https://aclanthology.org/2025.bea-1.2/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.2),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- H\. Suresh and J\. Guttag \(2021\)A framework for understanding sources of harm throughout the machine learning life cycle\.InEquity and Access in Algorithms, Mechanisms, and Optimization \(EAAMO\),External Links:[Link](https://doi.org/10.1145/3465416.3483305)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p2.1)\.
- A\. Tack, S\. Buseyne, C\. Chen, R\. D’hondt, M\. De Vrindt, A\. Gharahighehi, S\. Metwaly, F\. K\. Nakano, and A\. Noreillie \(2024\)ITEC at BEA 2024 shared task: predicting difficulty and response time of medical exam questions with statistical, machine learning, and language models\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 512–521\.External Links:[Link](https://aclanthology.org/2024.bea-1.43/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- A\. Tack \(2024\)ITEC at MLSP 2024: transferring predictions of lexical difficulty from non\-native readers\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 635–639\.External Links:[Link](https://aclanthology.org/2024.bea-1.58/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- C\. Tang, F\. Qu, and Y\. Wu \(2024\)Ungrammatical\-syntax\-based in\-context example selection for grammatical error correction\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 1758–1770\.External Links:[Link](https://aclanthology.org/2024.naacl-long.99/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.99)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- R\. Tiwari and P\. Rastogi \(2025\)Phaedrus at BEA 2025 shared task: assessment of mathematical tutoring dialogues through tutor identity classification and actionability evaluation\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 1098–1107\.External Links:[Link](https://aclanthology.org/2025.bea-1.85/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.85),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- G\. Toussaint, Y\. Parmentier, and C\. Gardent \(2024\)GRAMEX: generating controlled grammar exercises from various sources\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 226–234\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.17/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1)\.
- N\. Tran, D\. Litman, B\. Pierce, R\. Correnti, and L\. C\. Matsumura \(2025\)Improving in\-context learning example retrieval for classroom discussion assessment with re\-ranking and label ratio regulation\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 752–764\.External Links:[Link](https://aclanthology.org/2025.bea-1.54/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.54),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- G\. Tyen, A\. Caines, and P\. Buttery \(2024\)LLM chatbots as a language practice tool: a user study\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 235–247\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.18/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- T\. Überrück\-Fries, A\. Savary, and A\. Dryjańska \(2024\)Sailing through multiword expression identification with Wiktionary and linguse: a case study of language learning\.InProceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning,T\. Gaillat, C\. Mallart, F\. Moreau, J\. Li, G\. Drouet, D\. Alfter, E\. Volodina, and A\. Jönsson \(Eds\.\),Rennes, France,pp\. 248–262\.External Links:[Link](https://aclanthology.org/2024.nlp4call-1.19/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- A\. Y\. Uluslu and G\. Schneider \(2025\)Investigating linguistic abilities of LLMs for native language identification\.InProceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning,R\. Muñoz Sánchez, D\. Alfter, E\. Volodina, and J\. Kallas \(Eds\.\),Tallinn, Estonia,pp\. 81–88\.External Links:[Link](https://aclanthology.org/2025.nlp4call-1.7/),ISBN 978\-9908\-53\-112\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.9.8.2.1.1)\.
- UNESCO \(2026\)Global report on teachers: addressing teacher shortages and transforming the profession\.External Links:[Link](https://www.unesco.org/en/articles/global-report-teachers-addressing-teacher-shortages-and-transforming-profession)Cited by:[§1](https://arxiv.org/html/2606.13691#S1.p1.1)\.
- F\. Urrutia, C\. Buc, R\. Araya, and V\. Barriere \(2025\)Unsupervised automatic short answer grading and essay scoring: a weakly supervised explainable approach\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 38–54\.External Links:[Link](https://aclanthology.org/2025.bea-1.4/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.4),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Vainikko, T\. Kamarik, K\. Kert, K\. Liin, S\. Maine, K\. Allkivi, A\. Kaivapalu, and M\. Fishel \(2025\)Paragraph\-level error correction and explanation generation: case study for Estonian\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 953–967\.External Links:[Link](https://aclanthology.org/2025.bea-1.72/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.72),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- H\. Veeramani, S\. Thapa, N\. B\. Shankar, and A\. Alwan \(2024\)Large language model\-based pipeline for item difficulty and response time estimation for educational assessments\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 561–566\.External Links:[Link](https://aclanthology.org/2024.bea-1.49/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- G\. Velentzas, A\. Caines, R\. Borgo, E\. Pacquetet, C\. Hamilton, T\. Arnold, D\. Nicholls, P\. Buttery, T\. Gaillat, N\. Ballier, and H\. Yannakoudakis \(2024\)Logging keystrokes in writing by English learners\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 10725–10746\.External Links:[Link](https://aclanthology.org/2024.lrec-main.938/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- A\. Vu, J\. Hou, A\. Katinskaia, C\. Sheu, and R\. Yangarber \(2025\)A Bayesian approach to inferring prerequisite structures and topic difficulty in language learning\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 737–751\.External Links:[Link](https://aclanthology.org/2025.bea-1.53/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.53),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- D\. Wang, C\. Yang, and G\. Chen \(2025a\)Wonderland\_EDU@HKU at BEA 2025 shared task: fine\-tuning large language models to evaluate the pedagogical ability of AI\-powered tutors\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 1040–1048\.External Links:[Link](https://aclanthology.org/2025.bea-1.79/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.79),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- J\. Wang, Y\. Dai, Y\. Zhang, Z\. Ma, W\. Li, and J\. Chai \(2025b\)Training turn\-by\-turn verifiers for dialogue tutoring agents: the curious case of LLMs as your coding tutors\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 12416–12436\.External Links:[Link](https://aclanthology.org/2025.findings-acl.642/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.642),ISBN 979\-8\-89176\-256\-5Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- J\. Wang, A\. Rutkiewicz, A\. Wang, and M\. Sachan \(2025c\)Generating pedagogically meaningful visuals for math word problems: a new benchmark and analysis of text\-to\-image models\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 11229–11257\.External Links:[Link](https://aclanthology.org/2025.findings-acl.586/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.586),ISBN 979\-8\-89176\-256\-5Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.6.5.2.1.1),[item 1](https://arxiv.org/html/2606.13691#S7.I1.i1.p1.1)\.
- X\. Wang, D\. Yuan, X\. Liu, Y\. Zhao, X\. Zhang, X\. Chen, and Y\. Lan \(2025d\)VisCGEC: benchmarking the visual Chinese grammatical error correction\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 5054–5068\.External Links:[Link](https://aclanthology.org/2025.naacl-long.261/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.261),ISBN 979\-8\-89176\-189\-6Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- X\. Wang, L\. Mu, J\. Zhang, and H\. Xu \(2024a\)Multi\-pass decoding for grammatical error correction\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 9904–9916\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.553/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.553)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- Y\. Wang, B\. Wang, Y\. Liu, D\. Wu, and W\. Che \(2024b\)LM\-combiner: a contextual rewriting model for Chinese grammatical error correction\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 10675–10685\.External Links:[Link](https://aclanthology.org/2024.lrec-main.934/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- Y\. Wang, R\. Hu, and Z\. Zhao \(2024c\)Beyond agreement: diagnosing the rationale alignment of automated essay scoring methods based on linguistically\-informed counterfactuals\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 8906–8925\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.520/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.520)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- World Inequality Lab \(2026\)World inequality report 2026\.Note:\[Accessed 31\-03\-2026\][https://wir2026\.wid\.world/www\-site/uploads/2026/01/World\_Inequality\_Report\_2026\.pdf](https://wir2026.wid.world/www-site/uploads/2026/01/World_Inequality_Report_2026.pdf)External Links:[Link](https://wir2026.wid.world/www-site/uploads/2026/01/World_Inequality_Report_2026.pdf)Cited by:[§1](https://arxiv.org/html/2606.13691#S1.p1.1)\.
- T\. Wu, H\. Chen, L\. Qin, Z\. Cao, and C\. Ai \(2024\)Improving copy\-oriented text generation via EDU copy mechanism\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 8768–8780\.External Links:[Link](https://aclanthology.org/2024.lrec-main.768/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- Y\. Yan, H\. Liu, and T\. Chau \(2025\)A systematic review of AI ethics in education: challenges, policy gaps, and future directions\.Journal of Global Information Management33\(1\),pp\. 1–50\.External Links:[Document](https://dx.doi.org/10.4018/JGIM.386381)Cited by:[§2](https://arxiv.org/html/2606.13691#S2.p3.1),[§2](https://arxiv.org/html/2606.13691#S2.p4.1)\.
- K\. P\. Yancey, A\. Runge, G\. LaFlair, and P\. Mulcaire \(2024\)BERT\-IRT: accelerating item piloting with BERT embeddings and explainable IRT models\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 428–438\.External Links:[Link](https://aclanthology.org/2024.bea-1.35/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- V\. Yaneva, K\. North, P\. Baldwin, L\. A\. Ha, S\. Rezayi, Y\. Zhou, S\. Ray Choudhury, P\. Harik, and B\. Clauser \(2024a\)Findings from the first shared task on automated prediction of difficulty and response time for multiple\-choice questions\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 470–482\.External Links:[Link](https://aclanthology.org/2024.bea-1.39/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1),[§4](https://arxiv.org/html/2606.13691#S4.SS0.SSS0.Px2.p1.1)\.
- V\. Yaneva, K\. Y\. Suen, L\. A\. Ha, J\. Mee, M\. Quranda, and P\. Harik \(2024b\)Automated scoring of clinical patient notes: findings from the Kaggle competition and their translation into practice\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 87–98\.External Links:[Link](https://aclanthology.org/2024.bea-1.8/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- H\. Yang and X\. Quan \(2024\)Alirector: alignment\-enhanced Chinese grammatical error corrector\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 2531–2546\.External Links:[Link](https://aclanthology.org/2024.findings-acl.148/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.148)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- H\. Yang, Z\. Liu, and S\. Wulff \(2025\)Using NLI to identify potential collocation transfer in L2 English\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 687–696\.External Links:[Link](https://aclanthology.org/2025.bea-1.49/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.49),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.
- S\. Yarmohammadtoosky, Y\. Zhou, V\. Yaneva, P\. Baldwin, S\. Rezayi, B\. Clauser, and P\. Harik \(2025\)Enhancing security and strengthening defenses in automated short\-answer grading systems\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 830–840\.External Links:[Link](https://aclanthology.org/2025.bea-1.60/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.60),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Yasser, M\. Saeed, H\. Elkordi, and A\. Khalafallah \(2025\)Averroes at BEA 2025 shared task: verifying mistake identification in tutor, student dialogue\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 1121–1126\.External Links:[Link](https://aclanthology.org/2025.bea-1.87/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.87),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.5.4.2.1.1)\.
- J\. Ye, Z\. Xu, Y\. Li, L\. Song, Q\. Zhou, H\. Zheng, Y\. Shen, W\. Jiang, H\. Kim, R\. Liu, X\. Su, and Z\. Shan \(2025\)CLEME2\.0: towards interpretable evaluation by disentangling edits for grammatical error correction\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 204–222\.External Links:[Link](https://aclanthology.org/2025.acl-long.10/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.10),ISBN 979\-8\-89176\-251\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.3.2.2.1.1)\.
- H\. Yoo, J\. Han, S\. Ahn, and A\. Oh \(2025\)DREsS: dataset for rubric\-based essay scoring on EFL writing\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 13439–13454\.External Links:[Link](https://aclanthology.org/2025.acl-long.659/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.659),ISBN 979\-8\-89176\-251\-0Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- M\. Yousefpoori\-Naeim, S\. Zargari, and Z\. Hatami \(2024\)Using machine learning to predict item difficulty and response time in medical tests\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 551–560\.External Links:[Link](https://aclanthology.org/2024.bea-1.48/)Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.4.3.2.1.1)\.
- F\. Zehner, H\. J\. Shin, E\. Kerzabi, A\. Horbach, S\. Gombert, F\. Goldhammer, T\. Zesch, and N\. Andersen \(2025\)Down the cascades of omethi: hierarchical automatic scoring in large\-scale assessments\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 660–671\.External Links:[Link](https://aclanthology.org/2025.bea-1.47/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.47),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- T\. Zesch, D\. Gardner, and M\. Bexte \(2025\)Transformer\-based real\-word spelling error feedback with configurable confusion sets\.InProceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2025\),E\. Kochmar, B\. Alhafni, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Vienna, Austria,pp\. 375–383\.External Links:[Link](https://aclanthology.org/2025.bea-1.29/),[Document](https://dx.doi.org/10.18653/v1/2025.bea-1.29),ISBN 979\-8\-89176\-270\-1Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.2.1.2.1.1)\.
- Y\. Zhao, S\. Guo, Z\. Yang, S\. Han, D\. Lin, and F\. Tan \(2025\)More data or better data? a critical analysis of data selection and synthesis for mathematical reasoning\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track,S\. Potdar, L\. Rojas\-Barahona, and S\. Montella \(Eds\.\),Suzhou \(China\),pp\. 618–629\.External Links:[Link](https://aclanthology.org/2025.emnlp-industry.43/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-industry.43),ISBN 979\-8\-89176\-333\-3Cited by:[Table 8](https://arxiv.org/html/2606.13691#A5.T8.1.7.6.2.1.1)\.

## Appendix A A Structured Taxonomy for Ethical and Stakeholder Review of EduNLP Research

Figure [10](https://arxiv.org/html/2606.13691#A1.F10) presents the complete taxonomy developed in this review, offered here as a standalone contribution \(the figure was created in [app\.xmind\.com](https://arxiv.org/html/2606.13691v1/app.xmind.com)\)\. The taxonomy is organised around the three research questions and a concluding recommendations dimension\. Beyond its role in this review, the taxonomy can be reused for future surveys of EdTech research\. It can also serve as a practical self\-audit tool: researchers can use it to situate themselves within the EduNLP space and assess the rigour and inclusivity of their work before submission\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/taxonomy_matrix.png)

Figure 10: Detailed taxonomy for reviewing EduNLP research\.

## Appendix B ACL Anthology Main Conference Search

We retrieve all papers with at least one of the following search terms in the title or abstract\. The venues included are: ACL, EACL, NAACL, EMNLP, LREC\-COLING, and Findings, for the years 2024 and 2025\. The search terms were developed through internal discussion and discussion with other researchers in the EduNLP field\. The search terms are as follows:

- “automated essay scoring”,
- “automated writing evaluation”,
- “short answer grading”,
- “automatic short answer grading”,
- “open\-ended response assessment”,
- “automated assessment of spoken responses”,
- “spoken response scoring”,
- “speech\-based assessment”,
- “automatic speech scoring”,
- “dialogue\-based tutoring”,
- “spoken dialogue system education”,
- “intelligent tutoring systems NLP”,
- “student modeling”,
- “learner modeling”,
- “knowledge tracing”,
- “learner cognition modeling”,
- “educational data mining NLP”,
- “learning analytics text”,
- “game\-based learning assessment”,
- “stealth assessment”,
- “peer assessment NLP”,
- “peer review automated feedback”,
- “automated feedback generation”,
- “formative feedback writing”,
- “grammatical error correction”,
- “grammar error detection”,
- “lexical complexity prediction”,
- “text simplification for learners”,
- “multimodal learning analytics”,
- “generative AI in education”,
- “mathematic education”,
- “math education”,
- “math word problems”,
- “mathematical reasoning”,
- “student error in mathematics”,
- “intelligent tutoring system math”,
- “knowledge tracing mathematics”,
- “misconception detection mathematics”\.
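
As a rough illustration of the retrieval step described above, the following sketch filters a collection of paper records by venue, year, and search term. It assumes case\-insensitive substring matching in the title or abstract, which the appendix does not specify; the `Paper` record type, its field names, and the shortened `SEARCH_TERMS` list are hypothetical and only serve the example.

```python
from dataclasses import dataclass

# A small subset of the search terms listed above (the full list is longer).
SEARCH_TERMS = [
    "automated essay scoring",
    "grammatical error correction",
    "knowledge tracing",
    "math word problems",
]

# Venues and years named in this appendix.
VENUES = {"ACL", "EACL", "NAACL", "EMNLP", "LREC-COLING", "Findings"}
YEARS = {2024, 2025}


@dataclass
class Paper:
    # Hypothetical record type; the field names are illustrative only.
    title: str
    abstract: str
    venue: str
    year: int


def matches_search(paper: Paper, terms=SEARCH_TERMS) -> bool:
    """Return True if the paper is in scope and a term occurs in its title or abstract.

    Assumption: case-insensitive substring matching. The actual search
    interface used by the authors may have behaved differently.
    """
    if paper.venue not in VENUES or paper.year not in YEARS:
        return False
    title = paper.title.lower()
    abstract = paper.abstract.lower()
    return any(t.lower() in title or t.lower() in abstract for t in terms)
```

A paper is retained as soon as one term matches, mirroring the "at least one of the following search terms" criterion.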

## Appendix C Extraction Schema

For extracting the entities relevant to our research questions, we used the following schema:

- \[RQ2\] Author affiliations
- \[RQ1\] Specific task worked on
- \[RQ1\] Datasets used and availability
- \[RQ1\] Explicit motivation for the paper and associated quotes
- \[RQ2\] Stakeholders mentioned \(multi\-label; the options being Learner/student, Teacher, School/university, Paper author, Other researcher, Domain expert, Parent, Governmental body, Industry, Non\-profit \(which includes large standardised testing providers\), and None / N\.A\.\), and associated quotes
- \[RQ2\] Stakeholders included in the research \(multi\-label; same list as above\), as well as their respective level of inclusion \(multi\-label with High, Middling and Low\) and associated quotes for how they are included
- \[RQ1\] Context in which the system is deployed \(if any\) and relevant quotes
- \[RQ2\] Explicit stakeholder incentives
- \[RQ2\] Implicit stakeholder incentives
- \[RQ3\] Risks, concerns and limitations raised, associated quotes, level of engagement \(multi\-label with High, Middling and Low\) and measures taken to address risk
- \[RQ3\] Future directions/aspirations mentioned and relevant quotes
- \[RQ1\] Entities acknowledged \(including funding\)
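
To make the shape of one extraction record concrete, the schema above could be represented roughly as follows. This is a minimal sketch, not the authors' annotation tooling: the class and field names are hypothetical, the supporting quotes are omitted for brevity, and storing levels as sets is an assumption based on the schema describing them as multi\-label.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set


class Stakeholder(Enum):
    # Options listed in the schema above.
    LEARNER = "Learner/student"
    TEACHER = "Teacher"
    SCHOOL = "School/university"
    PAPER_AUTHOR = "Paper author"
    OTHER_RESEARCHER = "Other researcher"
    DOMAIN_EXPERT = "Domain expert"
    PARENT = "Parent"
    GOVERNMENT = "Governmental body"
    INDUSTRY = "Industry"
    NON_PROFIT = "Non-profit"
    NONE = "None / N.A."


class Level(Enum):
    # Used both for stakeholder inclusion and for engagement with risks.
    HIGH = "High"
    MIDDLING = "Middling"
    LOW = "Low"


@dataclass
class PaperExtraction:
    """One annotator's extraction for one paper (hypothetical field names)."""
    affiliations: List[str] = field(default_factory=list)                # RQ2
    task: str = ""                                                       # RQ1
    datasets: Dict[str, str] = field(default_factory=dict)               # RQ1: name -> availability
    motivation: str = ""                                                 # RQ1
    stakeholders_mentioned: Set[Stakeholder] = field(default_factory=set)                # RQ2
    stakeholders_included: Dict[Stakeholder, Set[Level]] = field(default_factory=dict)   # RQ2
    deployment_context: Optional[str] = None                             # RQ1: None if not deployed
    explicit_incentives: List[str] = field(default_factory=list)         # RQ2
    implicit_incentives: List[str] = field(default_factory=list)         # RQ2
    risks: Dict[str, Set[Level]] = field(default_factory=dict)           # RQ3: risk -> engagement
    future_directions: List[str] = field(default_factory=list)           # RQ3
    acknowledged_entities: List[str] = field(default_factory=list)       # RQ1
```

For example, `PaperExtraction(task="automated essay scoring")` yields an otherwise empty record that an annotator would then fill in field by field.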

Once phase \(3\) was completed, the three annotators independently extracted high\-level categories and labels from the data:

- \[RQ1\] Mapping specific tasks to high\-level tasks \(as reported in Figure [2](https://arxiv.org/html/2606.13691#S3.F2)\)
- \[RQ2\] Mapping free\-text dataset names to unique labels \(as reported in Figure [3](https://arxiv.org/html/2606.13691#S4.F3)\) and their availability \(as reported in Figure [11](https://arxiv.org/html/2606.13691#A6.F11)\)
- \[RQ1\] Categorising free\-text motivations to high\-level labels \(as reported in Figure [12](https://arxiv.org/html/2606.13691#A6.F12)\) and the mentioned stakeholders \(as reported in Figure [4](https://arxiv.org/html/2606.13691#S4.F4)\)
- \[RQ1\] Mapping context deployment to high\-level labels \(as reported in Figure [5](https://arxiv.org/html/2606.13691#S4.F5)\)
- \[RQ2\] Extracting standardised stakeholder types from explicit and implicit incentives \(as reported in Figure [9](https://arxiv.org/html/2606.13691#S5.F9)\)
- \[RQ3\] Mapping risks, concerns and limitations to high\-level categories \(as reported in Figure [18](https://arxiv.org/html/2606.13691#A11.F18)\) and the authors’ level of engagement associated with each \(as reported in Figure [19](https://arxiv.org/html/2606.13691#A12.F19)\)
- \[RQ3\] Mapping future directions and aspirations to high\-level categories \(as reported in Figure [20](https://arxiv.org/html/2606.13691#A12.F20)\)

For each dimension, the mapping was made by one of the three annotators independently\. This was considered sufficient given that the high\-level categories were derived directly from the extracted free\-text data rather than applied to raw papers: the iterative reconciliation process in phases \(1\) and \(2\) had already established shared interpretive norms among annotators, and the categorisation task at this stage involved consolidating labels that were already grounded in agreed extractions rather than making independent judgements about unseen material\.

## Appendix D Agreement Computation

Table 1: Agreement for free\-text annotation dimensions\. For dimensions with an asterisk, we computed per\-paper percentage agreement\. For the other dimensions, we computed majority percentages \(i\.e\. 0 if none of the annotators agree, 0\.67 if 2/3 annotators agree, and 1 if 3/3 annotators agree\)\. The values reported correspond to the averages across all 25 papers in the shared batch\. Computation examples are included in Table [6](https://arxiv.org/html/2606.13691#A4.T6) and Table [7](https://arxiv.org/html/2606.13691#A4.T7)\.

Table 2: Average percentage agreement \(PA\) and inter\-annotator agreement \(Krippendorff’s α\) for “stakeholders mentioned”\.

Table 3: Average percentage agreement \(PA\) and inter\-annotator agreement \(Krippendorff’s α\) for “stakeholders included”\.

Table 4: Average percentage agreement \(PA\) and inter\-annotator agreement \(Krippendorff’s α\) for “level of inclusion stakeholders included”\.

Table 5: Average percentage agreement \(PA\) and inter\-annotator agreement \(Krippendorff’s α\) for “level of engagement risks/concerns”\.

Table [1](https://arxiv.org/html/2606.13691#A4.T1) reports agreement for the free\-text dimensions\. Table [6](https://arxiv.org/html/2606.13691#A4.T6) includes examples of how percentage agreement \(PA\) was calculated for the following free\-text dimensions: “datasets used”, “risks/concerns”, “measures taken to address risks/concerns”, and “future directions”\. As these dimensions often included many different elements \(e\.g\., dataset names for the “datasets used” dimension and specific suggestions for future research for “future directions”\), PA was calculated in the form of pairwise agreements \(at the level of the paper, with the score reported in Table [1](https://arxiv.org/html/2606.13691#A4.T1) corresponding to the mean of these per\-paper pairwise agreement values\)\.
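
The pairwise computation can be sketched as follows. This is one plausible reading rather than the authors' exact procedure: it assumes each annotator's output for a paper is a set of normalised elements (e.g. dataset names), that every element named by at least one annotator is scored across all annotator pairs, and that a pair agrees on an element when both list it or both omit it. The function name and the `hypothetical_corpus` label are illustrative.

```python
from itertools import combinations
from typing import List, Set


def pairwise_agreement(annotations: List[Set[str]]) -> float:
    """Per-paper pairwise agreement over all elements and annotator pairs.

    A pair agrees on an element if both annotators include it or both
    omit it, so absence of annotation also counts as agreement.
    """
    elements = set().union(*annotations)
    if not elements:
        # Assumption: if nobody annotated anything, treat it as full agreement.
        return 1.0
    pairs = list(combinations(annotations, 2))
    agreements = sum(
        (element in a) == (element in b)
        for element in elements
        for a, b in pairs
    )
    return agreements / (len(elements) * len(pairs))


# Annotators 1 and 2 list one dataset; Annotator 3 also lists "English_BEA".
paper = [
    {"hypothetical_corpus"},
    {"hypothetical_corpus"},
    {"hypothetical_corpus", "English_BEA"},
]
score = pairwise_agreement(paper)  # 4 agreeing pair-element checks out of 6
```

Under these assumptions, the value reported per dimension would be the mean of `pairwise_agreement` over the papers in the shared batch.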

Table[7](https://arxiv.org/html/2606.13691#A4.T7)contains examples of how PA was calculated for the following free\-text dimensions: “task”, “dataset availability”, “methods”, “evaluation”, “motivation”, “deployment”, “explicit incentives”, “implicit incentives”, and “future deployment”\. As these dimensions virtually always included only a very limited number of core elements, PA was calculated in the form of majority percentages \(at the level of the paper, with the score reported in Table[1](https://arxiv.org/html/2606.13691#A4.T1)corresponding to the mean of these per\-paper majority percentage values\)\.
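
A minimal sketch of the majority percentage, assuming each annotator's free\-text annotation for a paper has already been normalised to a comparable label so that "full overlap" reduces to equality. The function names are illustrative, and the normalisation step itself is not shown.

```python
from collections import Counter
from statistics import mean
from typing import List


def majority_agreement(labels: List[str]) -> float:
    """Share of annotators in the largest group of identical labels.

    Returns 0.0 when no two annotators agree. With three annotators the
    possible values are 0, 2/3 and 1, as described in the text.
    """
    largest_group = Counter(labels).most_common(1)[0][1]
    if largest_group < 2:
        return 0.0
    return largest_group / len(labels)


def mean_majority_agreement(per_paper_labels: List[List[str]]) -> float:
    """Average the per-paper majority percentages across a batch of papers."""
    return mean(majority_agreement(labels) for labels in per_paper_labels)
```

For instance, labels `["AES", "AES", "GEC"]` for one paper give 2/3, and averaging that with a fully agreeing paper gives 5/6.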

Tables[2](https://arxiv.org/html/2606.13691#A4.T2)to[5](https://arxiv.org/html/2606.13691#A4.T5)further present the agreement for the multilabel dimensions \(stakeholders mentioned, stakeholders included, level of inclusion stakeholders included, and level of engagement risks/concerns\)\.
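
For readers unfamiliar with the coefficient, Krippendorff's α is conventionally defined from the observed disagreement $D_o$ and the disagreement expected by chance $D_e$. The general form is given below; the distance metric used for the multi\-label dimensions is not stated in this appendix, so no specific choice is assumed here.

```latex
\alpha = 1 - \frac{D_o}{D_e}
```

A value of $\alpha = 1$ indicates perfect agreement, while $\alpha = 0$ indicates agreement no better than chance.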

Table 6: Examples of percentage agreement \(PA\) computation for free\-text annotation dimensions\. “PWA” stands for pairwise agreement among the annotators \(i\.e\. Annotator 1 compared to 2, Annotator 1 compared to 3, and Annotator 2 compared to 3\)\. Note that absence of annotation also counts as agreement \(e\.g\., between Annotator 1 and 2 for the “English\_BEA” dataset\)\.

Table 7: Examples of majority agreement computation for free\-text annotation dimensions\. Possible values: 0% \(none of the annotations fully overlap\), 66\.67% \(full overlap for two of the three annotations\), or 100% \(full overlap for all three annotations\)\.

## Appendix E Mapping Papers to High\-Level Tasks

The mapping of the analysed papers to the eight high\-level tasks is presented in Table[8](https://arxiv.org/html/2606.13691#A5.T8)\.

Table 8: Mapping from high\-level tasks to individual papers and their specific tasks\.

## Appendix F Availability of Datasets

The availability of the datasets used in the analysed papers is presented in Figure[11](https://arxiv.org/html/2606.13691#A6.F11)\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/dataset_availability_v2_bar.png)

Figure 11: Availability of datasets used in the surveyed papers\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/motivations_why_v2.png)

Figure 12: Distribution of papers’ explicit motivations for the task; papers may belong to more than one category\.

## Appendix G Explicit Motivations

The distribution of the papers’ explicit motivations for the research conducted is visualised in Figure[12](https://arxiv.org/html/2606.13691#A6.F12)\.

## Appendix H Stakeholders Included and Mentioned

As the authors are the primary stakeholders of the research, we present demographic information in terms of country of author affiliation in Figure[15](https://arxiv.org/html/2606.13691#A9.F15)\. Secondly, Figure[13](https://arxiv.org/html/2606.13691#A8.F13)visualises the level of inclusion of the stakeholders included in the research\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/inclusion_level_overall_v2_bar.png)

Figure 13: Overall level of inclusion of included stakeholders; we distinguish 3 levels: High \(integral to research design & completion\), Middling \(involved in data evaluation or annotation, but have no input on research design\), and Low \(test subjects in data collection only\)\.

## Appendix I Acknowledged Entities

Figure [14](https://arxiv.org/html/2606.13691#A9.F14) depicts the “acknowledged countries” \(i\.e\., the number of papers per country of affiliation of entities acknowledged\), while Figure [16](https://arxiv.org/html/2606.13691#A9.F16) provides more details on the number of times entities were acknowledged\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/acknowledgements_countries.png)

Figure 14: Acknowledged countries \(i\.e\., the number of papers per country of affiliation of entities acknowledged in the surveyed papers\)\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/affiliations_countries.png)

Figure 15: Author countries \(i\.e\., the number of papers per country of author affiliation\)\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/acknowledgements_specifics_v2.png)

Figure 16: Acknowledged entities \(i\.e\., the number of times an entity was acknowledged in the surveyed papers\)\.

## Appendix J Relation between Tasks and Implicitly Benefitting Stakeholders

Figure[17](https://arxiv.org/html/2606.13691#A10.F17)presents a heat\-map that links the high\-level tasks to the combined explicitly and implicitly benefitting stakeholders\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/incentives_who_task_correaltion.png)

Figure 17: Heat\-map relating the high\-level tasks to the combined explicitly and implicitly benefitting stakeholders\.

## Appendix K Risks, Concerns and Limitations Breakdown

Figure[19](https://arxiv.org/html/2606.13691#A12.F19)distinguishes three levels of engagement with stated risks: High \(directly mitigated or discussed in substantial depth\), Middling \(discussed as part of future work\), and Low \(briefly mentioned only\)\. Across most risk categories, the majority of engagement is at a Low or Middling level\. Methodology limitations show 90% Middling engagement – they are widely acknowledged but rarely addressed in the current work\. Dataset limitations are 56% Middling\. Risk of bias, one of the most commonly cited concerns, is engaged at a High level in only 17% of papers that raise it; in 45% of cases it is Middling, and in 38% it is Low\.

High engagement is most consistently found in a small set of categories: fair compensation for stakeholders \(100% High, though only 6 papers raise this at all\), and to a lesser extent data protection \(65% High\)\. Risk of hallucination, lack of human evaluation, and the gap between research and real\-world application are predominantly engaged at Low or Middling levels – noted as future work, but rarely designed around\. This pattern suggests a community that is aware of the ethical dimensions of its work but has not yet developed consistent norms for acting on them within the scope of individual papers\.
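The per\-category percentages above are shares of engagement levels among the papers that raise a given risk\. As a minimal sketch of that arithmetic, assuming each annotation is a \(category, level\) pair, the breakdown could be computed as below\. The function name `engagement_shares` and the toy annotations are hypothetical and invented for illustration; they are not the authors' code or data\.

```python
from collections import Counter, defaultdict

# The three engagement levels named in the text.
LEVELS = ("High", "Middling", "Low")


def engagement_shares(annotations):
    """Return {category: {level: percentage}} from (category, level) pairs.

    Percentages are relative to the number of annotations in each category,
    i.e. to the papers that raise that risk, not to the whole corpus.
    """
    counts = defaultdict(Counter)
    for category, level in annotations:
        if level not in LEVELS:
            raise ValueError(f"unknown engagement level: {level!r}")
        counts[category][level] += 1

    shares = {}
    for category, by_level in counts.items():
        total = sum(by_level.values())
        shares[category] = {
            lvl: round(100 * by_level[lvl] / total, 1) for lvl in LEVELS
        }
    return shares


# Toy annotations, invented for illustration only (NOT the paper's data).
toy = [
    ("risk of bias", "High"),
    ("risk of bias", "Middling"),
    ("risk of bias", "Middling"),
    ("risk of bias", "Low"),
    ("fair compensation", "High"),
    ("fair compensation", "High"),
]

shares = engagement_shares(toy)
for category, by_level in shares.items():
    print(category, by_level)
```

Normalising within each category rather than over all papers is what allows a rarely raised risk, such as fair compensation, to reach 100% High from only a handful of papers\.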

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/risks_raised.png)
Figure 18: Risks, concerns and limitations explicitly raised by paper authors split across six high\-level categories \(showing the number of papers; note that a paper may report more than one area of risk\)\.

## Appendix L Areas of Future Work

Figure [20](https://arxiv.org/html/2606.13691#A12.F20) shows the areas of future work explicitly mentioned in the papers, split across five high\-level categories \(“stakeholder inclusion”, “technical development”, “expand scope”, “engage with issues”, and “none/not applicable”\)\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/risks_engagement.png)
Figure 19: Engagement levels for the risks, concerns and limitations explicitly raised by paper authors; we distinguish 3 levels of risk engagement: High \(a risk, concern or limitation that is directly mitigated in the paper or discussed in great depth\), Middling \(discussed as part of future work\), and Low \(briefly mentioned only\)\.

![Refer to caption](https://arxiv.org/html/2606.13691v1/images/future_work.png)
Figure 20: Areas of future work explicitly mentioned in the papers split across four high\-level categories \(showing the number of papers; note that a paper may report more than one area of future work\)\.
