Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction

arXiv cs.CL Papers

Summary

This paper introduces Grammatical Error Representation (GER), a novel method for retrieving in-context demonstrations based on error patterns rather than semantic similarity, significantly improving multilingual grammatical error correction performance in LLMs with in-context learning.

arXiv:2606.15416v1

# Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction
Source: [https://arxiv.org/html/2606.15416](https://arxiv.org/html/2606.15416)
Guangyue Peng, Wei Li, Wen Luo, Houfeng Wang

State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

{agy,wanghf}@pku.edu.cn, weili22@stu.pku.edu.cn, llvvvv22222@gmail.com

###### Abstract

Grammatical Error Correction (GEC) involves detecting and correcting incorrect usage of grammar. While large language models (LLMs) with in-context learning (ICL) capabilities have shown significant progress on various natural language processing (NLP) tasks, their few-shot performance on GEC remains suboptimal. This is mainly due to the challenge of retrieving suitable in-context demonstrations that capture error patterns instead of semantic similarity. In this paper, we demonstrate that LLMs can inherently capture information related to grammatical errors through their internal states. From these states, we extract the Grammatical Error Representation (GER), an informative and semantically neutral encoding of grammatical errors. Our novel GER-based retrieval method significantly boosts performance in ICL settings on multilingual GEC datasets, improving the precision of correction. For high-resource languages, our results on 8B-sized open-source models match those of closed-source models such as Deepseek2.5 and GPT-4o-mini. For low-resource languages, our $F_{0.5}$ scores surpass the baseline by up to a factor of 1.20. This method provides a more precise and resource-efficient solution for multilingual GEC, offering a promising direction for interpretable GEC research.

Code is publicly available at [https://github.com/viniferagy/GER](https://github.com/viniferagy/GER).

Corresponding author: Houfeng Wang.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.15416v1/x1.png)

Figure 1: A minimal working example demonstrating the workflow of representational retrieval. Given an erroneous input with predictions containing both under-correction (marked in red) and over-correction (marked in blue), we first transform the error information detected by the model into the Grammatical Error Representation (GER). Then, we retrieve GER-adjacent demonstrations from the error database, which exhibit error patterns similar to those in the input. These demonstrations guide the model to make more precise corrections and alleviate over-corrections.

Grammatical Error Correction (GEC) is an important research field in natural language processing (NLP), as it requires language models to understand the syntax, semantics, and pragmatics underlying the subtle structures of natural sentences (Bryant et al., [2023](https://arxiv.org/html/2606.15416#bib.bib1)). Initially considered a specific case of machine translation (Yuan and Briscoe, [2016](https://arxiv.org/html/2606.15416#bib.bib5); Junczys-Dowmunt et al., [2018](https://arxiv.org/html/2606.15416#bib.bib6)), GEC has evolved with two dominant approaches. Text-to-text methods (Katsumata and Komachi, [2020](https://arxiv.org/html/2606.15416#bib.bib61); Sun et al., [2021](https://arxiv.org/html/2606.15416#bib.bib80); Ingólfsdóttir et al., [2023](https://arxiv.org/html/2606.15416#bib.bib72)) construct pairs of erroneous input and corrected output sentences and train encoder-decoder models, while text-to-edit approaches (Stahlberg and Kumar, [2020](https://arxiv.org/html/2606.15416#bib.bib82); Omelianchuk et al., [2020](https://arxiv.org/html/2606.15416#bib.bib8)) rely on the encoder's capabilities to identify errors and make corrections.

As Large Language Models (LLMs) have come to prominence, they have achieved considerable results in GEC (Maeng et al., [2023](https://arxiv.org/html/2606.15416#bib.bib36); Zeng et al., [2024](https://arxiv.org/html/2606.15416#bib.bib37)). However, LLMs that are not specifically adapted for GEC tasks face two main challenges: misalignment and over-correction (Loem et al., [2023](https://arxiv.org/html/2606.15416#bib.bib35)). These models often produce corrections misaligned with human-annotated labels, and they may over-correct error-free parts, rewriting them into more fluent forms. This behavior violates the Minimum Edit Distance principle (Nagata and Sakaguchi, [2016](https://arxiv.org/html/2606.15416#bib.bib84)) that humans are accustomed to following when correcting grammatical errors.

Since few-shot inference is widely used to bridge alignment gaps in downstream tasks through in-context learning (ICL), LLM-based GEC systems have leveraged correction examples from databases to improve performance and interpretability (Davis et al., [2024](https://arxiv.org/html/2606.15416#bib.bib34); Song et al., [2024](https://arxiv.org/html/2606.15416#bib.bib39)). However, vanilla retrieval methods based on sentence embedding or k-nearest neighbors (kNN) struggle to meet the unique needs of grammatical error selection (Vasselli and Watanabe, [2023](https://arxiv.org/html/2606.15416#bib.bib16)). Grammatical errors are typically localized structural issues that are independent of word meanings, but model embeddings combine syntax and semantics into a single vector, so embedding-based retrieval fails to find samples with similar error patterns.

In this paper, we argue that despite the alignment problem in GEC tasks, language-proficient models can smoothly distinguish wrong from right and identify error patterns. This suggests that we should focus less on the generation capabilities of LLMs and more on their internal knowledge about grammatical errors. We probe two key questions: *How does a language model encode grammatical errors internally?* and *Can we extract grammatical error representations that are disentangled from semantics?*

To answer these questions, we introduce a novel method to extract the Grammatical Error Representation (GER), a precise and interpretable representation of grammatical errors with less semantic noise, for guiding the retrieval of in-context demonstrations. Specifically, we compute error vectors (EV) by applying PCA to the difference between the hidden states of erroneous and correct tokens. We then project the hidden states of errors onto the EV to obtain the GER. As shown in [Figure 1](https://arxiv.org/html/2606.15416#S1.F1), our GER preserves the proximity of fine-grained errors: during retrieval, each detected error aligns with similar error patterns. Additionally, over-corrected tokens are queried for similar over-correction cases in the database, improving the precision of the correction process. During inference, the number of retrieved examples dynamically adjusts based on the detected errors in the sentence, allowing for more efficient use of computational resources.

We conduct extensive experiments to demonstrate our consistent outperformance on five GEC datasets across four languages. Without additional training or generation, we obtain high-quality and interpretable demonstrations for ICL. Our results surpass state-of-the-art (SOTA) GEC retrieval methods, increasing $F_{0.5}$ by up to 9.46 points for high-resource languages like English, and by a factor of 1.20 for low-resource languages like Estonian. On open-source 8B-sized models, our approach yields results comparable to closed-source LLM baselines such as Deepseek2.5 and GPT-4o-mini, as reported by Li et al. ([2025](https://arxiv.org/html/2606.15416#bib.bib67)).

Our contributions are summarized as follows:

- We introduce a novel method to disentangle grammatical errors from semantic information and encode them as grammatical error representations (GER), a high-quality encoding for grammatical errors.
- We develop an effective retriever to query examples with similar error patterns based on GER, enabling powerful ICL with LLMs across multilingual datasets.
- To the best of our knowledge, we are the first to explore the relationship between grammatical errors and LLM representations, offering new insights for utilizing LLMs' representations to guide GEC tasks.

## 2 Related Works

### 2.1 Grammatical Error Correction

Grammatical Error Correction (GEC) systems have wide applications in proofreading, education, and second language acquisition (Kaneko et al., [2022](https://arxiv.org/html/2606.15416#bib.bib70); Caines et al., [2023](https://arxiv.org/html/2606.15416#bib.bib2); Liang et al., [2023](https://arxiv.org/html/2606.15416#bib.bib71)). Research has primarily focused on two Transformer-based approaches: sequence-to-sequence generation (Yuan and Briscoe, [2016](https://arxiv.org/html/2606.15416#bib.bib5); Junczys-Dowmunt et al., [2018](https://arxiv.org/html/2606.15416#bib.bib6); Li et al., [2022](https://arxiv.org/html/2606.15416#bib.bib83)) and sequence-to-edit tagging (Awasthi et al., [2019](https://arxiv.org/html/2606.15416#bib.bib7); Omelianchuk et al., [2020](https://arxiv.org/html/2606.15416#bib.bib8)). Given the local and sparse nature of grammatical errors, researchers often generate synthetic data (Stahlberg and Kumar, [2024](https://arxiv.org/html/2606.15416#bib.bib68)), incorporate additional information (Zhang et al., [2022](https://arxiv.org/html/2606.15416#bib.bib13); Fei et al., [2023](https://arxiv.org/html/2606.15416#bib.bib20)), or add extra processing steps during inference (Lai et al., [2022](https://arxiv.org/html/2606.15416#bib.bib9); Zhou et al., [2023](https://arxiv.org/html/2606.15416#bib.bib14); Zhang et al., [2023](https://arxiv.org/html/2606.15416#bib.bib12); Li and Wang, [2024](https://arxiv.org/html/2606.15416#bib.bib10)) to boost performance. Recent work also explores LLMs for GEC, either through direct correction generation (Loem et al., [2023](https://arxiv.org/html/2606.15416#bib.bib35)) or instruction tuning (Fan et al., [2023](https://arxiv.org/html/2606.15416#bib.bib38)). Despite challenges like over-correction and misalignment in LLMs (Vasselli and Watanabe, [2023](https://arxiv.org/html/2606.15416#bib.bib16)), human evaluations often rate their corrections highly (Zeng et al., [2024](https://arxiv.org/html/2606.15416#bib.bib37)).

### 2.2 Interpretable Representations in LLMs

Although LLMs are often seen as black boxes due to their vast number of parameters, recent research has shown that they develop emergent structures within their representations (Elhage et al., [2021](https://arxiv.org/html/2606.15416#bib.bib79); Zou et al., [2023](https://arxiv.org/html/2606.15416#bib.bib73)). In the simplest case, a single dimension within the model is sufficient to characterize a specific behavior (Arditi et al., [2024](https://arxiv.org/html/2606.15416#bib.bib77); Sheng et al., [2024](https://arxiv.org/html/2606.15416#bib.bib78)); more complex circuits may involve dozens of neurons distributed across different layers interacting to form meaningful components (Wang et al., [2023](https://arxiv.org/html/2606.15416#bib.bib74)). These interpretable components can be understood and controlled through techniques like adding, deleting, replacing, or tuning (Liu et al., [2024](https://arxiv.org/html/2606.15416#bib.bib75); Wu et al., [2024](https://arxiv.org/html/2606.15416#bib.bib76)). Our work is the first to explore and utilize LLMs' representations related to grammatical errors.

### 2.3 In-Context Learning in GEC

LLMs have demonstrated the ability to align their generated results to the knowledge domain and style of several in-context examples (Brown et al., [2020](https://arxiv.org/html/2606.15416#bib.bib55); Saakyan and Muresan, [2024](https://arxiv.org/html/2606.15416#bib.bib86)). The few-shot inference paradigm avoids the additional parameters and computational costs of fine-tuning on downstream tasks.

The selection of examples in the prompt largely affects the performance of ICL. Researchers have improved retrieval results by filtering the data (He et al., [2021](https://arxiv.org/html/2606.15416#bib.bib85); Peng et al., [2023](https://arxiv.org/html/2606.15416#bib.bib26)) or optimizing query encodings and retrieval algorithms (Li and Qiu, [2023](https://arxiv.org/html/2606.15416#bib.bib27); Wang et al., [2024](https://arxiv.org/html/2606.15416#bib.bib28)). The most helpful examples usually share similar encodings to the query, along with sufficient diversity to increase information entropy. However, for GEC tasks, this selection goal is hard to achieve. Due to the entanglement of syntax and semantics, the error encodings tend to retrieve examples with similar meanings instead of analogous error types (Vasselli and Watanabe, [2023](https://arxiv.org/html/2606.15416#bib.bib16); Song et al., [2024](https://arxiv.org/html/2606.15416#bib.bib39)). Recent works tackle this entanglement by having models write error explanations, which are then used to retrieve errors based on the explanation embeddings (Li et al., [2025](https://arxiv.org/html/2606.15416#bib.bib67)). Despite the improved retrieval performance, these methods still suffer from coarse sentence-level granularity and the semantic noise introduced by generated explanations. Moreover, no work has yet addressed the issue of over-correction.

## 3 Methods

In this section, we describe a novel method for extracting vectors that characterize grammatical error information and using them to create semantically neutral grammatical error representations (GER). GER from the training dataset is stored in a database, where each error is associated with its original and corrected texts. During inference, the model retrieves similar correction examples based on GER to guide corrections, with the flexibility to dynamically adjust the number of examples depending on the complexity of the input sentence. The final GEC prediction is generated by combining the retrieved examples with a correction template.

![Refer to caption](https://arxiv.org/html/2606.15416v1/x2.png)

Figure 2: The pipeline for the proposed representational retrieval for few-shot GEC. Left: The hidden states that best reflect the error information are extracted and transformed through PCA to obtain error vectors (EV). The projections onto EV, denoted as grammatical error representations (GER), are stored as keys in the database. Right: During inference, GER of the test input serves as the query to retrieve similar error patterns to aid correction.

### 3.1 Extraction of Error Vectors

Given a GEC dataset $\mathcal{S}=\{(x^{(k)},y^{(k)})\}_{k=1}^{N}$, each sample consists of a potentially erroneous text $x$ and its parallel corrected text $y$. $x$ is prompted with an initial correction prompt, which can be zero-shot or filled with random initial demonstrations. (The selection of examples in the initial prompt is discussed in [Section 5.3](https://arxiv.org/html/2606.15416#S5.SS3).) During the generation of the initial prediction $\hat{y}$, we extract the hidden state at the $i$-th position from the $t$-th layer of the model, denoted as $\mathbf{h}_{i}^{(t)}$, obtaining the set $\mathcal{H}^{(t)}$. The choice of the specific layer $t$ is discussed in [Section 5.2](https://arxiv.org/html/2606.15416#S5.SS2). For simplicity, the subsequent formulas omit the layer index.

$$\hat{y}=\text{LLM}\big(\text{prompt}_{\text{init}}(x)\big)\tag{1}$$

$$\mathcal{H}^{(t)}=\left\{\mathbf{h}_{i}^{(t)}\mid\forall i\in\{1,\dots,|\hat{y}|\}\right\}\tag{2}$$
By comparing $x$ and $\hat{y}$, we identify all edits made by the LLM and collect the set of edited positions $\mathcal{E}$ and unedited positions $\mathcal{U}$. The corresponding hidden states, $\mathcal{H}_{\mathcal{E}}$ and $\mathcal{H}_{\mathcal{U}}$, contain the information necessary for the model to decide whether to correct. The difference between these sets captures the directions that guide the model from copying the original text to making corrections, which is precisely the information related to grammatical errors. We multiply this difference by a random sign variable $\alpha_{e,u}\in\{-1,1\}$, which randomly changes the sign to enhance the weight of the error-related directions in the principal components.

$$\begin{aligned}\mathcal{E}&=\left\{i\mid\text{Align}(x,\hat{y})[i]=\text{Edited}\right\}_{i=1}^{|\hat{y}|}\\ \mathcal{U}&=\left\{i\mid\text{Align}(x,\hat{y})[i]=\text{Unedited}\right\}_{i=1}^{|\hat{y}|}\end{aligned}\tag{3}$$

$$\begin{aligned}\mathcal{H}_{\mathcal{E}}&=\left\{\mathbf{h}_{i}\mid\forall i\in\mathcal{E}\right\}\\ \mathcal{H}_{\mathcal{U}}&=\left\{\mathbf{h}_{i}\mid\forall i\in\mathcal{U}\right\}\end{aligned}\tag{4}$$

$$\Delta\mathbf{H}=\left\{\alpha_{e,u}(\mathbf{h}_{e}-\mathbf{h}_{u})\mid\forall e\in\mathcal{E},\forall u\in\mathcal{U}\right\}\tag{5}$$
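The split into edited and unedited positions can be sketched as follows. This is only an illustration: it swaps in Python's standard `difflib.SequenceMatcher` over whitespace tokens in place of the alignment tool used in the experiments (ERRANT, see Section 4.1), and the helper name `edit_positions` is hypothetical. Positions are 0-indexed here, whereas the equations count from 1.

```python
from difflib import SequenceMatcher


def edit_positions(x_tokens, y_hat_tokens):
    """Split the positions of y_hat into edited (E) and unedited (U) sets."""
    edited, unedited = set(), set()
    matcher = SequenceMatcher(a=x_tokens, b=y_hat_tokens, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "equal" spans were copied from x; "replace" and "insert" spans were changed.
        # Pure deletions cover no position in y_hat (j1 == j2), so they add nothing here.
        target = unedited if tag == "equal" else edited
        target.update(range(j1, j2))
    return edited, unedited


x = "She go to school yesterday .".split()
y_hat = "She went to school yesterday .".split()
E, U = edit_positions(x, y_hat)
```

A real pipeline would also need to map these token positions onto the model's subword positions before indexing hidden states, which this sketch leaves out.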
We apply Principal Component Analysis (PCA) to the difference $\Delta\mathbf{H}$, yielding a set of principal components $\mathbf{R}$. As shown in [Section 5.1](https://arxiv.org/html/2606.15416#S5.SS1), $\mathbf{R}$ encapsulates information related to grammatical errors, with the first principal component $\mathbf{r}_{1}$ representing the simplicity of the error, indicating how easily it can be corrected. The first two principal components are sufficient for encoding simple error types disentangled from the text's meaning. We designate $\mathbf{R}$ as the *error vectors (EV)* of the model.

$$\Delta\mathbf{H}=\mathbf{U}\mathbf{\Sigma}\mathbf{R}^{\top}\tag{6}$$
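A minimal NumPy sketch of Eq. 5 and Eq. 6 on toy data is given below. It assumes the hidden states of edited and unedited positions are already stacked into arrays, takes the decomposition literally as a thin SVD without mean-centering, and uses hypothetical names (`error_vectors`, `h_e`, `h_u`); the synthetic "error direction" is invented purely to have something to recover.

```python
import numpy as np


def error_vectors(h_edited, h_unedited, n_components=2, seed=0):
    """Return the top principal directions of the signed difference set (Eq. 5-6)."""
    rng = np.random.default_rng(seed)
    d = h_edited.shape[1]
    # All pairwise differences h_e - h_u, shape (|E| * |U|, d).
    diffs = (h_edited[:, None, :] - h_unedited[None, :, :]).reshape(-1, d)
    # One random sign per (e, u) pair, as in Eq. 5.
    signs = rng.choice([-1.0, 1.0], size=(diffs.shape[0], 1))
    delta_h = signs * diffs
    # Thin SVD: delta_h = U @ diag(S) @ Vt, so the rows of Vt are the columns of R.
    _, _, vt = np.linalg.svd(delta_h, full_matrices=False)
    return vt[:n_components]


# Toy data: edited states are shifted along one synthetic "error direction".
rng = np.random.default_rng(0)
direction = np.zeros(8)
direction[0] = 1.0
h_u = 0.1 * rng.normal(size=(20, 8))
h_e = 0.1 * rng.normal(size=(5, 8)) + 3.0 * direction
R = error_vectors(h_e, h_u, n_components=2)
```

Because this sketch skips mean-centering, flipping row signs does not change the singular vectors. With a PCA implementation that centers its input, the random signs keep the mean of $\Delta\mathbf{H}$ near zero, so the shared error direction is not subtracted away before the decomposition.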

### 3.2 Construction of GER Database

For each correction $e\in\mathcal{E}$, we average the difference between $\mathbf{h}_{e}$ and all corresponding $\mathbf{h}_{u}\in\mathcal{H}_{\mathcal{U}}$ in the same sentence, canceling out noise from token meanings and positional embeddings. We then project this averaged difference onto the first $m$ principal components to obtain the *grammatical error representation (GER)* $\mathbf{p}_{e}^{(m)}$. (The choice of dimensions for GER is discussed in [Section 5.1.3](https://arxiv.org/html/2606.15416#S5.SS1.SSS3).) We omit dimension labeling where it is not necessary. GER serves as the key, with the corresponding pair $(x,y)$ as the label, to construct the GER database $\mathcal{D}$.

$$\Delta\mathbf{\bar{h}}_{e}=\frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}}(\mathbf{h}_{e}-\mathbf{h}_{u})\tag{7}$$

$$\mathbf{p}_{e}^{(m)}=\begin{bmatrix}\mathbf{r}_{1},\mathbf{r}_{2},\dots,\mathbf{r}_{m}\end{bmatrix}^{\top}\Delta\mathbf{\bar{h}}_{e},\quad\forall e\in\mathcal{E}\tag{8}$$

$$\mathcal{D}=\left\{\left(\mathbf{p}_{e}\to(x,y)\right)\mid\forall(x,y)\in\mathcal{S},\forall e\in\mathcal{E}\right\}\tag{9}$$
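The per-sentence construction in Eq. 7 to Eq. 9 can be sketched in a few lines of NumPy. The function name `build_ger_entries`, the array layout (one hidden-state row per position of the prediction), and the use of a plain list of key and label pairs as the "database" are assumptions made for illustration; the experiments build the actual index with LlamaIndex.

```python
import numpy as np


def build_ger_entries(hidden, edited, unedited, ev, x, y):
    """Return (GER key, (x, y)) pairs for one sentence, following Eq. 7-9."""
    if not unedited:
        return []  # no reference positions to average against
    h_u_mean = hidden[sorted(unedited)].mean(axis=0)
    entries = []
    for e in sorted(edited):
        # Eq. 7: the mean over u of (h_e - h_u) equals h_e minus the mean of h_u.
        delta_bar = hidden[e] - h_u_mean
        # Eq. 8: project onto the m error vectors (the rows of ev).
        ger = ev @ delta_bar
        entries.append((ger, (x, y)))
    return entries


# Toy example: 4 positions, hidden size 3, m = 2 axis-aligned error vectors.
hidden = np.arange(12, dtype=float).reshape(4, 3)
ev = np.eye(3)[:2]
entries = build_ger_entries(hidden, {1}, {0, 2, 3}, ev, "She go home .", "She goes home .")
```

Each edit contributes its own key, so a sentence with several errors appears in the database once per error.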

### 3.3 Retrieval of In-Context Demonstrations

During inference, the test input $\widetilde{x}\in\widetilde{\mathcal{S}}$ undergoes the pipeline from [Equation 1](https://arxiv.org/html/2606.15416#S3.E1) to [Equation 5](https://arxiv.org/html/2606.15416#S3.E5) to obtain GER for every edit, which is then used as the query $\mathbf{q}_{e}$ to retrieve the $K_{e}$ nearest neighbors from $\mathcal{D}$.

$$\mathcal{N}(\mathbf{q}_{e})=\left\{\left(\mathbf{p}_{e}\to(x,y)\right)^{(j)}\right\}_{j=1}^{K_{e}}\subseteq\mathcal{D}\tag{10}$$

Thanks to the fine-grained error encoding, we dynamically allocate the number of retrieved demonstrations $K_{s}$ based on the complexity of each sentence's errors. Sentences deemed error-free by the model are not assigned examples, saving computational resources for sentences with more errors. We further reveal in [Section 5.1](https://arxiv.org/html/2606.15416#S5.SS1) that the magnitude of the first dimension of GER, $|\mathbf{p}_{e}^{(1)}|$, correlates with the simplicity of the error. Therefore, we prioritize retrieval for errors that have small $|\mathbf{p}_{e}^{(1)}|$, further optimizing resource allocation. (We describe the exact logic of dynamic selection in [Section A.5](https://arxiv.org/html/2606.15416#A1.SS5).)

The retrieved examples are concatenated and combined with a few-shot correction template to prompt the final GEC prediction. The inference pipeline is illustrated in [Figure 2](https://arxiv.org/html/2606.15416#S3.F2), and the prompts used are listed in [Section A.4](https://arxiv.org/html/2606.15416#A1.SS4).
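The retrieval step can be illustrated with a small pure-Python sketch. Several pieces are assumptions rather than the procedure from the appendix: Euclidean distance as the similarity measure, a brute-force scan instead of an index, a hard per-sentence cap `max_total`, and the function name `retrieve_demonstrations`. Only the priority rule (serve errors with small first GER dimension first) and the "no examples for error-free sentences" behavior come from the text above.

```python
import math


def retrieve_demonstrations(queries, database, k_per_error=4, max_total=8):
    """Pick demonstrations for one sentence from its GER queries (one per detected edit)."""
    selected = []
    # Errors with a small |first GER dimension| are served first.
    for q in sorted(queries, key=lambda q: abs(q[0])):
        # Brute-force nearest neighbours under Euclidean distance (an assumption).
        neighbours = sorted(database, key=lambda entry: math.dist(q, entry[0]))
        for _key, pair in neighbours[:k_per_error]:
            if pair not in selected and len(selected) < max_total:
                selected.append(pair)
    return selected


database = [
    ((0.0, 0.0), ("a", "A")),
    ((1.0, 0.0), ("b", "B")),
    ((5.0, 5.0), ("c", "C")),
]
demos = retrieve_demonstrations([(0.9, 0.1)], database, k_per_error=2)
no_demos = retrieve_demonstrations([], database)
```

A sentence with no detected edits yields an empty query list and therefore an empty demonstration list, which is where the saving in prompt length comes from.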

## 4 Experiments

### 4.1 Datasets, Models, and Metrics

We evaluate the proposed method on five GEC datasets across four languages to test GER's ability to encode and retrieve errors. Following the multilingual setup in Li et al. ([2025](https://arxiv.org/html/2606.15416#bib.bib67)), we process the training dataset and use LlamaIndex (Liu, [2022](https://arxiv.org/html/2606.15416#bib.bib66)) to construct the database and retriever.

For high-resource English (EN), we use the W&I+LOCNESS (Bryant et al., [2019](https://arxiv.org/html/2606.15416#bib.bib43)) as the training dataset, and the CoNLL-14 (Ng et al., [2013](https://arxiv.org/html/2606.15416#bib.bib42)) and BEA-19 (Bryant et al., [2019](https://arxiv.org/html/2606.15416#bib.bib43)) datasets for testing. For medium-resource German (DE), we use the Falko-Merlin (Boyd, [2018](https://arxiv.org/html/2606.15416#bib.bib50)) dataset for both training and testing. To showcase the generalizability of our method, we also include low-resource Romanian (RO) and Estonian (ET). For Romanian, we choose the RONACC (Cotet et al., [2020](https://arxiv.org/html/2606.15416#bib.bib48)) training and test datasets; for Estonian, we use the Tartu L2 learner corpus (Rummo and Praakli, [2017](https://arxiv.org/html/2606.15416#bib.bib49)) as the database and the L1 corpus (Tartu-L1) as the test data. (The detailed statistics of GEC datasets are placed in [Section A.1](https://arxiv.org/html/2606.15416#A1.SS1).)

Since GER requires the model's internal states, all experiments are conducted using recent open-source multilingual LLMs, including Meta's Llama3.1-8B-Instruct (Dubey et al., [2024](https://arxiv.org/html/2606.15416#bib.bib57)) and Tongyi's Qwen2.5-7B-Instruct (Yang et al., [2024](https://arxiv.org/html/2606.15416#bib.bib59)). Adhering to the dataset-specific evaluation pipeline for each language, we use the ERRANT toolkit (Bryant et al., [2017](https://arxiv.org/html/2606.15416#bib.bib65)) to align edits between initial and final predictions. For evaluation, we apply M2Scorer (Dahlmeier and Ng, [2012](https://arxiv.org/html/2606.15416#bib.bib64)) for CoNLL-14, Falko-Merlin, and Tartu-L1, and ERRANT for BEA-19 and RONACC.

Our method is compared with the following baselines:

- Random: Random selection of in-context demonstrations from the database;
- Semantic: kNN retrieval based on input text embeddings (Khandelwal et al., [2021](https://arxiv.org/html/2606.15416#bib.bib15));
- BM25: A term-based ranking function widely used in information retrieval (Robertson et al., [2009](https://arxiv.org/html/2606.15416#bib.bib63));
- Explanation: Retrieval based on the similarity of LLM-generated explanations for erroneous sentences (Li et al., [2025](https://arxiv.org/html/2606.15416#bib.bib67)).

All experiments are conducted in an 8-shot setting. For all baseline methods, we retrieve 4 erroneous and 4 correct examples, following Li et al. ([2025](https://arxiv.org/html/2606.15416#bib.bib67)). Since our method dynamically determines the number of examples needed for each sentence, we retrieve 4 examples for each error and ensure that the average number of demonstrations is 8.

### 4.2 Main Results

| Model | Method | CoNLL-14 (EN) P | CoNLL-14 (EN) R | CoNLL-14 (EN) $F_{0.5}$ | BEA-19 (EN) P | BEA-19 (EN) R | BEA-19 (EN) $F_{0.5}$ | Falko-Merlin (DE) P | Falko-Merlin (DE) R | Falko-Merlin (DE) $F_{0.5}$ | RONACC (RO) P | RONACC (RO) R | RONACC (RO) $F_{0.5}$ | Tartu-L1 (ET) P | Tartu-L1 (ET) R | Tartu-L1 (ET) $F_{0.5}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama3.1 (8B) | Random | 54.02 | 52.60 | 53.73 | 44.20 | 63.43 | 47.05 | 59.62 | 54.53 | 58.53 | 35.64 | 40.70 | 36.55 | 12.55 | 22.34 | 13.76 |
| Llama3.1 (8B) | Semantic | 55.21 | 51.56 | 54.44 | 45.51 | 62.84 | 48.17 | 60.03 | 54.15 | 58.75 | 39.33 | 43.77 | 40.14 | 12.74 | 22.52\* | 13.95 |
| Llama3.1 (8B) | BM25 | 54.58 | 51.58 | 53.95 | 44.18 | 62.95 | 46.98 | 59.65 | 55.63 | 58.80 | 40.32 | 45.45 | 41.25 | - | - | - |
| Llama3.1 (8B) | Explanation | 55.00 | 53.04 | 54.60 | 45.24 | 63.26 | 47.97 | 60.35 | 54.79 | 59.15 | 38.64 | 44.78 | 39.72 | 13.38 | **23.09** | 14.61 |
| Llama3.1 (8B) | GER-Vanilla | 58.60\* | **55.33** | 57.92\* | 51.42\* | 65.67\* | 53.75\* | 64.35\* | 55.88\* | 62.46\* | 45.08\* | **46.14** | 45.29\* | 16.18\* | 19.45 | 16.74\* |
| Llama3.1 (8B) | GER-IPE | **60.11** | 54.75\* | **58.96** | **55.63** | **67.28** | **57.63** | **65.54** | **57.34** | **63.71** | **48.53** | 45.61\* | **47.92** | **16.37** | 20.57 | **17.07** |
| Qwen2.5 (7B) | Random | 54.43 | 53.50 | 54.24 | 44.84 | 63.62 | 47.65 | 55.25 | 48.06 | 53.65 | 29.73 | 26.06 | 28.91 | 7.11 | 16.35 | 8.02 |
| Qwen2.5 (7B) | Semantic | 55.27 | 52.65 | 54.73 | 45.48 | 63.40 | 48.21 | 57.81 | 48.57 | 55.69 | 35.76 | 30.43 | 34.55 | 6.93 | **19.30** | 7.95 |
| Qwen2.5 (7B) | BM25 | 54.11 | 52.25 | 53.73 | 44.67 | 63.89\* | 47.53 | 57.21 | 50.18\* | 55.65 | 36.28 | 34.21\* | 35.84 | - | - | - |
| Qwen2.5 (7B) | Explanation | 55.67 | 51.60 | 54.81 | 47.22 | 62.31 | 49.62 | 57.33 | 47.63 | 55.08 | 30.17 | 29.53 | 30.04 | 7.16 | 19.10\* | 8.18 |
| Qwen2.5 (7B) | GER-Vanilla | 55.78\* | **56.94** | 56.00\* | 49.12\* | 63.24 | 51.41\* | **61.09** | 48.15 | 57.97\* | 36.58\* | **34.36** | 36.11\* | 8.59\* | 12.51 | 9.16\* |
| Qwen2.5 (7B) | GER-IPE | **57.53** | 55.62\* | **57.13** | **52.37** | **67.37** | **54.81** | 60.31\* | **51.90** | **58.42** | **37.75** | 32.69 | **36.62** | **9.19** | 13.50 | **9.82** |

Table 1: Results on multilingual GEC datasets by different retrieval methods. "Random" refers to the retrieval baseline by random selection; "Semantic", "BM25", and "Explanation" retrieve demonstrations based on text embedding, BM25 matching, and LLM-generated explanations, respectively. "GER-Vanilla" refers to our representation-based retrieval method, and "GER-IPE" refers to GER with Initial Prompt Enhancement. The best results are marked in bold, and the second-best results are marked with an asterisk (\*).

During preliminary experiments, we found that the construction of examples in the initial prompt significantly affects results. Thus, we present results in two configurations: "GER-Vanilla" refers to generating the initial predictions using the vanilla initial prompt, and "GER-IPE" (GER with Initial Prompt Enhancement) adds 8 randomly chosen examples into the initial prompt.

As [Table 1](https://arxiv.org/html/2606.15416#S4.T1) demonstrates, our GER-based retrieval methods consistently outperform other baseline methods in both prompt settings. In the GER-IPE setting, our method exceeds the explanation-based SOTA by 4.36 and 4.56 points on the English CoNLL-14 and German Falko-Merlin datasets, respectively. Moreover, on the BEA-19 dataset it achieves a 9.46-point higher $F_{0.5}$ than the semantic SOTA, nearly a 20% improvement. GER-Vanilla still improves on the SOTA by roughly 3 to 5.6 points, testifying to the effectiveness of our GER extraction and retrieval process.

On low-resource languages, GER retrieval yields even better results. For Romanian, the $F_{0.5}$ score improves by 6.67 points, while Estonian shows a 2.46-point improvement (nearly 17%). In GER-Vanilla, results are about 1 point lower but still surpass the SOTA. We hypothesize that low-resource languages benefit more from examples to help the model grasp syntax and generate corrections, as discussed in [Section 5.3](https://arxiv.org/html/2606.15416#S5.SS3).
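The low-resource figures can be checked directly against the $F_{0.5}$ columns of Table 1. The pairing below (GER-IPE against the strongest baseline in each column) is an assumption about which rows are being compared, chosen because it reproduces the quoted numbers. For Llama3.1 (8B):

$$47.92-41.25=6.67\ \text{(RONACC, vs. BM25)},\qquad 17.07-14.61=2.46,\quad\frac{2.46}{14.61}\approx 16.8\%\ \text{(Tartu-L1, vs. Explanation)}$$

Under the same assumption, the "factor of 1.20" quoted in the abstract appears to match the Qwen2.5 (7B) Estonian result:

$$\frac{9.82}{8.18}\approx 1.20$$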

On the Qwen2.5 model, the results follow a similar trend to Llama3.1, confirming the generalizability of our approach across models. However, the advantage is slightly lower for low-resource languages, likely due to Qwen2.5's smaller pre-trained corpus for these languages.

### 4.3 Comparison with SOTA

| Backbone | Method | Lang | EN $F_{0.5}$ | DE $F_{0.5}$ | ET $F_{0.5}$ |
|---|---|---|---|---|---|
| *Fine-tuned GEC Single Model* | | | | | |
| gT5 xxl | Rothe et al. ([2021](https://arxiv.org/html/2606.15416#bib.bib30)) | Mono | 65.7 | **76.0** | - |
| NLLB | Luhtaru et al. ([2024](https://arxiv.org/html/2606.15416#bib.bib33)) | Multi | 65.2 | 73.9 | **63.2** |
| BART | Zhou et al. ([2023](https://arxiv.org/html/2606.15416#bib.bib14)) | Mono | **69.6** | - | - |
| *Inference of LLMs* | | | | | |
| GPT-3.5-Turbo | Davis et al. ([2024](https://arxiv.org/html/2606.15416#bib.bib34)) | - | 57.2 | - | - |
| GPT-3.5-Turbo | Tang et al. ([2024](https://arxiv.org/html/2606.15416#bib.bib29)) | - | 58.8 | - | - |
| Deepseek2.5 | Li et al. ([2025](https://arxiv.org/html/2606.15416#bib.bib67)) | - | **59.4** | 63.4 | **22.7** |
| GPT-4o-mini | Li et al. ([2025](https://arxiv.org/html/2606.15416#bib.bib67)) | - | 58.7 | **65.6** | 19.9\* |
| Llama3.1 (8B) | Ours | - | 59.0\* | 63.7\* | 17.1 |

Table 2: The comparison of state-of-the-art (SOTA) models on multilingual GEC datasets. "EN", "DE", and "ET" stand for the CoNLL-14, Falko-Merlin, and Tartu-L1 datasets, respectively. Fine-tuned language models are labeled with their training data in the "Lang" column, where the "Mono" models are tuned separately for each language, and the "Multi" models with multilingual mixed data. Within each block, the best results are marked in bold, and the second-best results are marked with an asterisk (\*).

Current datasets reveal a persistent performance disparity in GEC tasks: while fine-tuned specialist models achieve state-of-the-art (SOTA) results across multilingual benchmarks (see [Table 2](https://arxiv.org/html/2606.15416#S4.T2)), in-context learning (ICL) with LLMs exhibits significant accuracy gaps. Our representational retrieval method manages to achieve results comparable to some closed-source models on high-resource English and German, including the Deepseek2.5 and GPT-4o-mini baselines reported by Li et al. ([2025](https://arxiv.org/html/2606.15416#bib.bib67)). These promising results demonstrate the potential of utilizing interpretable components within the model to better align with human concepts and annotations of grammatical errors.

### 4.4 Over-correction mitigation

To clarify the mechanism behind our method's effectiveness, we report the True Positive (TP), False Positive (FP), and False Negative (FN) statistics from representative Llama3.1-8B runs in [Table 3](https://arxiv.org/html/2606.15416#S4.T3). Compared to the best-performing baseline, our GER method reduces FP by up to nearly 30% (e.g., from 1603 to 1153 on RONACC). This indicates that the performance improvement stems primarily from gains in precision, driven by the reduction in FP, while recall remains relatively stable (i.e., TP increases only modestly). The mitigation of over-correction is particularly pronounced in low-resource languages such as Romanian, where models exhibit a higher propensity for over-correction.

| Method | EN TP (↑) | EN FP (↓) | EN FN (↓) | DE TP (↑) | DE FP (↓) | DE FN (↓) | RO TP (↑) | RO FP (↓) | RO FN (↓) |
|---|---|---|---|---|---|---|---|---|---|
| Random | 1529 | 1315 | 1389 | 3239 | 2227 | 2694 | 970 | 1752 | 1413 |
| BM25 | 1484 | 1235 | 1393 | 3311 | 2237 | 2652 | 1080 | 1603 | 1300 |
| Expl. | 1515 | 1244 | 1350 | 3258 | 2121 | 2712 | 1067 | 1694 | 1316 |
| GER | 1613 | 1098 | 1348 | 3423 | 1807 | 2540 | 1081 | 1153 | 1296 |

Table 3: TP/FP/FN counts across datasets from representative Llama3.1-8B runs. "Expl." stands for the Explanation baseline. For TP, larger is better; for FP and FN, smaller is better.
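To make the link between these counts and the precision-oriented metric concrete, the following minimal sketch applies the standard definitions of precision, recall, and $F_{\beta}$ to the Romanian counts in Table 3. The helper name `prf` is ours, and the sketch only applies the textbook formulas to raw counts; the reported scores come from the paper's own evaluation pipeline, so the values printed here are not expected to reproduce them exactly.

```python
def prf(tp: int, fp: int, fn: int, beta: float = 0.5) -> tuple[float, float, float]:
    """Return (precision, recall, F_beta) computed from edit-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # beta = 0.5 weights precision more heavily than recall.
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Romanian (RO) counts copied from Table 3: (TP, FP, FN).
counts = {
    "BM25": (1080, 1603, 1300),
    "GER": (1081, 1153, 1296),
}

for name, (tp, fp, fn) in counts.items():
    p, r, f = prf(tp, fp, fn)
    print(f"{name}: P={p:.3f} R={r:.3f} F0.5={f:.3f}")

# Relative drop in false positives when moving from BM25 to GER.
fp_reduction = 1 - counts["GER"][1] / counts["BM25"][1]
print(f"Relative FP reduction: {fp_reduction:.1%}")
```

With TP and FN almost unchanged, the lower FP count raises precision while recall stays nearly flat, which is the pattern described above.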

### 4.5 Model Scalability

To further demonstrate the effectiveness of our method on larger models, we applied GER to Qwen2.5-14B-Instruct (Yang et al., [2024](https://arxiv.org/html/2606.15416#bib.bib59)). The results are presented in [Table 4](https://arxiv.org/html/2606.15416#S4.T4). Larger models tend to make excessive corrections, which can improve recall but reduce precision. Because it primarily mitigates over-correction, our method generalizes robustly to larger models.

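| Method | EN P | EN R | EN $F_{0.5}$ | DE P | DE R | DE $F_{0.5}$ | ET P | ET R | ET $F_{0.5}$ |
|---|---|---|---|---|---|---|---|---|---|
| Random | 49.2 | 58.0 | 50.7 | 51.8 | 50.6 | 51.6 | 6.5 | 18.1 | 7.5 |
| Expl. | 50.6 | 56.2 | 51.6 | 52.9 | 52.1 | 52.7 | 6.7 | 20.3 | 7.7 |
| GER | 54.3 | 58.5 | 55.1 | 55.2 | 52.9 | 54.7 | 9.0 | 14.2 | 9.7 |

Table 4: Results for the CoNLL-14, Falko-Merlin, and Tartu-L1 datasets on Qwen2.5-14B. "Expl." stands for the Explanation baseline.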

## 5 GER Analysis

### 5.1 Encoding Capacity of GER

The principal components computed by PCA, referred to as error vectors (EVs), capture different levels of error-related information in natural sentences. Our preliminary exploration of the first few EVs shows that the first EV represents the model's recognition and ranking of grammatical errors, while the second EV captures simple information about error types, such as tense issues. In the following analysis, unless stated otherwise, we use the GER-IPE setup with Llama3.1-8B.
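As a rough illustration of how principal directions can be obtained and used as a low-dimensional encoding, the sketch below runs PCA via SVD on a matrix of vectors and projects each vector onto the top components. It is not the paper's extraction procedure: which internal states are fed to PCA is defined in the method section, and here random synthetic vectors with one planted "error" direction stand in for them. The names `fit_error_vectors` and `project` are hypothetical.

```python
import numpy as np


def fit_error_vectors(states: np.ndarray, m: int):
    """PCA via SVD. Returns (mean, top-m directions, their explained variance ratios)."""
    mean = states.mean(axis=0)
    centered = states - mean
    # Rows of vt are unit-norm principal directions, sorted by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    evr = s**2 / np.sum(s**2)
    return mean, vt[:m], evr[:m]


def project(states: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Coordinates of each state along the selected directions."""
    return (states - mean) @ components.T


rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states: 200 tokens, 64 dimensions.
direction = rng.normal(size=64)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=200)  # toy labels: 1 = erroneous token
states = rng.normal(scale=0.3, size=(200, 64)) + 4.0 * labels[:, None] * direction

mean, evs, evr = fit_error_vectors(states, m=2)
ger = project(states, mean, evs)  # shape (200, 2): a 2-dimensional encoding

print("explained variance ratios:", np.round(evr, 3))
print("|cos(first direction, planted direction)|:", round(abs(float(evs[0] @ direction)), 3))
```

In this toy setting the first direction recovers the planted one because it carries most of the variance; whether the same holds for real model states is an empirical question that the analysis in this section addresses.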

#### 5.1.1 The First EV: Error Detector

![Refer to caption](https://arxiv.org/html/2606.15416v1/x3.png)

Figure 3: Distribution of the first GER component with respect to error/correct labels (top) and confusion-matrix categories (bottom).

We illustrate the first component of GER (first GER) obtained from the English training dataset in [Figure 3](https://arxiv.org/html/2606.15416#S5.F3). The figure shows a clear boundary between erroneous and correct tokens along the direction of the first EV, achieving classification accuracy over 98% for correct tokens and over 65% for erroneous tokens, on par with SOTA LMs and superior to LLMs in end-to-end GED tasks (Luhtaru et al., [2024](https://arxiv.org/html/2606.15416#bib.bib33)). The first GER can thus serve as an effective error detector.
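Reading the first component as a detector amounts to thresholding a scalar per token and reporting accuracy separately for each class. The following sketch shows that computation on made-up values; the scores, labels, and threshold are hypothetical and are not taken from the paper's data, and the sign convention (larger means erroneous) is an assumption.

```python
def per_class_accuracy(scores, labels, threshold):
    """Accuracy per class; labels use 1 = erroneous, 0 = correct.

    A token is predicted erroneous when its score exceeds the threshold.
    """
    hits = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for score, label in zip(scores, labels):
        pred = 1 if score > threshold else 0
        totals[label] += 1
        hits[label] += int(pred == label)
    return {k: hits[k] / totals[k] for k in totals if totals[k]}


# Toy first-component values (hypothetical).
scores = [-1.2, -0.9, -0.8, -0.3, 0.1, 0.4, 1.1, 1.6, -0.2, 2.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

acc = per_class_accuracy(scores, labels, threshold=0.2)
print(acc)  # accuracy on correct tokens (key 0) and on erroneous tokens (key 1)
```

Reporting the two classes separately matters here because correct tokens vastly outnumber erroneous ones in real text, so a single pooled accuracy would hide how many errors are missed.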

Moreover, the magnitude of the first GER gives a rough quantitative indication of how easy an error is to correct. We classify predicted tokens using the confusion matrix and plot the distributions of True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) in [Figure 3](https://arxiv.org/html/2606.15416#S5.F3). Cases with a larger first GER magnitude are more likely to represent precise corrections, whereas those with smaller values often correspond to failed corrections (FP, including over-corrections and incorrect corrections).

Consequently, we design a dynamic demonstration selection method that prioritizes errors with small first GER values for demonstration allocation. This approach reserves computational resources for errors prone to failed corrections, which require reference examples for successful resolution. In [Table 5](https://arxiv.org/html/2606.15416#S5.T5), we conduct an ablation study on this selection method, comparing it with random example selection (Random) and with prioritizing retrieval for errors having a large first GER (Reverse). The results validate the efficacy of our dynamic selection method.

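| Method | EN P | EN R | EN $F_{0.5}$ | DE P | DE R | DE $F_{0.5}$ | ET P | ET R | ET $F_{0.5}$ |
|---|---|---|---|---|---|---|---|---|---|
| Dynamic | 60.1 | 54.8 | 59.0 | 65.5 | 57.3 | 63.7 | 15.1 | 20.1 | 15.9 |
| Random | 59.8 | 52.6 | 58.2 | 64.1 | 55.5 | 62.2 | 13.9 | 20.0 | 14.8 |
| Reverse | 60.7 | 50.3 | 58.3 | 65.2 | 54.6 | 62.8 | 14.4 | 17.8 | 15.0 |

Table 5: Ablation of different demonstration selection methods of GER.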
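The difference between the Dynamic and Reverse settings can be pictured as a choice of sort order under a fixed demonstration budget. The sketch below is a simplified reading under two assumptions that this section does not spell out: that "small" refers to the magnitude of the first GER, and that each selected error receives exactly one demonstration slot. `DetectedError` and `allocate` are hypothetical names, the values are toy data, and the Random setting is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectedError:
    token: str        # surface form of the flagged token
    first_ger: float  # projection onto the first EV (toy value)


def allocate(errors, budget, order="dynamic"):
    """Pick which detected errors receive a retrieved demonstration.

    "dynamic": smallest |first_ger| first (cases prone to failed correction).
    "reverse": largest |first_ger| first (the Reverse ablation setting).
    """
    if order not in ("dynamic", "reverse"):
        raise ValueError(f"unknown order: {order}")
    ranked = sorted(errors, key=lambda e: abs(e.first_ger), reverse=(order == "reverse"))
    return ranked[:budget]


errors = [
    DetectedError("goed", 2.1),
    DetectedError("a", 0.4),
    DetectedError("have", 1.0),
    DetectedError("in", 0.2),
]

dynamic = [e.token for e in allocate(errors, budget=2)]
reverse = [e.token for e in allocate(errors, budget=2, order="reverse")]
print("dynamic:", dynamic)
print("reverse:", reverse)
```

Under a tight budget the two orderings select disjoint sets of errors, which is why the ablation can separate their effects.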

#### 5.1.2 The Second EV: Simple Error Classifier

The first EV distinguishes erroneous from correct tokens, but a single dimension cannot provide more detailed information. Introducing the second EV enables recognition of basic grammatical patterns. To validate this progression, we create a specialized test set (specific samples are given in [Appendix C](https://arxiv.org/html/2606.15416#A3)) containing:

- Sport-domain sentences with present perfect progressive (ppp) tense errors;
- Art-domain sentences with simple past (sp) tense errors.

Cross-domain probes are designed as:

- Art-domain samples with ppp errors;
- Sport-domain samples with sp errors.

[Figure 4](https://arxiv.org/html/2606.15416#S5.F4) shows that while semantic embeddings retrieve semantically similar but error-mismatched examples, our 2-dimensional GER successfully clusters analogous errors across domains, demonstrating the proximity and semantic neutrality of GER.

![Refer to caption](https://arxiv.org/html/2606.15416v1/x4.png)

Figure 4: Distribution of different encoding methods on a manually created test set. "sport"/"art" refers to sentences in the sport/art domain, and "ppp"/"sp" refers to present perfect progressive/simple past tense errors. Cross-domain probes are marked as stars.
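The contrast in this probe can be sketched as nearest-neighbour retrieval under two different encodings of the same examples. All coordinates below are invented for illustration (they are not read off Figure 4), and Euclidean distance is an assumption; the point is only that the same query retrieves a different demonstration depending on which space is searched.

```python
import math

# Hypothetical 2-D encodings: "ger" mimics an error-type space,
# "sem" mimics a semantic embedding space.
pool = [
    {"id": "sport-ppp", "ger": (2.0, 1.5), "sem": (0.9, 0.1)},
    {"id": "art-sp", "ger": (1.8, -1.4), "sem": (0.1, 0.9)},
]

# Cross-domain probe: art-domain sentence with a ppp tense error.
query = {"id": "art-ppp", "ger": (2.1, 1.3), "sem": (0.2, 0.8)}


def nearest(query, pool, key):
    """Return the pool example closest to the query in the chosen space."""
    return min(pool, key=lambda ex: math.dist(query[key], ex[key]))


by_error = nearest(query, pool, "ger")["id"]
by_semantics = nearest(query, pool, "sem")["id"]
print("GER retrieval:     ", by_error)
print("semantic retrieval:", by_semantics)
```

In this toy setup, searching the error-type space returns the example that shares the probe's tense error despite the domain change, whereas searching the semantic space returns the same-domain example with a different error.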

#### 5.1.3 Dimensionality Trade-offs in GER

| Dim. | EN P | EN R | EN $F_{0.5}$ | DE P | DE R | DE $F_{0.5}$ | ET P | ET R | ET $F_{0.5}$ |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 59.5 | 54.5 | 58.4 | 65.2 | 57.3 | 63.4 | 14.4 | 19.4 | 15.2 |
| 256 | 59.7 | 53.6 | 58.4 | 65.2 | 57.2 | 63.4 | 15.1 | 20.1 | 15.9 |
| 512 | 59.8 | 54.3 | 58.6 | 65.5 | 57.3 | 63.7 | 14.7 | 20.1 | 15.5 |
| 1024 | 60.1 | 54.8 | 59.0 | 65.4 | 57.4 | 63.6 | 14.9 | 20.4 | 15.8 |
| 2048 | 60.0 | 54.4 | 58.8 | 65.1 | 56.9 | 63.3 | 14.3 | 20.7 | 15.2 |

Table 6: Results across different dimensional configurations of GER.

Increasing the dimensionality of GER ($m$ in $\mathbf{p}_{e}^{(m)}$) enhances its ability to encode fine-grained error patterns, but simultaneously amplifies the semantic noise it contains, causing GER to retrieve examples that are semantically similar rather than examples that share similar error types. Experimental results across different dimensional configurations are presented in [Table 6](https://arxiv.org/html/2606.15416#S5.T6): the more resources the model has about a particular language, the more dimensions it needs to encode errors in that language (the best $F_{0.5}$ is reached at 1024 dimensions for English, 512 for German, and 256 for Estonian). At reduced dimensions, GER fails to distinguish complex errors; on the other hand, when the dimensionality is too large, GER can identify some nuanced error cases but introduces more error-irrelevant samples, resulting in higher recall and lower precision.

### 5.2 Layer Selection

![Refer to caption](https://arxiv.org/html/2606.15416v1/x5.png)

Figure 5: Upper: The explained variance ratio of the first principal component in PCA (first EVR) for each layer. Lower: Accuracy of the grammatical error detection task at each layer. The first EVR and error detection accuracy follow similar trends in Llama3.1 (left) and Qwen2.5 (right).

We select the layer used to extract GER based on grammatical error detection performance. [Figure 5](https://arxiv.org/html/2606.15416#S5.F5) juxtaposes the per-layer error detection performance with the explained variance ratio of the first principal component in PCA (first EVR). The upper panels show a clear spike in the first EVR, coinciding with the most accurate layer in the lower panels. The specific choice of layer differs between models but remains highly consistent across languages within the same model, and the selected layers lie in the middle of each model (the 21st layer for the 32-layer Llama3.1, and the 12th layer for the 28-layer Qwen2.5). This suggests that specific components within the layer are responsible for understanding and processing grammatical error information. We leave further research to future work.
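The first-EVR curve in the upper panels can be computed layer by layer without labels, which makes it a cheap proxy to inspect alongside detection accuracy. The sketch below computes it on synthetic per-layer states in which one layer is constructed to separate the toy classes most strongly. The paper's stated selection criterion is detection accuracy, not EVR, so this only illustrates the quantity being plotted; the function name `first_evr` and all data are hypothetical.

```python
import numpy as np


def first_evr(states: np.ndarray) -> float:
    """Explained variance ratio of the first principal component."""
    centered = states - states.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    return float(s[0] ** 2 / np.sum(s**2))


rng = np.random.default_rng(0)
n_tokens, dim = 300, 32

direction = np.zeros(dim)
direction[0] = 1.0
labels = rng.integers(0, 2, size=n_tokens)  # toy labels: 1 = erroneous token

# Toy separation strength per layer; layer index 3 is built to separate best.
strengths = [0.0, 0.5, 1.0, 4.0, 1.0, 0.5]
layers = [
    rng.normal(size=(n_tokens, dim)) + a * labels[:, None] * direction
    for a in strengths
]

evrs = [first_evr(h) for h in layers]
best_layer = int(np.argmax(evrs))

for i, e in enumerate(evrs):
    print(f"layer {i}: first EVR = {e:.3f}")
print("layer with the EVR spike:", best_layer)
```

On real models the per-layer states would come from a forward pass over annotated sentences, and the spike would be compared against the per-layer detection accuracy as in the lower panels.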

### 5.3 Demonstration Selection for Initial Prompt

![Refer to caption](https://arxiv.org/html/2606.15416v1/x6.png)

Figure 6: EVR increments of n-shot initial demonstrations relative to 0-shot.

As observed in [Section 4.2](https://arxiv.org/html/2606.15416#S4.SS2), even randomly selected examples in the initial prompt significantly improve results, although they affect only the initial prediction and not the final output. We attribute this improvement to two factors. First, the few-shot initial prompt helps activate the model's correction capability and aligns the generated outputs with the example format. This alignment is particularly noticeable in low-resource languages such as Estonian, where zero-shot predictions often include English tokens, introducing noise that hinders the PCA process for extracting EVs. Second, within the model, the initial prompt aligns the EVs toward the actual error space. [Figure 6](https://arxiv.org/html/2606.15416#S5.F6) shows that the first explained variance ratio (EVR) increases as more initial examples are added, indicating that the model refines its error space with each new demonstration. This suggests that the examples selected by GER may help the model better characterize the error space, which could be used iteratively in another round of generation to optimize the EVs. We leave this iterative approach for future work.

## 6 Conclusion

In this paper, we delve into the internals of LLMs and develop a novel method for extracting precise and interpretable grammatical error representations \(GER\) with less semantic noise\. The effectiveness of GER in encoding fine\-grained error patterns enables the retrieval of high\-quality error demonstrations, improving the few\-shot performance of LLMs on GEC across diverse language settings\.

Our preliminary exploration and successful utilization of LLMs’ internal states highlight the potential of utilizing the model’s inherent knowledge to strengthen GEC performance, alignment, and interpretability, all without the need for additional components or training resources\.

## Limitations

Our work explores and leverages the knowledge related to error correction within large models\. However, the few\-shot GEC capabilities of LLMs are far from fully realized\. The latter dimensions of our proposed error vectors contain detailed, fine\-grained knowledge about error classification and correction, but they are difficult to separate, visualize, and utilize effectively\. In addition, we did not address the scenario where long sentences with multiple errors outpace the utility of the 8\-shot examples\. In such cases, slicing the long sentence into smaller segments may yield better performance\.

While we have encoded errors and used them for example retrieval in this work, the error information could be applied more broadly in the model’s prediction pipeline, such as in controlling the decoding process\. Future work could investigate simpler ways of representing error information, or develop methods to comprehensively combine and summarize this information for more effective manipulation of model\-generated grammatical error corrections\.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China (62036001) and the National Science and Technology Major Project (No. 2022ZD0116308). The corresponding author is Houfeng Wang.

## References

- A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda (2024) Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang (Eds.). External Links: [Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/f545448535dfde4f9786555403ab7c49-Abstract-Conference.html).
- A. Awasthi, S. Sarawagi, R. Goyal, S. Ghosh, and V. Piratla (2019) Parallel iterative edit models for local sequence transduction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 4260–4270. External Links: [Document](https://dx.doi.org/10.18653/v1/D19-1435), [Link](https://aclanthology.org/D19-1435).
- A. Boyd (2018) Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, W. Xu, A. Ritter, T. Baldwin, and A. Rahimi (Eds.), Brussels, Belgium, pp. 79–84. External Links: [Document](https://dx.doi.org/10.18653/v1/W18-6111), [Link](https://aclanthology.org/W18-6111).
- T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.). External Links: [Link](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html).
- C. Bryant, M. Felice, Ø. E. Andersen, and T. Briscoe (2019) The BEA-2019 shared task on grammatical error correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, H. Yannakoudakis, E. Kochmar, C. Leacock, N. Madnani, I. Pilán, and T. Zesch (Eds.), Florence, Italy, pp. 52–75. External Links: [Document](https://dx.doi.org/10.18653/v1/W19-4406), [Link](https://aclanthology.org/W19-4406).
- C. Bryant, M. Felice, and T. Briscoe (2017) Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), R. Barzilay and M. Kan (Eds.), Vancouver, Canada, pp. 793–805. External Links: [Document](https://dx.doi.org/10.18653/v1/P17-1074), [Link](https://aclanthology.org/P17-1074).
- C. Bryant, Z. Yuan, M. R. Qorib, H. Cao, H. T. Ng, and T. Briscoe (2023) Grammatical error correction: a survey of the state of the art. Computational Linguistics, pp. 643–701. External Links: [Document](https://dx.doi.org/10.1162/coli%5Fa%5F00478), [Link](https://aclanthology.org/2023.cl-3.4).
- A. Caines, L. Benedetto, S. Taslimipoor, C. Davis, Y. Gao, Ø. E. Andersen, Z. Yuan, M. Elliott, R. Moore, C. Bryant, M. Rei, H. Yannakoudakis, A. Mullooly, D. Nicholls, and P. Buttery (2023) On the application of large language models for language teaching and assessment technology. In Proceedings of the Workshop on Empowering Education with LLMs - the Next-Gen Interface and Content Generation 2023 co-located with 24th International Conference on Artificial Intelligence in Education (AIED 2023), Tokyo, Japan, July 7, 2023, S. Moore, J. C. Stamper, R. J. Tong, C. Cao, Z. Liu, X. Hu, Y. Lu, J. Liang, H. Khosravi, P. Denny, A. Singh, and C. Brooks (Eds.), CEUR Workshop Proceedings, Vol. 3487, pp. 173–197. External Links: [Link](https://ceur-ws.org/Vol-3487/paper12.pdf).
- T. Cotet, S. Ruseti, and M. Dascalu (2020) Neural grammatical error correction for romanian. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pp. 625–631.
- D. Dahlmeier and H. T. Ng (2012) Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, E. Fosler-Lussier, E. Riloff, and S. Bangalore (Eds.), Montréal, Canada, pp. 568–572. External Links: [Link](https://aclanthology.org/N12-1067).
- C. Davis, A. Caines, Ø. E. Andersen, S. Taslimipoor, H. Yannakoudakis, Z. Yuan, C. Bryant, M. Rei, and P. Buttery (2024) Prompting open-source and commercial language models for grammatical error correction of english learner text. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), pp. 11952–11967. External Links: [Document](https://dx.doi.org/10.18653/V1/2024.FINDINGS-ACL.711), [Link](https://doi.org/10.18653/v1/2024.findings-acl.711).
- A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024) The llama 3 herd of models. ArXiv preprint abs/2407.21783. External Links: [Link](https://arxiv.org/abs/2407.21783).
- N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread. External Links: [Link](https://transformer-circuits.pub/2021/framework/index.html).
- Y. Fan, F. Jiang, P. Li, and H. Li (2023) GrammarGPT: exploring open-source llms for native chinese grammatical error correction with supervised fine-tuning. In Natural Language Processing and Chinese Computing - 12th National CCF Conference, NLPCC 2023, Foshan, China, October 12-15, 2023, Proceedings, Part III, F. Liu, N. Duan, Q. Xu, and Y. Hong (Eds.), Lecture Notes in Computer Science, Vol. 14304, pp. 69–80. External Links: [Document](https://dx.doi.org/10.1007/978-3-031-44699-3%5F7), [Link](https://doi.org/10.1007/978-3-031-44699-3%5C_7).
- Y. Fei, L. Cui, S. Yang, W. Lam, Z. Lan, and S. Shi (2023) Enhancing grammatical error correction systems with explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 7489–7501. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.413), [Link](https://aclanthology.org/2023.acl-long.413).
- J. He, G. Neubig, and T. Berg-Kirkpatrick (2021) Efficient nearest neighbor language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), Online and Punta Cana, Dominican Republic, pp. 5703–5714. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.461), [Link](https://aclanthology.org/2021.emnlp-main.461).
- S. L. Ingólfsdóttir, P. Ragnarsson, H. Jónsson, H. Simonarson, V. Thorsteinsson, and V. Snæbjarnarson (2023) Byte-level grammatical error correction using synthetic and curated corpora. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 7299–7316. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.402), [Link](https://aclanthology.org/2023.acl-long.402).
- M. Junczys-Dowmunt, R. Grundkiewicz, S. Guha, and K. Heafield (2018) Approaching neural grammatical error correction as a low-resource machine translation task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. Walker, H. Ji, and A. Stent (Eds.), New Orleans, Louisiana, pp. 595–606. External Links: [Document](https://dx.doi.org/10.18653/v1/N18-1055), [Link](https://aclanthology.org/N18-1055).
- M. Kaneko, S. Takase, A. Niwa, and N. Okazaki (2022) Interpretability for language learners using example-based grammatical error correction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio (Eds.), Dublin, Ireland, pp. 7176–7187. External Links: [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.496), [Link](https://aclanthology.org/2022.acl-long.496).
- S. Katsumata and M. Komachi (2020) Stronger baselines for grammatical error correction using a pretrained encoder-decoder model. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, K. Wong, K. Knight, and H. Wu (Eds.), Suzhou, China, pp. 827–832. External Links: [Link](https://aclanthology.org/2020.aacl-main.83).
- U. Khandelwal, A. Fan, D. Jurafsky, L. Zettlemoyer, and M. Lewis (2021) Nearest neighbor machine translation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. External Links: [Link](https://openreview.net/forum?id=7wCBOfJ8hJM).
- S. Lai, Q. Zhou, J. Zeng, Z. Li, C. Li, Y. Cao, and J. Su (2022) Type-driven multi-turn corrections for grammatical error correction. In Findings of the Association for Computational Linguistics: ACL 2022, S. Muresan, P. Nakov, and A. Villavicencio (Eds.), Dublin, Ireland, pp. 3225–3236. External Links: [Document](https://dx.doi.org/10.18653/v1/2022.findings-acl.254), [Link](https://aclanthology.org/2022.findings-acl.254).
- J. Li, J. Guo, Y. Zhu, X. Sheng, D. Jiang, B. Ren, and L. Xu (2022) Sequence-to-action: grammatical error correction with action guided sequence generation. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pp. 10974–10982. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/21345).
- W. Li, W. Luo, G. Peng, and H. Wang (2025) Explanation based in-context demonstrations retrieval for multilingual grammatical error correction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), L. Chiruzzo, A. Ritter, and L. Wang (Eds.), Albuquerque, New Mexico, pp. 4881–4897. External Links: ISBN 979-8-89176-189-6, [Link](https://aclanthology.org/2025.naacl-long.251/).
- W. Li and H. Wang (2024) Detection-correction structure via general language model for grammatical error correction. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 1748–1763. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.96), [Link](https://aclanthology.org/2024.acl-long.96/).
- X. Li and X. Qiu (2023) Finding support examples for in-context learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali (Eds.), Singapore, pp. 6219–6235. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.411), [Link](https://aclanthology.org/2023.findings-emnlp.411).
- K. Liang, S. Davidson, X. Yuan, S. Panditharatne, C. Chen, R. Shea, D. Pham, Y. Tan, E. Voss, and L. Fryer (2023) ChatBack: investigating methods of providing grammatical error feedback in a GUI-based language learning chatbot. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, and T. Zesch (Eds.), Toronto, Canada, pp. 83–99. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.bea-1.7), [Link](https://aclanthology.org/2023.bea-1.7).
- H. Liu, H. Zhang, Z. Guo, K. Dong, X. Li, Y. Q. Lee, C. Zhang, and Y. Liu (2024) CtrlA: adaptive retrieval-augmented generation via probe-guided control. ArXiv preprint abs/2405.18727. External Links: [Link](https://arxiv.org/abs/2405.18727).
- J. Liu (2022) LlamaIndex. External Links: [Document](https://dx.doi.org/10.5281/zenodo.1234), [Link](https://github.com/jerryjliu/llama_index).
- M. Loem, M. Kaneko, S. Takase, and N. Okazaki (2023) Exploring effectiveness of GPT-3 in grammatical error correction: a study on performance and controllability in prompt-based methods. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), E. Kochmar, J. Burstein, A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, and T. Zesch (Eds.), Toronto, Canada, pp. 205–219. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.bea-1.18), [Link](https://aclanthology.org/2023.bea-1.18).
- A. Luhtaru, E. Korotkova, and M. Fishel (2024) No error left behind: multilingual grammatical error correction with pre-trained translation models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Y. Graham and M. Purver (Eds.), St. Julian's, Malta, pp. 1209–1222. External Links: [Link](https://aclanthology.org/2024.eacl-long.73).
- J. Maeng, J. Gu, and S. Kim (2023) Effectiveness of ChatGPT in Korean grammatical error correction. In Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, C. Huang, Y. Harada, J. Kim, S. Chen, Y. Hsu, E. Chersoni, P. A, W. H. Zeng, B. Peng, Y. Li, and J. Li (Eds.), Hong Kong, China, pp. 464–472. External Links: [Link](https://aclanthology.org/2023.paclic-1.46).
- R. Nagata and K. Sakaguchi (2016) Phrase structure annotation and parsing for learner English. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), K. Erk and N. A. Smith (Eds.), Berlin, Germany, pp. 1837–1847. External Links: [Document](https://dx.doi.org/10.18653/v1/P16-1173), [Link](https://aclanthology.org/P16-1173).
- H. T. Ng, S. M. Wu, Y. Wu, C. Hadiwinoto, and J. Tetreault (2013) The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, H. T. Ng, J. Tetreault, S. M. Wu, Y. Wu, and C. Hadiwinoto (Eds.), Sofia, Bulgaria, pp. 1–12. External Links: [Link](https://aclanthology.org/W13-3601).
- K. Omelianchuk, V. Atrasevych, A. Chernodub, and O. Skurzhanskyi (2020) GECToR – grammatical error correction: tag, not rewrite. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, J. Burstein, E. Kochmar, C. Leacock, N. Madnani, I. Pilán, H. Yannakoudakis, and T. Zesch (Eds.), Seattle, WA, USA → Online, pp. 163–170. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.bea-1.16), [Link](https://aclanthology.org/2020.bea-1.16).
- G. Peng, T. Ge, S. Chen, F. Wei, and H. Wang (2023) Semiparametric language models are scalable continual learners. ArXiv preprint abs/2303.01421. External Links: [Link](https://arxiv.org/abs/2303.01421).
- S. Robertson, H. Zaragoza, et al. (2009) The probabilistic relevance framework: bm25 and beyond. Foundations and Trends® in Information Retrieval 3(4), pp. 333–389.
- S. Rothe, J. Mallinson, E. Malmi, S. Krause, and A. Severyn (2021) A simple recipe for multilingual grammatical error correction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Online, pp. 702–707. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.acl-short.89), [Link](https://aclanthology.org/2021.acl-short.89).
- I. Rummo and K. Praakli (2017) TU eesti keele (voorkeelena) osakonna oppijakeele tekstikorpus \[the language learners corpus of the department of estonian language of the university of tartu\]. Proc EAAL.
- A. Saakyan and S. Muresan (2024) ICLEF: in-context learning with expert feedback for explainable style transfer. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 16141–16163. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.854), [Link](https://aclanthology.org/2024.acl-long.854/).
- S. Sheng, Y. Xu, T. Zhang, Z. Shen, L. Fu, J. Ding, L. Zhou, X. Gan, X. Wang, and C. Zhou (2024) RepEval: effective text evaluation with LLM representation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), pp. 7019–7033. External Links: [Link](https://aclanthology.org/2024.emnlp-main.398).
- Y. Song, K. Krishna, R. Bhatt, K. Gimpel, and M. Iyyer (2024) GEE! grammar error explanation with large language models. In Findings of the Association for Computational Linguistics: NAACL 2024, K. Duh, H. Gomez, and S. Bethard (Eds.), Mexico City, Mexico, pp. 754–781. External Links: [Link](https://aclanthology.org/2024.findings-naacl.49).
- F. Stahlberg and S. Kumar (2020) Seq2Edits: sequence transduction using span-level edit operations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 5147–5159. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.418), [Link](https://aclanthology.org/2020.emnlp-main.418).
- F\. Stahlberg and S\. Kumar \(2024\)Synthetic data generation for low\-resource grammatical error correction with tagged corruption models\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),E\. Kochmar, M\. Bexte, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, A\. Tack, V\. Yaneva, and Z\. Yuan \(Eds\.\),Mexico City, Mexico,pp\. 11–16\.External Links:[Link](https://aclanthology.org/2024.bea-1.2)Cited by:[§A\.2](https://arxiv.org/html/2606.15416#A1.SS2.p1.1),[§2\.1](https://arxiv.org/html/2606.15416#S2.SS1.p1.1)\.
- X\. Sun, T\. Ge, F\. Wei, and H\. Wang \(2021\)Instantaneous grammatical error correction with shallow aggressive decoding\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\),Online,pp\. 5937–5947\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.acl-long.462),[Link](https://aclanthology.org/2021.acl-long.462)Cited by:[§1](https://arxiv.org/html/2606.15416#S1.p1.1)\.
- C\. Tang, F\. Qu, and Y\. Wu \(2024\)Ungrammatical\-syntax\-based in\-context example selection for grammatical error correction\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 1758–1770\.External Links:[Link](https://aclanthology.org/2024.naacl-long.99)Cited by:[§A\.4](https://arxiv.org/html/2606.15416#A1.SS4.p1.1),[Table 2](https://arxiv.org/html/2606.15416#S4.T2.1.1.9.2)\.
- J\. Vasselli and T\. Watanabe \(2023\)A closer look at k\-nearest neighbors grammatical error correction\.InProceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2023\),E\. Kochmar, J\. Burstein, A\. Horbach, R\. Laarmann\-Quante, N\. Madnani, A\. Tack, V\. Yaneva, Z\. Yuan, and T\. Zesch \(Eds\.\),Toronto, Canada,pp\. 220–231\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.bea-1.19),[Link](https://aclanthology.org/2023.bea-1.19)Cited by:[§1](https://arxiv.org/html/2606.15416#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.15416#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2606.15416#S2.SS3.p2.1)\.
- K\. R\. Wang, A\. Variengien, A\. Conmy, B\. Shlegeris, and J\. Steinhardt \(2023\)Interpretability in the wild: a circuit for indirect object identification in GPT\-2 small\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/pdf?id=NpsVSN6o4ul)Cited by:[§2\.2](https://arxiv.org/html/2606.15416#S2.SS2.p1.1)\.
- L\. Wang, N\. Yang, and F\. Wei \(2024\)Learning to retrieve in\-context examples for large language models\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),Y\. Graham and M\. Purver \(Eds\.\),St\. Julian’s, Malta,pp\. 1752–1767\.External Links:[Link](https://aclanthology.org/2024.eacl-long.105)Cited by:[§2\.3](https://arxiv.org/html/2606.15416#S2.SS3.p2.1)\.
- M\. Wu, W\. Liu, X\. Wang, T\. Li, C\. Lv, Z\. Ling, J\. Zhu, C\. Zhang, X\. Zheng, and X\. Huang \(2024\)Advancing parameter efficiency in fine\-tuning via representation editing\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),pp\. 13445–13464\.External Links:[Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.726),[Link](https://doi.org/10.18653/v1/2024.acl-long.726)Cited by:[§2\.2](https://arxiv.org/html/2606.15416#S2.SS2.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024\)Qwen2\.5 technical report\.ArXiv preprintabs/2412\.15115\.External Links:[Link](https://arxiv.org/abs/2412.15115)Cited by:[§4\.1](https://arxiv.org/html/2606.15416#S4.SS1.p3.1),[§4\.5](https://arxiv.org/html/2606.15416#S4.SS5.p1.1)\.
- Z\. Yuan and T\. Briscoe \(2016\)Grammatical error correction using neural machine translation\.InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,K\. Knight, A\. Nenkova, and O\. Rambow \(Eds\.\),San Diego, California,pp\. 380–386\.External Links:[Document](https://dx.doi.org/10.18653/v1/N16-1042),[Link](https://aclanthology.org/N16-1042)Cited by:[§1](https://arxiv.org/html/2606.15416#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.15416#S2.SS1.p1.1)\.
- M\. Zeng, J\. Kuang, M\. Qiu, J\. Song, and J\. Park \(2024\)Evaluating prompting strategies for grammatical error correction based on language proficiency\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 6426–6430\.External Links:[Link](https://aclanthology.org/2024.lrec-main.569)Cited by:[§1](https://arxiv.org/html/2606.15416#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.15416#S2.SS1.p1.1)\.
- Y\. Zhang, H\. Kamigaito, and M\. Okumura \(2023\)Bidirectional transformer reranker for grammatical error correction\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 3801–3825\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.234),[Link](https://aclanthology.org/2023.findings-acl.234)Cited by:[§2\.1](https://arxiv.org/html/2606.15416#S2.SS1.p1.1)\.
- Y\. Zhang, B\. Zhang, Z\. Li, Z\. Bao, C\. Li, and M\. Zhang \(2022\)SynGEC: syntax\-enhanced grammatical error correction with a tailored GEC\-oriented parser\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),Abu Dhabi, United Arab Emirates,pp\. 2518–2531\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.162),[Link](https://aclanthology.org/2022.emnlp-main.162)Cited by:[§2\.1](https://arxiv.org/html/2606.15416#S2.SS1.p1.1)\.
- H\. Zhou, Y\. Liu, Z\. Li, M\. Zhang, B\. Zhang, C\. Li, J\. Zhang, and F\. Huang \(2023\)Improving Seq2Seq grammatical error correction via decoding interventions\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 7393–7405\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.495),[Link](https://aclanthology.org/2023.findings-emnlp.495)Cited by:[§2\.1](https://arxiv.org/html/2606.15416#S2.SS1.p1.1),[Table 2](https://arxiv.org/html/2606.15416#S4.T2.1.1.6.2)\.
- A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski, S\. Goel, N\. Li, M\. J\. Byun, Z\. Wang, A\. Mallen, S\. Basart, S\. Koyejo, D\. Song, M\. Fredrikson, J\. Z\. Kolter, and D\. Hendrycks \(2023\)Representation engineering: A top\-down approach to AI transparency\.ArXiv preprintabs/2310\.01405\.External Links:[Link](https://arxiv.org/abs/2310.01405)Cited by:[§2\.2](https://arxiv.org/html/2606.15416#S2.SS2.p1.1)\.

## Appendix A Experimental Settings

### A\.1 Dataset Statistics

Our dataset usage is shown in [Table 7](https://arxiv.org/html/2606.15416#A1.T7)\. The training samples used to construct the database are first filtered by length, with a minimum of 10, to ensure quality\.
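
A minimal sketch of the length filter and the erroneous/correct split, written in Python for illustration. It assumes that length is counted in whitespace-separated tokens and that a sample counts as "correct" when its source equals its target; the text fixes only the minimum of 10, so both choices, along with the name `filter_and_split`, are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, corrected sentence)


def filter_and_split(
    pairs: Iterable[Pair], min_len: int = 10
) -> Tuple[List[Pair], List[Pair]]:
    """Drop short samples, then split the rest into erroneous and correct ones."""
    erroneous: List[Pair] = []
    correct: List[Pair] = []
    for source, target in pairs:
        # Assumption: length is the number of whitespace-separated tokens.
        if len(source.split()) < min_len:
            continue
        # Assumption: a sample is "correct" when no edit is needed.
        if source == target:
            correct.append((source, target))
        else:
            erroneous.append((source, target))
    return erroneous, correct
```

The two returned lists would correspond to the #Erroneous and #Correct counts reported for each training dataset.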

| Language | Training dataset \(as database\) | \#Erroneous | \#Correct | Test dataset | \#Total |
| --- | --- | --- | --- | --- | --- |
| English | W&I\+LOCNESS | 20185 | 6839 | CoNLL\-14 | 1312 |
| | | | | BEA\-19 | 4477 |
| German | Falko\-Merlin | 11801 | 1916 | Falko\-Merlin | 2337 |
| Romanian | RONACC | 6974 | 108 | RONACC | 1519 |
| Estonian | Tartu\-L2\-Corpus | 7156 | 4 | Tartu\-L1\-Corpus | 1453 |

Table 7: Statistics of the GEC datasets used in the experiments\. For the training datasets, \#Erroneous is the number of erroneous samples and \#Correct is the number of correct samples\. For the test datasets, \#Total is the total number of samples\.

### A\.2 Language Diversity

Our language selection aligns with prior multilingual GEC studies \(Luhtaru et al\., [2024](https://arxiv.org/html/2606.15416#bib.bib33); Stahlberg and Kumar, [2024](https://arxiv.org/html/2606.15416#bib.bib68)\), taking into account the diversity of language families\.

- Germanic \(English, German\) and Romance \(Romanian\) languages: both Indo\-European, but from different branches\.
- Uralic \(Estonian\): a non\-Indo\-European language with agglutinative grammar and no grammatical gender, unlike the others\. As a linguistically distant and low\-resource language, Estonian showcases the breadth of GER’s applicability\.

We acknowledge the value of testing additional languages \(e\.g\., Czech, Chinese\) and will explore this in future work\.

### A\.3 Model Settings

To ensure reproducibility, we applied deterministic decoding \(with temperature set to 0 and top\_p set to 1\.0\) during inference\. For the "Random" baseline, samples were selected using three different random seeds, and the results were averaged\.
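
The settings above can be sketched in Python as follows. The decoding values come from the text; the seed values, the function names, and the `evaluate` callback (which would run few-shot GEC and return a score such as $F_{0.5}$) are hypothetical placeholders, since the paper does not specify them.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (erroneous sentence, corrected sentence)

# Deterministic decoding as described in the text.
DECODING_KWARGS = {"temperature": 0.0, "top_p": 1.0}


def random_demonstrations(database: Sequence[Pair], k: int, seed: int) -> List[Pair]:
    """Draw k demonstrations without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(list(database), k)


def averaged_random_baseline(
    database: Sequence[Pair],
    k: int,
    seeds: Sequence[int],
    evaluate: Callable[[List[Pair]], float],
) -> float:
    """Average the score of the "Random" baseline over several seeds."""
    scores = [evaluate(random_demonstrations(database, k, seed)) for seed in seeds]
    return mean(scores)
```

For example, `averaged_random_baseline(database, k=8, seeds=(0, 1, 2), evaluate=...)` would mirror the three-seed average, with the seed values chosen arbitrarily here.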

### A\.4 Prompt Settings

You are a language expert who is responsible for grammatical, lexical, and orthographic error corrections given an input sentence\. Your job is to fix grammatical mistakes, awkward phrases, spelling errors, etc\. following standard written usage conventions, but your corrections must be conservative\. Please keep the original sentence \(words, phrases, and structure\) as much as possible\. The ultimate goal of this task is to make the given sentence sound natural to native speakers without making unnecessary changes\. Corrections are not required when the sentence is already grammatical and sounds natural\.

There is an erroneous sentence between ’<erroneous sentence\>’ and ’</erroneous sentence\>’\. Then grammatical errors in the erroneous sentence will be corrected\. The corrected version will be between ’<corrected sentence\>’ and ’</corrected sentence\>’\.

<erroneous sentence\> \{text\}</erroneous sentence\>

<corrected sentence\> \{label\}</corrected sentence\>

…

<erroneous sentence\> \{text\}</erroneous sentence\>

<corrected sentence\> \{label\}</corrected sentence\>

<erroneous sentence\> \{source\}</erroneous sentence\>

<corrected sentence\>

Table 8: The prompt for the proposed method\. \{text\} and \{label\} denote the input text and the corrected sentence \(label\) of a labeled GEC example\. \{source\} represents the test input text\.

Throughout the entire experiment pipeline, we use the same GEC prompt as prior work \(Tang et al\., [2024](https://arxiv.org/html/2606.15416#bib.bib29); Davis et al\., [2024](https://arxiv.org/html/2606.15416#bib.bib34); Li et al\., [2025](https://arxiv.org/html/2606.15416#bib.bib67)\) to ensure a fair comparison\. The correction prompt is shown in [Table 8](https://arxiv.org/html/2606.15416#A1.T8)\.
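
As an illustration, the prompt template could be assembled in Python roughly as follows. The instruction text is taken from the prompt table; the exact line breaks and whitespace of the original prompt are not recoverable here, so the newline layout and the name `build_prompt` are assumptions.

```python
from typing import List, Tuple

# Instruction text from the prompt table; line breaks are an assumption.
INSTRUCTION = (
    "You are a language expert who is responsible for grammatical, lexical, "
    "and orthographic error corrections given an input sentence. "
    "Your job is to fix grammatical mistakes, awkward phrases, spelling errors, etc. "
    "following standard written usage conventions, but your corrections must be conservative. "
    "Please keep the original sentence (words, phrases, and structure) as much as possible. "
    "The ultimate goal of this task is to make the given sentence sound natural "
    "to native speakers without making unnecessary changes. "
    "Corrections are not required when the sentence is already grammatical and sounds natural.\n"
    "There is an erroneous sentence between '<erroneous sentence>' and '</erroneous sentence>'. "
    "Then grammatical errors in the erroneous sentence will be corrected. "
    "The corrected version will be between '<corrected sentence>' and '</corrected sentence>'."
)


def build_prompt(demonstrations: List[Tuple[str, str]], source: str) -> str:
    """Fill the template with (text, label) demonstrations and the test input."""
    parts = [INSTRUCTION]
    for text, label in demonstrations:
        parts.append(f"<erroneous sentence> {text}</erroneous sentence>")
        parts.append(f"<corrected sentence> {label}</corrected sentence>")
    # The prompt ends with an open tag so the model completes the correction.
    parts.append(f"<erroneous sentence> {source}</erroneous sentence>")
    parts.append("<corrected sentence>")
    return "\n".join(parts)
```

Leaving the final `<corrected sentence>` tag open means the model's continuation is the correction itself, which can be cut at the closing tag.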

### A\.5 Dynamic Selection Setting

Dynamic example selection was introduced to ensure fair benchmarking against prior 8\-shot baselines\. During inference:

- Given a test set of size $N$ and $K_{e}$ retrieved samples per edit, we obtain the GER for each edit in the test set and sort them in ascending order based on the first dimension of GER\.
- Then, we select the top $N \cdot K / K_{e}$ edits and use their corresponding samples to extract demonstrations\.
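
The two steps above can be sketched in Python as follows. This is one reading under stated assumptions: $K$ is taken to be the per-sentence shot budget (8 for the 8-shot baselines), the budget is rounded down with integer division, and the `Edit` class and `select_edits` name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Edit:
    sample_id: int        # index of the test sentence that contains the edit
    ger: Sequence[float]  # GER vector extracted for this edit


def select_edits(
    edits: Sequence[Edit], n_test: int, k_shots: int, k_per_edit: int
) -> List[Edit]:
    """Keep the top N*K/K_e edits, ranked by the first GER dimension (ascending)."""
    # Assumption: the edit budget is rounded down to an integer.
    budget = (n_test * k_shots) // k_per_edit
    ranked = sorted(edits, key=lambda e: e.ger[0])
    return ranked[:budget]
```

Under this reading, each selected edit contributes $K_{e}$ retrieved samples, so the total number of demonstrations stays at roughly $N \cdot K$, comparable to giving every test sentence $K$ shots.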

## Appendix B Time Efficiency

Our GER method can be divided into two parts:

- Example Selection: Requires one forward pass over the test data to extract GER\. Compared to previous methods \(e\.g\., Li et al\. \([2025](https://arxiv.org/html/2606.15416#bib.bib67)\)\), which need to generate explicit explanations, our approach achieves a 50x speedup \(the average explanation length is $L \approx 50$ in Li et al\. \([2025](https://arxiv.org/html/2606.15416#bib.bib67)\)\)\.
- Few\-shot Inference: With selected demonstrations, our inference latency matches that of standard 8\-shot inference, without additional overhead\.
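
A back-of-the-envelope reading of the 50x figure, assuming that cost is counted in LLM forward passes, that generating an explanation takes about one forward pass per generated token, and that prompt-processing costs are comparable for both methods. The symbols $C_{\text{expl}}$ and $C_{\text{GER}}$ are introduced here for illustration only.

$$
\frac{C_{\text{expl}}}{C_{\text{GER}}} \approx \frac{L \ \text{forward passes}}{1 \ \text{forward pass}} = L \approx 50
$$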

## Appendix C Cross\-domain demonstration set

| Domain | Error Type | Input | Label |
| --- | --- | --- | --- |
| Sport | ppp | I have jogged along the riverbank for 45 minutes\. | I **have been jogging** along the riverbank for 45 minutes\. |
| | sp | Yesterday, she try to hold her breath underwater\. | Yesterday, she **tried** to hold her breath underwater\. |
| Art | ppp | For the entire week, Georgia O’Keeffe has painted her first giant flower close\-up\. | For the entire week, Georgia O’Keeffe **has been painting** her first giant flower close\-up\. |
| | sp | Marcel Duchamp submits a urinal to an art show in 1917\. | Marcel Duchamp **submitted** a urinal to an art show in 1917\. |

Table 9: Examples from the manually constructed test set used in [Section 5\.1\.2](https://arxiv.org/html/2606.15416#S5.SS1.SSS2)\.

In [Section 5\.1\.2](https://arxiv.org/html/2606.15416#S5.SS1.SSS2), we used the web version of [Deepseek\-v3](https://chat.deepseek.com/) to build 100 sport\-domain sentences with present perfect progressive \(ppp\) tense errors and 100 art\-domain sentences with simple past \(sp\) tense errors\. We then created cross\-domain probes, such as art\-domain samples with ppp errors and sport\-domain samples with sp errors, to show the proximity and semantic neutrality of our GER\. The created cases are shown in [Table 9](https://arxiv.org/html/2606.15416#A3.T9)\.
