CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction

arXiv cs.CL 06/16/26, 04:00 AM Papers
Summary
Proposes CoCoGEC, a counterfactual generation framework that alters error-irrelevant contexts in GEC training data to improve model robustness, achieving significant F0.5 gains on perturbed benchmarks.
arXiv:2606.15069v1 Announce Type: new Abstract: Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that existing GEC models usually fail to understand error patterns in varying contexts. In this paper, we thoroughly investigate counterfactuals for GEC tasks, where subtle changes to the context can lead to the label flipping issue. We propose CoCoGEC, a counterfactual generation framework that creates copies of training instances with error-irrelevant contexts altered. Our framework systematically generates counterfactuals by (1) generating intra- and inter-sentence counterfactuals that maintain the error patterns as well as the syntax of the original instances by altering the word-level and sentence-level contexts; (2) revising the generated counterfactuals by selecting the instances with flipped labels and a high GEC Mutual Information (MI) coefficient. Extensive experiments show that our method substantially improves the stability of GEC models, outperforming a set of data augmentation baselines. In particular, it achieves absolute F0.5 gains of +9.9, +11.3, and +20.8 points on the perturbed BEA-19*, CoNLL-14*, and TEM-8* data sets. Our code is released at https://github.com/Quinnok/CoCoGEC
# CoCoGEC: Counterfactual Generation for Robust Grammatical Error Correction
Source: [https://arxiv.org/html/2606.15069](https://arxiv.org/html/2606.15069)
Qianyu Wang, Xiaoman Wang, Yuanyuan Liang, Xinyuan Li, Yunshi Lan (corresponding author). East China Normal University. {wangqianyu, xmwang, leonyuany, xyli}@stu.ecnu.edu.cn, yslan@dase.ecnu.edu.cn

###### Abstract

Grammatical error correction (GEC) systems are usually trained and evaluated on GEC benchmarks, but their performance often drops sharply once the surrounding context is slightly perturbed or extended. This indicates that existing GEC models usually fail to understand error patterns in varying contexts. In this paper, we thoroughly investigate counterfactuals for GEC tasks, where subtle changes to the context can lead to the label flipping issue. We propose CoCoGEC, a counterfactual generation framework that creates copies of training instances with error-irrelevant contexts altered. Our framework systematically generates counterfactuals by (1) generating intra- and inter-sentence counterfactuals that maintain the error patterns as well as the syntax of the original instances by altering the word-level and sentence-level contexts; (2) revising the generated counterfactuals by selecting the instances with flipped labels and a high GEC Mutual Information (MI) coefficient. Extensive experiments show that our method substantially improves the stability of GEC models, outperforming a set of data augmentation baselines. In particular, it achieves absolute $F_{0.5}$ gains of $+9.9$, $+11.3$, and $+20.8$ points on the perturbed BEA-19\*, CoNLL-14\*, and TEM-8\* data sets. Our code is released at [https://github.com/Quinnok/CoCoGEC](https://github.com/Quinnok/CoCoGEC).


## 1 Introduction

Grammatical Error Correction (GEC) aims to automatically detect and correct grammatical errors in text, supporting applications such as intelligent writing assistants and computer-assisted language learning. It has attracted increasing attention from both academia and industry in recent years (Katinskaia and Yangarber, [2023](https://arxiv.org/html/2606.15069#bib.bib43), [2024](https://arxiv.org/html/2606.15069#bib.bib45); Li et al., [2025](https://arxiv.org/html/2606.15069#bib.bib46); Kovalchuk et al., [2025](https://arxiv.org/html/2606.15069#bib.bib47)). However, we observe a substantial gap between how well-trained GEC models perform on benchmarks and how they perform in real-world applications.

![Refer to caption](https://arxiv.org/html/2606.15069v1/x1.png)

Figure 1: Motivation for CoCoGEC. (a) Context shift between standard benchmarks and real-world inputs. (b) Robustness drop on TEM-8 under perturbations with GPT-4. (c) Illustrative examples of the two types of counterfactuals.

Figure [1](https://arxiv.org/html/2606.15069#S1.F1)(a) illustrates an example from the BEA-19 task where “least” should be corrected to “most” and “facilitate” to “facilitating” in the context of “Discovery learning”. GPT-4 correctly revises this sentence in the standard benchmark setting, but GEC models often fail when similar errors appear in diverse real-world contexts. For example, when the same erroneous phrase reappears in a longer passage about “Active participation”, GPT-4 leaves it unchanged and thus under-corrects.

To quantify this robustness gap, we conduct a preliminary experiment. We generate the "real-world" data via two augmentation methods: (1) word-level perturbation, involving random token alterations in test sentences from TEM-8 (Yang, [2017](https://arxiv.org/html/2606.15069#bib.bib97)); and (2) sentence-level perturbation, constructed by randomly combining sentences from the test set (implementation details of the preliminary experiment can be found in Appendix [A](https://arxiv.org/html/2606.15069#A1)). As shown in Figure [1](https://arxiv.org/html/2606.15069#S1.F1)(b), there is an observable performance drop between the original GEC data and the "real-world" data, especially for sentence-level perturbation.

Prior studies (Zhang et al., [2023](https://arxiv.org/html/2606.15069#bib.bib64); Wang et al., [2024a](https://arxiv.org/html/2606.15069#bib.bib52)) have revealed that current GEC models are vulnerable to seemingly harmless perturbations. However, these studies mainly focus on noise-based attacks or broad augmentation, rather than interpreting the potential perturbations that fundamentally affect GEC models. For example, [Wan et al.](https://arxiv.org/html/2606.15069#bib.bib49) ([2020](https://arxiv.org/html/2606.15069#bib.bib49)) and [Park et al.](https://arxiv.org/html/2606.15069#bib.bib71) ([2023](https://arxiv.org/html/2606.15069#bib.bib71)) use noise injection, [Lichtarge et al.](https://arxiv.org/html/2606.15069#bib.bib72) ([2019](https://arxiv.org/html/2606.15069#bib.bib72)) and [Stahlberg and Kumar](https://arxiv.org/html/2606.15069#bib.bib73) ([2021](https://arxiv.org/html/2606.15069#bib.bib73)) generate pseudo corpora, and [Wang et al.](https://arxiv.org/html/2606.15069#bib.bib52) ([2024a](https://arxiv.org/html/2606.15069#bib.bib52)) and [Li and Lan](https://arxiv.org/html/2606.15069#bib.bib74) ([2025](https://arxiv.org/html/2606.15069#bib.bib74)) propose contextual augmentation. These approaches expand or re-distribute training data, but the perturbations are often random or coarsely controlled, limiting their ability to explain the performance gap.

In this paper, we address robustness to word- and sentence-level perturbations by asking: how can we make a GEC model focus more on error patterns while ignoring varying context? To this end, we propose CoCoGEC, a novel method inspired by counterfactual analysis. The intuition behind CoCoGEC is to create copies of training instances with their error-irrelevant contexts altered. We identify two types of decoupled counterfactuals for GEC data, which alter the word-level and sentence-level contexts without influencing the original error patterns in a sentence, yet can confuse the prediction of a GEC model. We illustrate these GEC counterfactuals in Figure [1](https://arxiv.org/html/2606.15069#S1.F1)(c). Trained with such counterfactuals, a GEC model learns to put more emphasis on the error patterns when correcting a sentence.

CoCoGEC uses large language models (LLMs) to generate span-controlled intra-sentence variants that substitute error-irrelevant spans while keeping the gold correction edits valid. It also constructs inter-sentence variants by attaching coherent, error-free prefixes and suffixes to emulate discourse-level context shifts. We enforce an edit-level fidelity constraint to filter invalid candidates, then rank the remaining counterfactuals with a GEC mutual-information score and keep the most challenging ones for augmentation. The experimental results consistently verify that CoCoGEC improves the robustness of GEC models.

The main contributions of this work are:

- To the best of our knowledge, this is the first study to explore counterfactuals for GEC tasks by characterizing potential perturbations with three criteria.
- We introduce CoCoGEC, a counterfactual generation pipeline tailored to contexts in GEC, which systematically constructs intra-sentence and inter-sentence variants that do not influence the original error pattern but can confuse the prediction of a GEC model.
- We propose a novel GEC mutual information coefficient that captures the dependence between the varying context and the model predictions for identifying high-quality counterfactuals.
- We demonstrate on the RobustGEC benchmark that CoCoGEC consistently improves robustness under both intra- and inter-sentence perturbations, without sacrificing performance on standard test settings.

## 2 Related Work

#### Robust GEC.

Modern GEC systems mainly fall into sequence-to-sequence generation (Vaswani et al., [2017](https://arxiv.org/html/2606.15069#bib.bib14); Junczys-Dowmunt et al., [2018](https://arxiv.org/html/2606.15069#bib.bib27)), sequence-to-edit correction (Awasthi et al., [2019](https://arxiv.org/html/2606.15069#bib.bib28); Stahlberg and Kumar, [2020](https://arxiv.org/html/2606.15069#bib.bib29); Omelianchuk et al., [2020](https://arxiv.org/html/2606.15069#bib.bib16); Qorib and Ng, [2023](https://arxiv.org/html/2606.15069#bib.bib95)) (including hybrid detection–correction variants (Li et al., [2023](https://arxiv.org/html/2606.15069#bib.bib65); Li and Wang, [2024](https://arxiv.org/html/2606.15069#bib.bib89))), and recent LLM-based pipelines with prompting or light supervision (Loem et al., [2023](https://arxiv.org/html/2606.15069#bib.bib58); Coyne et al., [2023](https://arxiv.org/html/2606.15069#bib.bib59); Katinskaia and Yangarber, [2024](https://arxiv.org/html/2606.15069#bib.bib45); Tang et al., [2024](https://arxiv.org/html/2606.15069#bib.bib104)). Despite strong benchmark performance, existing models can be brittle under small contextual shifts, motivating robustness-oriented training and data construction. Robustness is typically pursued along two complementary directions. Model-centric methods improve stability by explicitly regularizing invariance via adversarial objectives (Dang et al., [2021](https://arxiv.org/html/2606.15069#bib.bib40)), distillation-style constraints (Xia et al., [2022](https://arxiv.org/html/2606.15069#bib.bib41)), or consistency-based post-training on constructed variants and hard cases (e.g., RobustGEC/TemplateGEC/CLEME2.0/CSA) (Zhang et al., [2023](https://arxiv.org/html/2606.15069#bib.bib64); Li et al., [2023](https://arxiv.org/html/2606.15069#bib.bib65); Ye et al., [2024](https://arxiv.org/html/2606.15069#bib.bib57); Tang et al., [2023](https://arxiv.org/html/2606.15069#bib.bib23)). Data-centric methods instead broaden supervision by synthesizing training pairs through noise injection (Solyman et al., [2023](https://arxiv.org/html/2606.15069#bib.bib26); Sun et al., [2023](https://arxiv.org/html/2606.15069#bib.bib25)), back-translation (Fang et al., [2023](https://arxiv.org/html/2606.15069#bib.bib22)), and contextual or edit-based augmentation (Wang et al., [2024a](https://arxiv.org/html/2606.15069#bib.bib52); Ye et al., [2023](https://arxiv.org/html/2606.15069#bib.bib50)), sometimes coupled with robustness-oriented annotation or curricula (Li and Lan, [2025](https://arxiv.org/html/2606.15069#bib.bib74); Zhang et al., [2025](https://arxiv.org/html/2606.15069#bib.bib107)). However, much of this augmentation primarily targets error diversity or reweighting, and indiscriminate synthetic data may even degrade GEC performance (Park et al., [2023](https://arxiv.org/html/2606.15069#bib.bib71)). Our approach is also data-centric, but in contrast it generates context-decoupled counterfactuals under an edit-subset constraint ($E' \subseteq E$) to target context robustness, and the resulting data can be used with various GEC backbones.

#### Counterfactual Analysis Beyond GEC\.

Counterfactual data augmentation (CDA) improves robustness by generating controlled perturbations that preserve or systematically modify labels, encouraging models to rely on invariant features and generalize out of distribution (Wang et al., [2024b](https://arxiv.org/html/2606.15069#bib.bib108); Jiang et al., [2024](https://arxiv.org/html/2606.15069#bib.bib109)). Recent progress largely comes from strengthening controllability and label fidelity in generation: diffusion-based frameworks provide a powerful mechanism for robust synthesis and transfer (Xin et al., [2024](https://arxiv.org/html/2606.15069#bib.bib75); Chen et al., [2024](https://arxiv.org/html/2606.15069#bib.bib79); Bae et al., [2025](https://arxiv.org/html/2606.15069#bib.bib80); Wang and Wan, [2022](https://arxiv.org/html/2606.15069#bib.bib78)), while optimization-driven formulations enforce invariance through reinforcement learning, information bottlenecks, and contrastive objectives (Chen et al., [2021](https://arxiv.org/html/2606.15069#bib.bib83); Sreedhar et al., [2025](https://arxiv.org/html/2606.15069#bib.bib90); Chang et al., [2024](https://arxiv.org/html/2606.15069#bib.bib56); Choi et al., [2022](https://arxiv.org/html/2606.15069#bib.bib84)). In parallel, counterfactuals have shifted from rule-based edits to more controllable generative pipelines, including distillation and LLM-driven synthesis (Chen et al., [2023b](https://arxiv.org/html/2606.15069#bib.bib51); Youssef et al., [2024](https://arxiv.org/html/2606.15069#bib.bib91); Howard et al., [2022](https://arxiv.org/html/2606.15069#bib.bib36); Treviso et al., [2023](https://arxiv.org/html/2606.15069#bib.bib87); Zhou et al., [2023](https://arxiv.org/html/2606.15069#bib.bib88)), as well as explanation-oriented designs that improve interpretability and faithfulness (Yang et al., [2024](https://arxiv.org/html/2606.15069#bib.bib85); An et al., [2025](https://arxiv.org/html/2606.15069#bib.bib86)).

Collectively, these studies offer broad tools for building context\-invariant NLP models, but they focus on classification tasks, leaving counterfactual generation for structured prediction and GEC comparatively underexplored\. Our work brings CDA to GEC by designing contextual counterfactuals tailored to error correction, rather than label\-flipping counterfactuals commonly used in discriminative settings\.

## 3 Method

![Refer to caption](https://arxiv.org/html/2606.15069v1/x2.png)

Figure 2: Overview of CoCoGEC: we generate span-controlled intra-sentence variants and attach a coherent prefix/suffix to form long-context counterfactuals, then filter by edit-set consistency and rank by GEC mutual information to keep the most confusing yet valid augmentations.

### 3.1 Definition of Counterfactuals for GEC

Existing studies (Wang et al., [2024c](https://arxiv.org/html/2606.15069#bib.bib67); Verma et al., [2024](https://arxiv.org/html/2606.15069#bib.bib68)) have given a formal definition of counterfactuals in general machine learning tasks. A counterfactual example $c$ usually disturbs a model so that it predicts an instance $x$ as an alternative class $y'$ instead of its original class $y$, by making *minimal yet necessary* changes to $x$ as follows:

$$\operatorname*{argmin}_{c}\ \text{dist}(x,c) \quad \text{s.t.}\quad f(c)\neq f(x)$$

where $f:\mathtt{X}\rightarrow\mathtt{Y}$, with $\mathtt{X}\subseteq\mathbb{R}^{d}$, is a task-specific model that maps $x$ to $y$, and $\text{dist}(\cdot,\cdot)$ is a distance function that measures the cost of the changes required to alter the prediction.

The above definition outlines the fundamental principles of the counterfactual generation problem. Considering the distinct problem formulation of the GEC task, we define a counterfactual for GEC as a subtle change to the source text that results in a subset of the original edits. Motivated by the example in Figure [1](https://arxiv.org/html/2606.15069#S1.F1), we mainly focus on intra-sentence and inter-sentence counterfactuals. We consider a counterfactual of the form $c=p\oplus s'\oplus q$, where $s'$ minimally modifies the source text $s$, and $p$ and $q$ are a prefix and a suffix to the source text, respectively. As a result, we have:

$$\operatorname*{argmin}_{c}\ \text{dist}(s,c)=\text{syntax\_dist}(s,s')+\text{semantic\_dist}(s',p\oplus q) \quad \text{s.t.}\quad \mathcal{E}'\subseteq\mathcal{E} \tag{1}$$

Formally, $(s,t)$ denotes the annotated source and target text for grammatical error correction. Ideally, $f(\cdot)$ is a GEC model that takes the source text as input and perfectly produces the corrected text, such that $f(s)=t$ and $f(c)=t'$. The edit mappings of the original text $s\rightarrow t$ and the counterfactual text $c\rightarrow t'$ are denoted as $\mathcal{E}$ and $\mathcal{E}'$, respectively.

Regarding Equation ([1](https://arxiv.org/html/2606.15069#S3.E1)), we interpret counterfactuals for GEC as follows:

- Minimal syntactic revision to the source text. For the intra-sentence revision, we make minor wording adjustments, which may change the semantics of the sentence but not its syntax. We denote this as minimal $\text{syntax\_dist}(s,s')$.
- Semantic coherence with the revision. For the inter-sentence revision, we attach a prefix and a suffix to the revised source text while keeping them semantically coherent with it, which prevents an illogical flow in the counterfactuals. We denote this as minimal $\text{semantic\_dist}(s',p\oplus q)$.
- Flipped edit labels. We consider that a good counterfactual for GEC can flip the prediction of a GEC model, so that the original edits are no longer recognized well. To avoid intertwined grammatical errors, new edits cannot be introduced. Hence, we require $\mathcal{E}'$ to be a subset of $\mathcal{E}$ as the outcome of the counterfactuals.

### 3.2 Counterfactual Generation with LLMs

Next, we generate counterfactuals following the above definition. We first aim to construct an $s'$ that is syntactically similar to $s$ and could flip the original error annotations. However, it is not trivial to control the perturbation of the source sentence. Unlike the label-flipping objective in discriminative settings, our goal in GEC is to *preserve* the original grammatical errors and *not include* new errors. Thus, we would like the error-irrelevant context to be perturbed while keeping the syntax unchanged.

We propose a counterfactual generation pipeline for GEC that integrates intra-sentence and inter-sentence perturbation, corresponding to $s'$ and $p\oplus q$ in Equation ([1](https://arxiv.org/html/2606.15069#S3.E1)), respectively. Specifically, we first distill counterfactuals for $s'$ generation with LLMs in a controllable manner. Then, we generate $p$ and $q$ for the inter-sentence perturbation. Finally, we extract the counterfactuals that satisfy our objective of flipping the original edits while not introducing new edits. This pipeline results in an integrated counterfactual of the form $p\oplus s'\oplus q$.

#### Intra-sentence counterfactual generation of $s'$.

To generate $s'$, we aim for a minimal syntactic revision of the source text $s$. A prior study, Chen et al. ([2023b](https://arxiv.org/html/2606.15069#bib.bib51)), introduces the $\mathcal{DISCO}$ method, which prompts LLMs via in-context learning to generate counterfactuals for natural language inference while controlling the spans used for variant generation. We follow the principle of $\mathcal{DISCO}$ and distill intra-sentence counterfactuals with LLMs as follows:

- We first segment each source sentence $s$ into a sequence of candidate spans $\mathcal{W}=\{w \mid w\in s\}$ using a Flair-based chunker (Akbik et al., [2019](https://arxiv.org/html/2606.15069#bib.bib8)). To prevent unexpected edits to the erroneous regions, we discard all spans overlapping with $\mathcal{E}$, keeping only spans outside the gold edits.
- For each retained span $w\in\mathcal{W}$, we sample a masked variant by replacing $w$ with a special token [BLANK], and feed it into an LLM with an instruction asking the model to fill the blank with an alternative phrase, which can realize an insertion, deletion, or replacement of $w$. We generate multiple variants for each $s$.
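
The two steps above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the chunk boundaries and gold edit spans are hypothetical token offsets written by hand (in the paper they come from a Flair-based chunker and the gold annotations), and the names `overlaps` and `masked_variants` are our own. The LLM call that fills each `[BLANK]` is omitted.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token offsets [start, end)


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two half-open token spans intersect."""
    return a[0] < b[1] and b[0] < a[1]


def masked_variants(tokens: List[str], chunks: List[Span],
                    edit_spans: List[Span], blank: str = "[BLANK]") -> List[str]:
    """Mask every candidate chunk that does not touch a gold edit span."""
    variants = []
    for start, end in chunks:
        # Discard spans overlapping the erroneous regions.
        if any(overlaps((start, end), e) for e in edit_spans):
            continue
        variants.append(" ".join(tokens[:start] + [blank] + tokens[end:]))
    return variants


tokens = ("American popular music , along with its many variants , "
          "are selled to a demand audience .").split()
chunks = [(0, 3), (4, 9), (10, 12), (12, 16)]  # hypothetical chunker output
edit_spans = [(10, 12), (14, 15)]              # hypothetical gold edit spans
variants = masked_variants(tokens, chunks, edit_spans)
for v in variants:
    print(v)
```

Only the chunks "American popular music" and "along with its many variants" are masked here, because the other two chunks overlap an assumed edit span; each masked string would then be sent to the LLM as a fill-in prompt.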

As shown in Figure [2](https://arxiv.org/html/2606.15069#S3.F2), the illustrative example “American popular music, along with its many variants, are selled to a demand audience.” can have multiple intra-sentence counterfactuals with unchanged syntax. To preserve alignment between the perturbed source and its correction, we apply LLM filling only to non-error spans. Whenever a span $w$ in the source sentence $s$ is replaced by an LLM-generated fragment $\tilde{w}$, we simultaneously replace the aligned span in the gold target $t$ with the same $\tilde{w}$. As a result, we collect a set of counterfactual sentence pairs, denoted as $(s',t')$.

#### Inter-sentence counterfactual generation of $p\oplus q$.

To probe long-range context and expose GEC models to paragraph-level correction, we generate a prefix and a suffix for $s'$. To ensure the semantic coherence of the revision, for each pair $(s',t')$, we automatically attach error-free context on both sides by sampling a short grammatical prefix $p$ and suffix $q$ from an LLM. As in the example in Figure [2](https://arxiv.org/html/2606.15069#S3.F2), the intra-sentence counterfactual is placed in the middle of a longer context. Eventually, we obtain $c=p\oplus s'\oplus q$ as defined in Equation ([1](https://arxiv.org/html/2606.15069#S3.E1)). We attach the same prefix and suffix to $t'$ (keeping the notation $t'$ for the extended target), and the sentence pair $(c,t')$ forms a pre-defined counterfactual. It is worth noting that since $p$ and $q$ are error-free contexts, the edit mapping $c\rightarrow t'$ will not introduce new errors beyond the span of $s$. We denote the set of generated counterfactuals as $\mathcal{C}=\{(c_{i},t_{i}')\}_{i=1}^{n}$.
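
As a rough sketch of how a pair $(c, t')$ is assembled, the snippet below replaces an error-irrelevant span in both source and target with the same filler and then attaches a shared prefix and suffix. The sentences, span offsets, and helper names (`replace_span`, `build_counterfactual`) are illustrative assumptions, and the span alignment between source and target is taken as given rather than computed.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token offsets [start, end)


def replace_span(tokens: List[str], span: Span, filler: List[str]) -> List[str]:
    """Replace tokens[start:end] with the filler tokens."""
    start, end = span
    return tokens[:start] + filler + tokens[end:]


def build_counterfactual(src: List[str], tgt: List[str],
                         src_span: Span, tgt_span: Span, filler: List[str],
                         prefix: List[str], suffix: List[str]):
    """Return (c, t') with the same filler and the same context on both sides."""
    s_prime = replace_span(src, src_span, filler)   # intra-sentence variant s'
    t_prime = replace_span(tgt, tgt_span, filler)   # aligned target variant
    c = prefix + s_prime + suffix                   # c = p (+) s' (+) q
    t_long = prefix + t_prime + suffix              # target with the same p, q
    return c, t_long


src = "She go to school every day .".split()
tgt = "She goes to school every day .".split()
c, t_long = build_counterfactual(
    src, tgt, src_span=(4, 6), tgt_span=(4, 6),
    filler="each morning".split(),
    prefix="It is a routine .".split(),
    suffix="Her teachers appreciate it .".split(),
)
# Positions where the counterfactual source and its target disagree.
diff = [i for i, (a, b) in enumerate(zip(c, t_long)) if a != b]
print(" ".join(c))
print(diff)
```

Because the filler and the context are identical on both sides, the only remaining difference between `c` and `t_long` is the original error ("go" vs. "goes"), which is the property the edit-subset constraint relies on.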

### 3.3 Counterfactual Revision with GEC Mutual Information

#### Edit-subset counterfactual filtering.

To ensure that label flipping occurs in $c$ within the span of $s$, we conduct filtering. Specifically, we employ ERRANT (Bryant et al., [2017b](https://arxiv.org/html/2606.15069#bib.bib66)) to extract the edit mappings of $s\rightarrow t$ and $c\rightarrow t'$, producing $\mathcal{E}$ and $\mathcal{E}'$, respectively. If a sentence pair satisfies $\mathcal{E}'\subseteq\mathcal{E}$, we keep it; otherwise, we discard it. This results in a set of counterfactuals that follows our pre-defined principle.
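
The subset check itself is simple once the edits are available. The sketch below assumes edits have already been extracted (for example with ERRANT) and represents each edit as an `(original string, corrected string)` pair; this representation is our simplification, chosen because token offsets shift once a prefix is attached, and it ignores repeated identical edits. The function name `keep_counterfactual` is hypothetical.

```python
from typing import Iterable, Tuple

Edit = Tuple[str, str]  # (original string, corrected string); offsets ignored


def keep_counterfactual(orig_edits: Iterable[Edit], cf_edits: Iterable[Edit]) -> bool:
    """Keep a counterfactual only if its edit set E' is a subset of E."""
    return set(cf_edits) <= set(orig_edits)


# Hypothetical edit sets for the "American popular music" example.
E = [("are", "is"), ("selled", "sold"), ("demand", "demanding")]
E_valid = [("are", "is"), ("selled", "sold")]          # subset of E
E_invalid = [("are", "is"), ("product", "products")]   # introduces a new edit

print(keep_counterfactual(E, E_valid))
print(keep_counterfactual(E, E_invalid))
```

A stricter variant could also compare edit types or positions relative to the start of $s'$; which key the original pipeline uses is not stated in the text.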

#### Mutual\-information scoring and selection\.

The above steps ensure that a counterfactual is *valid* but not *optimal*. An optimal counterfactual in GEC tasks should be a hard negative that is highly related to the correct form but particularly challenging for a GEC model to distinguish. In other words, the flipped edits should be difficult for a GEC model to detect. In Figure [2](https://arxiv.org/html/2606.15069#S3.F2)(b), we show two counterfactual candidates derived from the source sentence about “American popular music”. The retained candidate makes only a minor contextual insertion (e.g., “generally”) while keeping the original subject, forming a challenging near-miss that remains compatible with the gold target. By contrast, another candidate changes the subject to “Consumer products”, which becomes semantically misaligned with the original correction target and is therefore ranked low and filtered out. To turn the large pool of generated counterfactuals into high-quality examples, we propose a novel mutual-information-based scoring function to measure the GEC mutual information of a counterfactual. Previous studies on counterfactual data augmentation (Chen et al., [2018](https://arxiv.org/html/2606.15069#bib.bib69); Plyler and Chi, [2025](https://arxiv.org/html/2606.15069#bib.bib70)) adapted Mutual Information (MI) to counterfactual generation under the following assumption:

- Seeking a counterfactual is cast as a maximum mutual information criterion:

  $$I(c;y)=\mathbb{E}_{\mathtt{C},\mathtt{Y}}\left[\log\frac{P_{\mathtt{C},\mathtt{Y}}(c,y)}{P_{\mathtt{C}}(c)\,P_{\mathtt{Y}}(y)}\right]$$

  where $c$ denotes the counterfactual and $y$ denotes the original prediction. $I(c;y)$ measures the dependence between the counterfactual and the original prediction.

For the GEC task, we measure the dependence between the erroneous source text $c$ and the original target text $t$ as follows:

$$\operatorname*{argmax}_{c\in\mathcal{C}}\ I(c;t)=\mathbb{E}_{\mathtt{C},\mathtt{T}}\left[\log\frac{P_{\mathtt{C},\mathtt{T}}(c,t)}{P_{\mathtt{C}}(c)\,P_{\mathtt{T}}(t)}\right]=\mathbb{E}_{\mathtt{C},\mathtt{T}}\left[\log P_{\mathtt{T}|\mathtt{C}}(t|c)-\log P_{\mathtt{T}}(t)\right]$$

As the formula shows, a good counterfactual $c$ in the GEC task should have a high probability $P_{\mathtt{T}|\mathtt{C}}(t|c)$ of being transferred to the original target text $t$, meaning that the perturbations in $c$ do not substantially alter the original prediction. The score is also influenced by the fluency of the target $t$: a smaller $\log P_{\mathtt{T}}(t)$ indicates a less fluent $t$, which makes the GEC model more confused.

In GEC tasks, we approximate these two terms using neural GEC models. For a GEC model with a Seq2Seq framework, we compute the length-normalized joint log-probability of the sequential tokens in the target text:

$$\log P_{\mathtt{T}|\mathtt{C}}(t|c)=\frac{1}{|t|}\sum_{i=1}^{|t|}\log \text{GEC}_{\text{Seq2seq}}\bigl(w_{i}\mid c,w_{<i}\bigr),$$

where $w_{i}$ denotes the $i$-th generated token in $t$.

For a GEC model with the Seq2Edit framework, we approximate the term with the length-normalized joint log-probability of the sequential operations leading to the target text:

$$\log P_{\mathtt{T}|\mathtt{C}}(t|c)=\frac{1}{|c|}\sum_{i=1}^{|c|}\log \text{GEC}_{\text{Seq2edit}}\bigl(e_{i}\mid c,e_{<i}\bigr),$$

where $e_{i}$ denotes the operation applied to the $i$-th token in $c$.

For $\log P_{\mathtt{T}}(t)$, we employ GPT-2-medium (Radford et al., [2019](https://arxiv.org/html/2606.15069#bib.bib13)) to compute the length-normalized joint log-probability of the tokens in the target text:

$$\log P_{\mathtt{T}}(t)=\frac{1}{|t|}\sum_{i=1}^{|t|}\log \text{LM}\bigl(w_{i}\mid w_{<i}\bigr),$$

where $w_{i}$ denotes the $i$-th token in $t$.

We compute an MI score for each counterfactual in $\mathcal{C}$, sort them in descending order, and select the top $k$ percent to form the final set. MI scoring reflects our desiderata: a high-quality counterfactual is a close near-miss to the correct form rather than a random, malformed utterance.
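
The scoring and selection step can be sketched as below, assuming the per-token log-probabilities have already been obtained from a GEC scorer and a language model. The numbers are made up for illustration, the function names are our own, and rounding the cut-off up with `ceil` is an assumption, since the text does not say how the top $k$ percent is rounded.

```python
import math
from typing import List, Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability: average of per-token log-probs."""
    return sum(token_logprobs) / len(token_logprobs)


def mi_score(cond_logprobs: Sequence[float], lm_logprobs: Sequence[float]) -> float:
    """Per-example MI estimate: log P(t|c) - log P(t), both length-normalized."""
    return mean_logprob(cond_logprobs) - mean_logprob(lm_logprobs)


def select_top_k(items: List[str], scores: List[float], k_percent: float) -> List[str]:
    """Sort by score in descending order and keep the top k percent (rounded up)."""
    n_keep = max(1, math.ceil(len(items) * k_percent / 100))
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order[:n_keep]]


# Made-up per-token log-probs for three candidate counterfactuals.
candidates = ["c1", "c2", "c3"]
cond = [[-0.1, -0.2], [-1.0, -1.2], [-0.5, -0.5]]  # from a GEC scorer: log P(w_i | c, w_<i)
lm = [[-2.0, -2.0], [-2.0, -2.0], [-2.0, -2.0]]    # from a language model: log P(w_i | w_<i)
scores = [mi_score(c, l) for c, l in zip(cond, lm)]
selected = select_top_k(candidates, scores, k_percent=30)
print(scores, selected)
```

With these toy numbers, `c1` has the highest conditional likelihood of the target and is the single candidate retained at $k=30\%$.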

Table 1: Main results on RobustGEC test sets. We denote the perturbed data subsets in RobustGEC as BEA-19\*, CoNLL-14\*, and TEM-8\*. CoCoGEC consistently improves $F_{0.5}$ over CPR, $\mathcal{DISCO}$ generation, and TypeDA across all backbone–dataset pairs, with the largest absolute gains on the long-context TEM-8\* data.

Table 2: Performance breakdown on TEM-8 across disturbance settings. $\Delta\downarrow$ denotes the drop from the Source setting, computed as $F_{0.5}^{\text{Source}}-F_{0.5}^{\text{Perturbed}}$; lower $\Delta$ indicates better robustness.

## 4 Experimental Setup

### 4.1 Dataset

We conduct all experiments on RobustGEC (Zhang et al., [2023](https://arxiv.org/html/2606.15069#bib.bib64)), which augments BEA-19, CoNLL-14, and TEM-8 with robustness-oriented perturbations to error-irrelevant context. Each original sentence pair has several human-generated variants, and we split the data by case into training, development, and test sets in a 7:1:2 ratio, keeping all variants of a case in the same split. For long-context evaluation, we use GPT-4 ([https://openai.com/research/gpt-4](https://openai.com/research/gpt-4)) to add a shared prefix $p$ and suffix $q$ to each source–target pair, yielding long-context test sets denoted BEA-19\*, CoNLL-14\*, and TEM-8\*. Unless otherwise noted, models are trained and tuned only on the original sentence pairs.
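
A case-level split like the one described can be sketched as follows. The shuffling, the seed, and the function name `split_by_case` are assumptions for illustration; the point is only that the split is made over case identifiers, so every variant of a case inherits the split of its case.

```python
import random
from typing import Dict, List, Sequence


def split_by_case(case_ids: Sequence[str], ratios=(7, 1, 2), seed: int = 0) -> Dict[str, List[str]]:
    """Split unique case ids into train/dev/test so variants never cross splits."""
    ids = sorted(set(case_ids))
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_dev = len(ids) * ratios[1] // total
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],
    }


# Each row is (case id, variant id); all variants of a case share the case id.
rows = [(f"case{i}", v) for i in range(10) for v in range(3)]
splits = split_by_case([case for case, _ in rows])
case_to_split = {c: name for name, ids in splits.items() for c in ids}
row_splits = [(case, v, case_to_split[case]) for case, v in rows]
print({name: len(ids) for name, ids in splits.items()})
```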

Table 3: Robustness to the number of perturbations, from $1$ to $5$. CoCoGEC maintains a consistent advantage over the baseline across most attack settings, indicating improved robustness under stronger perturbations.

Table 4: Ablation of counterfactual components on TEM-8 with GECToR, showing that span-/document-level edits and MI-based selection ($k=30\%$) give the best $F_{0.5}$ and robustness.

### 4.2 Evaluation Metrics

We report standard edit-based metrics (precision, recall, and $F_{0.5}$) computed with ERRANT (Bryant et al., [2017a](https://arxiv.org/html/2606.15069#bib.bib9)). For RobustGEC, we additionally report CRS and P-CRS (Zhang et al., [2023](https://arxiv.org/html/2606.15069#bib.bib64)) to quantify correction consistency across context-perturbed variants (case-level and pairwise, respectively). For attacked sets, following CSA (Tang et al., [2023](https://arxiv.org/html/2606.15069#bib.bib23)), we further report SR and TR, measuring recovery at the sentence and token levels; higher values indicate better robustness.
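
For reference, $F_{0.5}$ weights precision twice as heavily as recall: $F_{\beta}=(1+\beta^{2})\,PR/(\beta^{2}P+R)$ with $\beta=0.5$. The snippet below computes it from edit-level counts; the counts are invented, and treating an empty denominator as a score of 1.0 is a convention we assume here rather than something stated in the text.

```python
from typing import Tuple


def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> Tuple[float, float, float]:
    """Return (precision, recall, F_beta) from true/false positive and false negative edit counts."""
    p = tp / (tp + fp) if (tp + fp) else 1.0  # assumed convention: no proposed edits
    r = tp / (tp + fn) if (tp + fn) else 1.0  # assumed convention: no gold edits
    if p + r == 0:
        return p, r, 0.0
    b2 = beta * beta
    return p, r, (1 + b2) * p * r / (b2 * p + r)


# Invented counts: 6 correct edits, 2 spurious edits, 6 missed edits.
p, r, f = f_beta(tp=6, fp=2, fn=6)
print(round(p, 4), round(r, 4), round(f, 4))
```

Here precision is 0.75 and recall is 0.5, and the resulting $F_{0.5}$ of about 0.68 sits closer to precision than the balanced $F_{1}$ of 0.6 would.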

### 4.3 Comparative Methods

We compare CoCoGEC with representative robustness and augmentation baselines. As backbones, we use GECToR-large (Omelianchuk et al., [2020](https://arxiv.org/html/2606.15069#bib.bib16)) for Seq2Edit, T5-large (Raffel et al., [2020](https://arxiv.org/html/2606.15069#bib.bib98)) for Seq2Seq, and Qwen3-8B for LLM-based GEC. CPR (Zhang et al., [2023](https://arxiv.org/html/2606.15069#bib.bib64)) improves contextual consistency via KL-divergence regularization. $\mathcal{DISCO}$ (Chen et al., [2023a](https://arxiv.org/html/2606.15069#bib.bib94)) distills counterfactual data with LLMs. TypeDA (Li and Lan, [2025](https://arxiv.org/html/2606.15069#bib.bib74)) augments training data with type-aware LLM annotation through masked modeling and error filling. We also include zero-shot GEC baselines with GPT-4o (Achiam et al., [2023](https://arxiv.org/html/2606.15069#bib.bib19)) and LLaMA3-8B as representative general-purpose LLMs.

### 4\.4 Implementation Details

We generate counterfactuals by chunking candidate spans with Flair \([https://github\.com/flairNLP/flair](https://github.com/flairNLP/flair)\) and enforcing edit constraints using gold edit sets extracted by ERRANT \([https://github\.com/chrisjbryant/errant](https://github.com/chrisjbryant/errant)\), followed by basic hygiene filtering \(length control, empty/degenerate removal, and de\-duplication\)\. We estimate the likelihood term $\log P(t \mid c)$ with an ensemble of GEC scorers, and implement fine\-tuning with LLaMA\-Factory and LoRA \(Zheng et al\., [2024](https://arxiv.org/html/2606.15069#bib.bib100); Hu et al\., [2022](https://arxiv.org/html/2606.15069#bib.bib101)\)\. Templates and remaining details are in Appendix [B\.3](https://arxiv.org/html/2606.15069#A2.SS3)\.
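The hygiene filtering step can be pictured with a minimal sketch. The length bounds, the definition of a "degenerate" sentence \(a single token repeated\), and the function name `hygiene_filter` are illustrative assumptions; the text above only names the three filter types.

```python
def hygiene_filter(pairs, min_tokens=3, max_tokens=128):
    """Keep (source, target) pairs that pass three basic checks.

    Assumed thresholds: min_tokens and max_tokens are hypothetical values.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # empty removal
        n_tokens = len(src.split())
        if n_tokens < min_tokens or n_tokens > max_tokens:
            continue  # length control
        if len(set(src.lower().split())) == 1:
            continue  # degenerate removal: one token repeated
        if (src, tgt) in seen:
            continue  # de-duplication
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


pairs = [
    ("He go to school .", "He goes to school ."),
    ("He go to school .", "He goes to school ."),  # exact duplicate
    ("", "x"),                                     # empty source
    ("the the the", "the"),                        # degenerate
    ("Hi .", "Hi ."),                              # too short
]
print(hygiene_filter(pairs))
```

Only the first pair survives in this example; the other four are each rejected by a different check.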

## 5 Results and Analyses

### 5\.1 Main Results on RobustGEC

Table [1](https://arxiv.org/html/2606.15069#S3.T1) reports the performance of CoCoGEC and data\-augmentation baselines on the three RobustGEC benchmarks\. We summarize three main observations:

- CoCoGEC improves robustness across different backbones\. Across Seq2Edit \(GECToR\-large\), Seq2Seq \(T5\-large\), and LLM\-based \(Qwen3\-8B\) backbones, CoCoGEC consistently improves $F_{0.5}$ on all three benchmarks\. For Qwen3\-8B in particular, CoCoGEC yields absolute $F_{0.5}$ gains of \+9\.9, \+11\.3, and \+20\.8 points on BEA\-19\*, CoNLL\-14\*, and TEM\-8\*, respectively, indicating more context\-invariant correction behavior, especially in long\-context scenarios\.
- CoCoGEC outperforms existing augmentation methods\. Compared with noise injection \(CPR\), distillation\-based augmentation \($\mathcal{DISCO}$\), and type\-aware augmentation \(TypeDA\), CoCoGEC attains the highest $F_{0.5}$ in every backbone–dataset pair in Table [1](https://arxiv.org/html/2606.15069#S3.T1)\. On the context\-perturbed TEM\-8\* benchmark in particular, it surpasses all baselines by a clear margin, suggesting that controllable context counterfactuals provide stronger training signals than random or loosely controlled perturbations, even when using fewer counterfactual instances\.
- CoCoGEC narrows the gap between compact models and large LLM baselines\. Although large LLMs \(e\.g\., GPT\-4o, LLaMA3\-8B, and Qwen3\-235B\) are strong zero\-shot baselines, they can still be sensitive to contextual perturbations\. When fine\-tuned with CoCoGEC, Qwen3\-8B matches or slightly surpasses these zero\-shot LLM baselines on RobustGEC perturbed subsets, suggesting that counterfactual, data\-centric optimization can be a parameter\-efficient alternative to simply scaling model size\.

### 5\.2 Breakdown by Perturbation Type

Since Section [5\.1](https://arxiv.org/html/2606.15069#S5.SS1) shows consistent gains across GECToR, T5, and Qwen3, we use GECToR\-large as a representative Seq2Edit backbone for a detailed robustness analysis on TEM\-8\. Table [2](https://arxiv.org/html/2606.15069#S3.T2) further decomposes performance into the unperturbed *Source* setting and the word\-level, sentence\-level, and combined perturbation settings\. Vanilla GECToR suffers the largest drop in $F_{0.5}$ under sentence\-level and combined perturbations, while word\-level perturbation alone is less harmful, suggesting that long\-range contextual shifts are the main source of brittleness\. Robustness\-oriented augmentations reduce this drop, and CoCoGEC achieves the smallest $F_{0.5}$ degradation in all settings while maintaining strong source performance, indicating that training with context\-decoupled counterfactuals stabilizes GEC in long\-context scenarios\.

### 5\.3 Ablation Studies

We ablate CoCoGEC on TEM\-8 with GECToR to disentangle the effects of counterfactual generation and revision\. As shown in Table [4](https://arxiv.org/html/2606.15069#S4.T4), adding only global context rewriting \(\+ GPT\-based $c$\) yields small but consistent improvements, while intra\-sentence variants $s'$ and inter\-sentence expansion $p \oplus q$ each bring larger gains in $F_{0.5}$ and robustness metrics\. The revision stage further improves performance: selecting a subset of counterfactuals with MI\-based ranking \($k=30\%$\) outperforms keeping all generated instances \($k=100\%$\) in both $F_{0.5}$ and CRS/P\-CRS \(defined in Appendix [B\.6](https://arxiv.org/html/2606.15069#A2.SS6)\)\. Selective filtering is therefore crucial, since keeping all instances can dilute the gains: the robustness improvements come from controllable span edits combined with selective filtering, rather than from simply enlarging the training set\.
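The selection step in the revision stage amounts to ranking candidates by a score and keeping the top $k$ percent. The sketch below assumes the MI\-based scores have already been computed and are passed in as plain floats; the function name `select_top_percent` and the ceiling rounding rule are assumptions, not details taken from the method description.

```python
def select_top_percent(candidates, scores, keep_percent=30):
    """Keep the highest-scoring keep_percent% of candidates.

    Assumption: the count is rounded up, and ties keep their original order.
    """
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    if not 0 < keep_percent <= 100:
        raise ValueError("keep_percent must be in (0, 100]")
    n_keep = -(-len(candidates) * keep_percent // 100)  # integer ceiling
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:n_keep]]


candidates = list("abcdefghij")
scores = [0.1, 0.9, 0.5, 0.7, 0.2, 0.3, 0.8, 0.4, 0.6, 0.0]
print(select_top_percent(candidates, scores))       # k = 30%
print(len(select_top_percent(candidates, scores, 100)))  # k = 100% keeps everything
```

Setting `keep_percent=100` reproduces the "keep all generated instances" condition that the ablation compares against.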

![Refer to caption](https://arxiv.org/html/2606.15069v1/x3.png)

Figure 3: Robustness to perturbation length on Qwen3\-8B on TEM\-8\*\. As context grows, the vanilla model degrades, whereas the CoCoGEC\-trained model keeps higher and more stable $F_{0.5}$, indicating better long\-range robustness\.

### 5\.4 Robustness to Perturbation Length and Number

To evaluate robustness under varying perturbation length and number, we use the LLM\-based Qwen3\-8B as the backbone\. We test on TEM\-8 with sentence\-level perturbations and increasing context length\. Specifically, for each error\-containing sentence, we concatenate $k$ preceding and $k$ following sentences \($k \in \{1, \ldots, 10\}$\) as additional context, and we further apply word\-level attacks at $m$ different positions \($m \in \{1, \ldots, 5\}$\) to vary the perturbation number\. We compare the vanilla Qwen3\-8B model with Qwen3\-8B trained with CoCoGEC\. Figure [3](https://arxiv.org/html/2606.15069#S5.F3) shows that the vanilla model degrades as the context window grows, whereas CoCoGEC maintains higher and more stable $F_{0.5}$, indicating reduced sensitivity to long\-range context\. Table [3](https://arxiv.org/html/2606.15069#S4.T3) reports results under word\-level perturbations with increasing attack number: CoCoGEC consistently achieves higher $F_{0.5}$, TR, and SR across attack budgets, showing stronger tolerance to error\-position shifts and denser error layouts\.
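The context\-length setting can be illustrated with a small helper that builds the evaluation input for a given $k$. This is a sketch under stated assumptions: the document is already split into sentences, sentences are joined with a single space, and the window is truncated at document boundaries. The name `with_context` is hypothetical.

```python
def with_context(sentences, idx, k):
    """Join the sentence at idx with up to k preceding and k following sentences.

    Assumption: the window is clipped at the start and end of the document.
    """
    start = max(0, idx - k)
    end = min(len(sentences), idx + k + 1)
    return " ".join(sentences[start:end])


doc = ["A .", "B .", "C .", "D .", "E ."]
for k in (1, 2):
    print(k, "->", with_context(doc, idx=2, k=k))

context_sizes = range(1, 11)  # k in {1, ..., 10}
attack_counts = range(1, 6)   # m in {1, ..., 5}
```

Sweeping `context_sizes` corresponds to the setting behind Figure 3, and sweeping `attack_counts` to the setting behind Table 3.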

## 6 Conclusion

We examined the robustness gap in grammatical error correction and proposed CoCoGEC, a contextual counterfactual generation framework that preserves intended corrections while varying the surrounding discourse\. Results on robustness\-oriented and standard GEC benchmarks indicate that CoCoGEC improves correction accuracy and markedly enhances resilience to word\-level and sentence\-level contextual variations, pointing to contextual counterfactual generation as an effective data\-centric approach to robust GEC\.

## Limitations

CoCoGEC currently relies on external large language models \(LLMs\) to generate span\-controlled counterfactuals\. This introduces a dependency on the particular LLM and prompting setup used for augmentation, and future work could explore lighter\-weight or fully self\-contained generators to further reduce this reliance\.

## Acknowledgement

The authors would like to thank the anonymous reviewers for their insightful comments\. This work is supported by the Chenguang Program of Shanghai Education Development Foundation and Shanghai Municipal Education Commission under Grant 24CGA26\.

## References

- J\. Achiam, S\. Adler, S\. Agarwal, et al\. \(2023\) GPT\-4 technical report\. arXiv preprint arXiv:2303\.08774\.
- A\. Akbik, T\. Bergmann, D\. Blythe, K\. Rasul, S\. Schweter, and R\. Vollgraf \(2019\) FLAIR: an easy\-to\-use framework for state\-of\-the\-art NLP\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics \(Demonstrations\), pp\. 54–59\.
- D\. An, F\. Wang, J\. Lu, and S\. Zhang \(2025\) Self\-explaining counterfactual data augmentation for nlp\. In 2025 IEEE 6th International Seminar on Artificial Intelligence, Networking and Information Technology \(AINIT\), pp\. 1–6\.
- A\. Awasthi, S\. Sarawagi, R\. Goyal, S\. Ghosh, and V\. Piratla \(2019\) Parallel iterative edit models for local sequence transduction\. In Proceedings of EMNLP\-IJCNLP\.
- S\. Bae, Y\. Choi, H\. Kim, and J\. Lee \(2025\) Salad: improving robustness and generalization through contrastive learning with structure\-aware and llm\-driven augmented data\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 12724–12738\.
- C\. Bryant, M\. Felice, and T\. Briscoe \(2017a\) Automatic annotation and evaluation of error types for grammatical error correction\. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 793–805\.
- C\. Bryant, M\. Felice, and T\. Briscoe \(2017b\) Automatic annotation and evaluation of grammatical error correction\. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp\. 127–136\.
- M\. Chang, M\. Yang, Q\. Jiang, and R\. Xu \(2024\) Counterfactual\-enhanced information bottleneck for aspect\-based sentiment analysis\. Proceedings of the AAAI Conference on Artificial Intelligence, pp\. 17736–17744\.
- H\. Chen, R\. Xia, and J\. Yu \(2021\) Reinforced counterfactual data augmentation for dual sentiment classification\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp\. 269–278\.
- J\. Chen, L\. Song, M\. J\. Wainwright, and M\. I\. Jordan \(2018\) Learning to explain: an information\-theoretic perspective on model interpretation\. In Proceedings of the 35th International Conference on Machine Learning\.
- X\. Chen, T\. Gao, and A\. Bosselut \(2023a\) DISCO: distilling counterfactuals with large language models\. In Proceedings of EMNLP\.
- Z\. Chen, Q\. Gao, A\. Bosselut, A\. Sabharwal, and K\. Richardson \(2023b\) DISCO: distilling counterfactuals with large language models\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 5514–5528\.
- Z\. Chen, L\. Wang, Y\. Wu, X\. Liao, Y\. Tian, and J\. Zhong \(2024\) An effective deployment of diffusion lm for data augmentation in low\-resource sentiment classification\. arXiv preprint arXiv:2409\.03203\.
- S\. Choi, M\. Jeong, H\. Han, and S\. Hwang \(2022\) C2l: causally contrastive learning for robust text classification\. In Proceedings of the AAAI conference on artificial intelligence, pp\. 10526–10534\.
- S\. Coyne, K\. Sakaguchi, D\. Galvan\-Sosa, M\. Zock, and K\. Inui \(2023\) Analyzing the performance of GPT\-3\.5 and GPT\-4 in grammatical error correction\. arXiv preprint arXiv:2303\.14342\.
- K\. Dang, J\. Xie, and J\. Liu \(2021\) Leveraging adversarial training to facilitate grammatical error correction\. In Artificial Neural Networks and Machine Learning – ICANN 2021 \(LNCS 12891–12895\), Part I, pp\. 67–78\.
- T\. Fang, X\. Liu, D\. F\. Wong, R\. Zhan, L\. Ding, L\. S\. Chao, D\. Tao, and M\. Zhang \(2023\) TransGEC: improving grammatical error correction with translationese\. In Findings of the Association for Computational Linguistics: ACL 2023, pp\. 3614–3633\.
- P\. Howard, G\. Singer, V\. Lal, Y\. Choi, and S\. Swayamdipta \(2022\) NeuroCounterfactuals: beyond minimal\-edit counterfactuals for richer data augmentation\. In Findings of EMNLP, pp\. 5056–5072\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\.
- J\. Jiang, F\. Leofante, A\. Rago, and F\. Toni \(2024\) Robust counterfactual explanations in machine learning: a survey\. arXiv:2402\.01928\.
- M\. Junczys\-Dowmunt, R\. Grundkiewicz, S\. Guha, and K\. Heafield \(2018\) Approaching neural grammatical error correction as a low\-resource machine translation task\. In Proceedings of NAACL\-HLT\.
- A\. Katinskaia and R\. Yangarber \(2023\) Grammatical error correction for sentence\-level assessment in language learning\. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2023\), pp\. 488–502\.
- A\. Katinskaia and R\. Yangarber \(2024\) GPT\-3\.5 for grammatical error correction\. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\), pp\. 7831–7843\.
- R\. Kovalchuk, M\. Romanyshyn, and P\. Ivaniuk \(2025\) Introducing omnigec: a silver multilingual dataset for grammatical error correction\. arXiv preprint arXiv:2509\.14504\.
- W\. Li, W\. Luo, G\. Peng, and H\. Wang \(2025\) Explanation based in\-context demonstrations retrieval for multilingual grammatical error correction\. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 4881–4897\.
- W\. Li and H\. Wang \(2024\) Detection\-correction structure via general language model for grammatical error correction\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 1748–1763\.
- X\. Li and Y\. Lan \(2025\) Large language models are good annotators for type\-aware data augmentation in grammatical error correction\. In Proceedings of the 31st International Conference on Computational Linguistics, pp\. 199–213\.
- Y\. Li, X\. Liu, S\. Wang, P\. Gong, D\. F\. Wong, Y\. Gao, H\. Huang, and M\. Zhang \(2023\) TemplateGEC: improving grammatical error correction with detection template\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 6878–6892\.
- J\. Lichtarge, C\. Alberti, S\. Kumar, N\. Shazeer, N\. Parmar, and S\. Tong \(2019\) Corpora generation for grammatical error correction\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), pp\. 3291–3301\.
- M\. Loem, M\. Kaneko, S\. Takase, and N\. Okazaki \(2023\) Exploring effectiveness of GPT\-3 in grammatical error correction: a study on performance and controllability in prompt\-based methods\. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2023\), pp\. 205–219\.
- K\. Omelianchuk, V\. Atrasevych, A\. Chernodub, and O\. Skurzhanskyi \(2020\) GECToR – grammatical error correction: tag, not rewrite\. In Proceedings of BEA Workshop at ACL, pp\. 163–170\.
- C\. Park, S\. Koo, S\. Lee, J\. Seo, S\. Eo, H\. Moon, and H\. Lim \(2023\) Synthetic alone: exploring the dark side of synthetic data for grammatical error correction\. CoRR\.
- M\. Plyler and M\. Chi \(2025\) Iterative counterfactual data augmentation\. In Proceedings of the AAAI Conference on Artificial Intelligence\.
- M\. R\. Qorib and H\. T\. Ng \(2023\) System combination via quality estimation for grammatical error correction\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 12746–12759\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, and I\. Sutskever \(2019\) Language models are unsupervised multitask learners\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\) Exploring the limits of transfer learning with a unified text\-to\-text transformer\. Journal of Machine Learning Research, pp\. 1–67\.
- S\. Rothe, J\. Mallinson, E\. Malmi, S\. Krause, and A\. Severyn \(2021\) A simple recipe for multilingual grammatical error correction\. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 2: Short Papers\), pp\. 702–707\.
- A\. Solyman, M\. Zappatore, Z\. Wang, et al\. \(2023\) Optimizing the impact of data augmentation for low\-resource grammatical error correction\. Journal of King Saud University – Computer and Information Sciences\.
- K\. Sreedhar, T\. Kavya, J\. Prasad, and V\. Varshini \(2025\) A novel metric\-based counterfactual data augmentation with self\-imitation reinforcement learning \(sil\)\. International Journal of Advanced Computer Science & Applications\.
- F\. Stahlberg and S\. Kumar \(2020\) Sequence transduction using span\-level edit operations\. In Proceedings of EMNLP\.
- F\. Stahlberg and S\. Kumar \(2021\) Synthetic data generation for grammatical error correction with tagged corruption models\. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pp\. 37–47\.
- J\. Sun, W\. Peng, Z\. Xu, S\. Wang, and J\. Song \(2023\) Incorporating syntactic cognitive in multi\-granularity data augmentation for chinese grammatical error correction\. In International Conference on Neural Information Processing\.
- C\. Tang, F\. Qu, and Y\. Wu \(2024\) Ungrammatical\-syntax\-based in\-context example selection for grammatical error correction\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 1758–1770\.
- Z\. Tang, K\. Qi, J\. Li, and M\. Zhang \(2023\) Beyond hard samples: robust and effective grammatical error correction with cycle self\-augmenting\. CoRR\.
- M\. Treviso, A\. Ross, N\. M\. Guerreiro, and A\. F\. Martins \(2023\) CREST: a joint framework for rationalization and counterfactual text generation\. arXiv preprint arXiv:2305\.17075\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. In Advances in Neural Information Processing Systems, pp\. 5998–6008\.
- S\. Verma, V\. Boonsanong, M\. Hoang, K\. Hines, J\. Dickerson, and C\. Shah \(2024\) Counterfactual explanations and algorithmic recourses for machine learning: a review\. ACM Computing Surveys\.
- Z\. Wan, X\. Wan, and W\. Wang \(2020\) Improving grammatical error correction with data augmentation by editing latent representation\. In Proceedings of the 28th International Conference on Computational Linguistics, pp\. 2202–2212\.
- K\. Wang and X\. Wan \(2022\) Counterfactual representation augmentation for cross\-domain sentiment analysis\. IEEE Transactions on Affective Computing, pp\. 1979–1990\.
- Y\. Wang, B\. Wang, Y\. Liu, Q\. Zhu, D\. Wu, and W\. Che \(2024a\) Improving grammatical error correction via contextual data augmentation\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 10898–10910\.
- Y\. Wang, X\. Qiu, Y\. Yue, X\. Guo, Z\. Zeng, Y\. Feng, and Z\. Shen \(2024b\) A survey on natural language counterfactual generation\. arXiv:2407\.03993, [Link](https://arxiv.org/abs/2407.03993)\.
- Y\. Wang, X\. Qiu, Y\. Yue, X\. Guo, Z\. Zeng, Y\. Feng, and Z\. Shen \(2024c\) A survey on natural language counterfactual generation\. arXiv preprint arXiv:2407\.03993\.
- P\. Xia, Y\. Zhou, Z\. Zhang, Z\. Tang, and J\. Li \(2022\) Chinese grammatical error correction based on knowledge distillation\. arXiv preprint arXiv:2208\.00351\.
- D\. Xin, J\. Yuan, and Y\. Li \(2024\) Diffusion based counterfactual augmentation for dual sentiment classification\. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\), pp\. 4901–4911\.
- H\. Yang, S\. Hwang, and J\. So \(2024\) Relation\-based counterfactual data augmentation and contrastive learning for robustifying natural language inference models\. arXiv preprint arXiv:2410\.20710\.
- Y\. Yang \(2017\) Test for english majors\-band 8 \(tem8\) in china\. Journal of Language Teaching and Research, pp\. 1229–1233\.
- J\. Ye, Z\. Xu, Y\. Li, X\. Cheng, L\. Song, Q\. Zhou, H\. Zheng, Y\. Shen, and X\. Su \(2024\) CLEME2\.0: towards more interpretable evaluation by disentangling edits for grammatical error correction\. arXiv preprint arXiv:2407\.00934\.
- J\. Ye, Y\. Li, Y\. Li, and H\. Zheng \(2023\) MixEdit: revisiting data augmentation and beyond for grammatical error correction\. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp\. 10161–10175\.
- P\. Youssef, C\. Seifert, J\. Schlötterer, et al\. \(2024\) LLMs for generating and evaluating counterfactuals: a comprehensive study\. In Findings of the association for computational linguistics: EMNLP 2024, pp\. 14809–14824\.
- D\. Zhang, Y\. Li, L\. Bai, H\. Zhang, Y\. Li, H\. Lin, H\. Zheng, X\. Su, and Z\. Shan \(2025\) Loss\-aware curriculum learning for chinese grammatical error correction\. In ICASSP 2025 \- 2025 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\)\.
- Y\. Zhang, L\. Cui, E\. Zhao, W\. Bi, and S\. Shi \(2023\) RobustGEC: robust grammatical error correction against subtle context perturbation\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 16780–16793\.
- Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, Z\. Luo, and Z\. Ma \(2024\) LlamaFactory: unified efficient fine\-tuning of 100\+ language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\), pp\. 156–166\.
- X\. Zhou, O\. Wu, and M\. K\. Ng \(2023\) Implicit counterfactual data augmentation for robust learning\. arXiv preprint arXiv:2304\.13431\.

## Appendix A Prompt Templates for Counterfactual Generation

We instantiate our span\-edit template over the training corpus to create a set of prompt–sentence pairs, denoted `gec_examples_with_span`, for eliciting span\-level edits from a large language model \(LLM\)\. All prompts described in this section operate in a *sentence* mode: given a full sentence, the LLM is asked either to fill a masked span or to produce additional surrounding context\. This section summarizes the templates used in our counterfactual generation pipeline\.

### A\.1 Intra\-sentential Editing

#### Global instruction\.

All span\-edit prompts follow the “Span\-edit prompt \(sentence mode\)” template below and differ only in the concrete \{sentence\} and span positions\.

#### Unified span\-edit template\.

We use a single template for replacement, deletion, and insertion\. The selected span \(or insertion position\) is masked with \[BLANK\], and the LLM is required to output only the content that should fill \[BLANK\]\. An empty prediction corresponds to a deletion\. For insertion, we treat the insertion position as a zero\-length span between two tokens and apply the same template\.

Span\-edit prompt \(sentence mode\)

Task Instruction: Fill in the \[BLANK\] with a word or phrase, or leave it empty\.

- Follow standard English grammar\.
- No grammatical errors are allowed\.
- Do not copy from the original sentence\.
- The final sentence must be logically correct and sound natural to native speakers\.
- Output *only* the content that should replace \[BLANK\]\. Do *not* output the full sentence\.

Sentence: \{sentence\}

\[BLANK\] should be:
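The masking and filling steps can be sketched as two small helpers. The sketch assumes whitespace\-tokenized sentences and half\-open token indices `[start, end)`; the function names `mask_span` and `apply_fill` are hypothetical. A zero\-length span \(`start == end`\) marks an insertion point, and an empty fill deletes the span.

```python
BLANK = "[BLANK]"


def mask_span(tokens, start, end):
    """Replace tokens[start:end] with [BLANK]; start == end marks an insertion point."""
    return " ".join(tokens[:start] + [BLANK] + tokens[end:])


def apply_fill(tokens, start, end, fill):
    """Splice the model's fill into the sentence; an empty fill deletes the span."""
    return " ".join(tokens[:start] + fill.split() + tokens[end:])


tokens = "She has went to the old market yesterday".split()

print(mask_span(tokens, 5, 6))              # replacement or deletion of "old"
print(apply_fill(tokens, 5, 6, "crowded"))  # replacement
print(apply_fill(tokens, 5, 6, ""))         # deletion
print(mask_span(tokens, 6, 6))              # insertion point before "market"
```

In this example the error \("has went"\) lies outside the edited span, so every variant keeps the original error while changing only error\-irrelevant context.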

### A\.2 Inter\-sentential Expansion for Context Augmentation

We perform inter\-sentential expansion by generating a fluent context before and after an input sentence\. Given a training sample with source sentence $s$ and its correction $t$, we use an instruction\-following LLM to generate a prefix $p$ and a suffix $q$ such that $p$ naturally leads into $s$ and $q$ naturally follows $s$, while keeping $s$ unchanged\. The expanded source and target are constructed as $p \oplus s \oplus q$ and $p \oplus t \oplus q$, respectively\. To improve diversity, we randomly sample the number of prefix/suffix sentences \(e\.g\., 3–5\) for each sample\. If the LLM output cannot be parsed into the required two\-block format or violates constraints \(e\.g\., repeating $s$\), we fall back to the original sample without expansion\.

Inter\-sentential expansion prompt \(prefix/suffix\)

You are given a sentence S in English\. Your task is to write additional context BEFORE and AFTER S\.

RULES:

1\) In the `<<<PREFIX>>>` block, write about \{n\_pre\} fluent English sentences that smoothly lead into S\.

2\) In the `<<<SUFFIX>>>` block, write about \{n\_suf\} fluent English sentences that naturally follow S\.

3\) Do NOT change S at all\. Keep it EXACTLY AS\-IS\.

4\) Do NOT include S itself in the blocks\.

5\) Write only in English\.

6\) Output EXACTLY TWO blocks:

`<<<PREFIX>>>`

\.\.\.text before S\.\.\.

`<<<SUFFIX>>>`

\.\.\.text after S\.\.\.

S: \{sentence\}
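Parsing the two\-block output and building the expanded pair, with the fallback described in this section, might look like the following sketch. It assumes the markers appear exactly once and in order, joins blocks with single spaces, and treats an empty block as a constraint violation; these choices and the name `expand_pair` are assumptions for illustration.

```python
import re

# PREFIX block, then SUFFIX block, each possibly spanning several lines
_BLOCKS = re.compile(r"<<<PREFIX>>>(.*?)<<<SUFFIX>>>(.*)", re.DOTALL)


def expand_pair(src, tgt, llm_output):
    """Return (p + s + q, p + t + q), or the original (s, t) if parsing fails."""
    match = _BLOCKS.search(llm_output)
    if match is None:
        return src, tgt  # not in the required two-block format
    prefix, suffix = match.group(1).strip(), match.group(2).strip()
    if not prefix or not suffix:
        return src, tgt  # assumed violation: an empty block
    if src in prefix or src in suffix:
        return src, tgt  # violation: the blocks repeat S
    return f"{prefix} {src} {suffix}", f"{prefix} {tgt} {suffix}"


output = "<<<PREFIX>>>\nIt was raining.\n<<<SUFFIX>>>\nWe stayed home."
print(expand_pair("He go out .", "He went out .", output))
print(expand_pair("He go out .", "He went out .", "unparseable text"))
```

Because the same prefix and suffix are attached to both source and target, the gold edit between $s$ and $t$ is unchanged by the expansion.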

### A\.3 LLM\-based Correction Prompt

For instruction\-following LLMs used as correction models \(or LLM\-based scorers\), we adopt the following sentence\-level GEC prompt:

Grammar error correction prompt \(GEC\)

You are an experienced English teacher who specializes in grammatical error correction \(GEC\)\. You are given exactly one sentence as input\. Correct it with the fewest possible edits \(minimal edit distance\)\.

Requirements:

1\) Correct grammar, spelling, and word choice only\.

2\) Keep the original structure and meaning; do not paraphrase or reorder\.

3\) If the sentence is already correct, return it unchanged\.

4\) Output format: return exactly one sentence on a single line, with no explanations or extra text\.

Original: \{sentence\}

Corrected:

## Appendix B Additional Experimental Details

### B\.1 Dataset Statistics

Table [5](https://arxiv.org/html/2606.15069#A2.T5) reports corpus statistics for each split, including sentence counts, source\-token volumes, and error\-type breakdowns\.

Table 5: Dataset statistics across splits\. Errorful % is the percentage of examples with at least one ERRANT edit between \(src, tgt\)\. Sentence counts are reported using spaCy segmentation and a rule\-based heuristic \(punctuation/newline boundaries\)\.

### B\.2 Gold Validity Check

After enforcing the edit\-subset constraint \($E' \subseteq E$\), we apply a lightweight gold\-validity check to ensure that the revised target $t'$ remains valid for the perturbed source $c'$\. We discard candidates if \(i\) $t'$ fails automatic grammaticality checking, \(ii\) a frozen external GEC verifier further edits $t'$, or \(iii\) $t'$ shows obvious semantic drift\.
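The edit\-subset constraint $E' \subseteq E$ can be checked with a multiset comparison once edits are reduced to a position\-independent form. The representation below \(original text, correction, error type\) is an assumption chosen because token offsets shift when the context changes; the `Edit` type and the function name are hypothetical, and in practice the edits would come from an extractor such as ERRANT.

```python
from collections import Counter
from typing import NamedTuple


class Edit(NamedTuple):
    original: str    # erroneous text in the source
    correction: str  # replacement text in the target
    error_type: str  # e.g. an ERRANT-style type label


def satisfies_edit_subset(new_edits, gold_edits):
    """True if every edit in E' also occurs in E, counting repeated edits."""
    extra = Counter(new_edits) - Counter(gold_edits)  # keeps only surplus edits
    return not extra


gold = [Edit("go", "goes", "R:VERB:SVA"), Edit("a", "an", "R:DET")]

print(satisfies_edit_subset([gold[0]], gold))                              # subset
print(satisfies_edit_subset(gold + [Edit("in", "on", "R:PREP")], gold))    # new error introduced
```

A counterfactual that introduces an edit absent from the gold set fails the check, which matches the intent that perturbations touch only error\-irrelevant context.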

#### Quality\-control pipeline\.

We further apply a filtering pipeline to ensure that generated contexts are valid distractors without introducing new errors or leaking the correction\.

#### Grammaticality check \(ERRANT\)\.

Using the same ERRANT\-based edit extraction pipeline as in the main experiments, we verify that generated prefixes and suffixes introduce no additional grammatical errors\. Most generated contexts pass this check, yielding a Context Robustness Score \(CRS\) of 99\.6%\.

#### Fluency and coherence filtering\.

We use a perplexity filter to remove non\-fluent generations and a semantic\-similarity constraint to filter severe semantic drift, ensuring natural and coherent augmented contexts\.

#### Leakage prevention\.

We remove candidates that may directly reveal the target correction, reducing reliance on accidental lexical cues\.
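One simple way to detect such leakage is to check whether a generated context contains, verbatim, any token span that the correction introduces. The sketch below is an assumption about how this could be done, not the paper's stated rule: it aligns source and target with the standard-library `difflib.SequenceMatcher`, uses whitespace tokenization and lowercasing, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def corrected_spans(source: List[str], target: List[str]) -> List[List[str]]:
    """Token spans that the target introduces via replacement or insertion."""
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    spans = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            spans.append(target[j1:j2])
    return spans


def contains_span(tokens: List[str], span: List[str]) -> bool:
    """True if `span` occurs as a contiguous subsequence of `tokens`."""
    n = len(span)
    return any(tokens[k:k + n] == span for k in range(len(tokens) - n + 1))


def leaks_correction(source: str, target: str, context: str) -> bool:
    """Flag a generated context that repeats a corrected span verbatim."""
    src, tgt, ctx = (s.lower().split() for s in (source, target, context))
    return any(contains_span(ctx, span) for span in corrected_spans(src, tgt))
```

A verbatim match is a coarse criterion: single function-word corrections (for example an article) would be flagged very often, so a real filter would likely need a stricter notion of leakage, such as matching the corrected span together with its neighbouring tokens.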

#### Manual verification\.

Manual inspection of sampled cases shows that the retained counterfactuals generally preserve the original error pattern while varying only the surrounding context, supporting the validity of our pipeline\.

| Parameter | Value |
| --- | --- |
| Target modules | all |
| LoRA rank | 16 |
| Learning rate | $5\times 10^{-5}$ |
| Training epochs | 2 |
| Per\-device batch size | 2 |
| Gradient accumulation steps | 8 |
| Warmup ratio | 0\.1 |
| Max sequence length | 1024 |
| Scheduler | cosine |
| Precision | bf16 |

Table 6: LoRA configuration for fine\-tuning Qwen3\-8B\.

### B\.3 Hyperparameters

For the LLM\-based scorer, we fine\-tune Qwen3\-8B with LoRA \(Hu et al\., [2022](https://arxiv.org/html/2606.15069#bib.bib101)\) using LLaMA\-Factory \(Zheng et al\., [2024](https://arxiv.org/html/2606.15069#bib.bib100)\); the LoRA configuration is summarized in Table [6](https://arxiv.org/html/2606.15069#A2.T6)\.

We also train and evaluate GECToR\-large and T5\-large backbones\. For GECToR\-large, we follow the official implementation and training hyperparameters used in RobustGEC \(Zhang et al\., [2023](https://arxiv.org/html/2606.15069#bib.bib64)\)\. For T5\-large, we follow the text\-to\-text training recipe of Rothe et al\. \([2021](https://arxiv.org/html/2606.15069#bib.bib103)\) and use the public gec\-t5 implementation \([https://github\.com/gotutiyan/gec\-t5](https://github.com/gotutiyan/gec-t5)\)\. For $\mathcal{DISCO}$ and TypeDA, we implement the methods as described in the original papers and run them with the authors’ recommended hyperparameters and filtering rules, without additional tuning beyond adapting them to our GEC data\. Unless otherwise specified, all baseline settings are kept identical to the referenced implementations\.
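For reference, the Table 6 values imply an effective batch size of 2 × 8 = 16 sequences per device per optimizer step. The short sketch below makes that arithmetic explicit. The dictionary keys are descriptive names chosen for this example and are not claimed to be LLaMA-Factory option names; the number of devices is not reported in this section, so it is left as a parameter.

```python
# Values copied from Table 6; key names are illustrative, not library options.
lora_config = {
    "target_modules": "all",
    "lora_rank": 16,
    "learning_rate": 5e-5,
    "training_epochs": 2,
    "per_device_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "warmup_ratio": 0.1,
    "max_sequence_length": 1024,
    "scheduler": "cosine",
    "precision": "bf16",
}


def effective_batch_size(config: dict, num_devices: int = 1) -> int:
    """Sequences contributing to one optimizer step across all devices."""
    return (
        config["per_device_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(lora_config))  # 16 for a single device
```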

Table 7: Evaluating GECToR\-large with and without CoCoGEC on its original BEA\-19 and CoNLL\-14 benchmarks\.

### B\.4 Standard GEC Performance on Original Benchmarks

Table [7](https://arxiv.org/html/2606.15069#A2.T7) reports standard GEC results on the original BEA\-19 dev/test and CoNLL\-14 test sets\. Compared with the vanilla GECToR backbone, the CoCoGEC\-augmented model attains similar or slightly higher $F_{0.5}$ scores on all three benchmarks \(up to \+3\.7 on CoNLL\-14\), indicating that improving robustness on RobustGEC does not degrade conventional GEC performance\.

### B\.5 Supplementary Cross\-Lingual Results on VisCGEC

Following the reviewer suggestion on cross\-lingual and cross\-domain validation, we additionally evaluate CoCoGEC on the Chinese VisCGEC benchmark\. Results are shown in Table [8](https://arxiv.org/html/2606.15069#A2.T8)\. We follow the official VisCGEC evaluation protocol and compare the baseline and CoCoGEC\-augmented models under the same training and decoding settings, without additional model\-specific tuning\.

Table 8: Supplementary results on the Chinese VisCGEC benchmark\.

As shown in Table [8](https://arxiv.org/html/2606.15069#A2.T8), CoCoGEC also improves $F_{0.5}$ on VisCGEC, suggesting that the proposed context\-decoupled counterfactual augmentation is not limited to the English RobustGEC setting and can generalize to a different language and benchmark\.

### B\.6 Context Robustness Metrics: CRS and P\-CRS

Following RobustGEC \(Zhang et al\., [2023](https://arxiv.org/html/2606.15069#bib.bib64)\), we report two context\-robustness metrics: Context Robustness Score \(CRS\) and Pair\-wise Context Robustness Score \(P\-CRS\)\. Each GEC case contains one original sentence and a set of context\-perturbed variants\. CRS measures strict stability: it counts a case as correct only if the model outputs exactly identical corrections for *all* variants within the same case,

$$\mathrm{CRS}=\frac{\#\mathrm{Case}_{C}}{\#\mathrm{Case}_{T}}. \tag{2}$$

P\-CRS is more lenient and evaluates stability at the level of original $\Leftrightarrow$ perturbed pairs,

$$\mathrm{P\mbox{-}CRS}=\frac{\#\mathrm{P\mbox{-}sample}_{C}}{\#\mathrm{P\mbox{-}sample}_{T}}. \tag{3}$$

For example, if a case has one original and five perturbed variants, of which four share the same correction as the original, then CRS is $0$ while P\-CRS is $4/5$\.
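The two metrics can be sketched in a few lines, reproducing the worked example above. This is a minimal illustration, not the RobustGEC evaluation script. It assumes each correction is represented by a hashable object (here a `frozenset` of edits) so that corrections can be compared for equality across differing contexts, and it reads the subscripts $C$ and $T$ as the consistent and total counts; both readings are assumptions of this sketch.

```python
from typing import Hashable, List, Sequence, Tuple

# One case = (correction for the original, corrections for its perturbed variants).
Case = Tuple[Hashable, Sequence[Hashable]]


def crs(cases: List[Case]) -> float:
    """Fraction of cases in which every variant matches the original correction."""
    consistent = sum(
        1
        for original, variants in cases
        if all(variant == original for variant in variants)
    )
    return consistent / len(cases)


def p_crs(cases: List[Case]) -> float:
    """Fraction of (original, variant) pairs whose corrections match."""
    pairs = [(original, variant) for original, variants in cases for variant in variants]
    consistent = sum(1 for original, variant in pairs if original == variant)
    return consistent / len(pairs)


# Worked example: one original, five variants, four of which agree with it.
good = frozenset({("an", "a")})   # edit set matching the original correction
bad = frozenset()                 # variant where the model made no edit
example = [(good, [good, good, good, good, bad])]

print(crs(example))    # 0.0
print(p_crs(example))  # 0.8
```

A single disagreeing variant zeroes out the whole case under CRS but costs only one of five pairs under P-CRS, which is why the latter is the more lenient metric.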

Table 9: CRS and P\-CRS on TEM\-8\. CoCoGEC improves robustness over vanilla GECToR and T5, while the effects on LLM\-based baselines are mixed\.

### B\.7 CRS and P\-CRS on TEM\-8

Table [9](https://arxiv.org/html/2606.15069#A2.T9) summarizes CRS and P\-CRS on TEM\-8\*\. For the Seq2Edit \(GECToR\) and Seq2Seq \(T5\) backbones, CoCoGEC improves robustness over the corresponding vanilla models\. For GECToR, $\mathcal{DISCO}$ yields the highest CRS/P\-CRS among the training variants, while CoCoGEC still provides a clear gain over the vanilla baseline\. For Qwen3, CoCoGEC gives a modest improvement in both CRS and P\-CRS, whereas other recipes exhibit trade\-offs \(e\.g\., CPR increases P\-CRS but lowers CRS\), suggesting that stability improvements for LLM\-based correction may be sensitive to the training recipe\.

### B\.8 Case Study

We present representative long\-context cases to qualitatively compare model behaviors\. For each case, we report the original input, the expanded input, the gold correction, and model predictions from different systems \(e\.g\., GECToR, T5, Qwen3, ChatGPT, and LLaMA\)\.

Table 10: Line\-by\-line case study of a contextual counterfactual pair $(s, s')$ for “an sad person”\. Red bold spans mark residual errors relative to the gold targets, while dark\-green bold spans highlight canonical corrections\.