Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate

arXiv cs.AI 05/19/26, 04:00 AM Papers
Summary
This paper introduces TIDE, a novel framework that integrates trial and debate mechanisms to improve criteria-based prompt optimization for argumentative essay understanding tasks such as automated essay scoring, argument component detection, and argument relation identification. Experiments show performance improvements, highlighting the potential of combining prompt-based methods for robust argument analysis.
arXiv:2605.17247v1 Announce Type: new Abstract: Argumentative essays serve as a vital medium for assessing critical thinking and reasoning skills, yet there is limited work on accurately understanding and evaluating such texts via prompting. In this work, we propose TIDE, a novel framework designed to improve criteria-based prompt optimization for argument-related tasks by integrating TrIal and DEbate mechanisms. Our method addresses key limitations of criteria-based prompt optimization by mitigating the influence of noisy training data and enhancing optimization stability. We evaluate TIDE on three core tasks: Automated Essay Scoring, Argument Component Detection, and Argument Relation Identification. Results demonstrate that our framework improves performance across tasks. These findings underscore the potential of combining prompt-based methods for advanced argument understanding.
# Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate
Source: [https://arxiv.org/html/2605.17247](https://arxiv.org/html/2605.17247)
Zheqin Yin¹, Yupei Ren¹˒²˒³, Yadong Zhang¹, Yujiang Lu¹, Man Lan¹˒²˒³

¹School of Computer Science and Technology, East China Normal University; ²Shanghai Institute of Artificial Intelligence for Education, East China Normal University; ³Lab of Artificial Intelligence for Education, East China Normal University

zqyin@stu.ecnu.edu.cn, mlan@cs.ecnu.edu.cn


Corresponding author: Man Lan.

## 1 Introduction

Argumentative essays, as a genre of academic writing, serve as tangible artifacts that reflect one's ability to construct, articulate, and defend coherent arguments Drury et al. ([2019](https://arxiv.org/html/2605.17247#bib.bib11)); Ulfa and Purwati ([2023](https://arxiv.org/html/2605.17247#bib.bib34)). Understanding and evaluating argumentative essays, i.e., conducting argument mining, not only provides a window into the study of argumentative thinking but also offers a practical pathway for promoting the development of this crucial cognitive skill Lu ([2021](https://arxiv.org/html/2605.17247#bib.bib23)); Mombaers et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib25)).

In recent years, the predominant approaches in argument mining have focused on training pretrained language models or fine-tuning large language models (LLMs), both of which have demonstrated strong performance on this task Favero et al. ([2025](https://arxiv.org/html/2605.17247#bib.bib14)); Wang et al. ([2020](https://arxiv.org/html/2605.17247#bib.bib36)). However, current research in this field has seen limited exploration of novel frameworks built upon prompt-based methods, despite their advantages in terms of simplicity and flexibility. With the emergence of advanced reasoning models such as DeepSeek-R1 DeepSeek-AI et al. ([2025](https://arxiv.org/html/2605.17247#bib.bib8)), prompt-based approaches are becoming an increasingly promising alternative, warranting deeper investigation within the context of argument mining.

Recent studies demonstrate that criteria-based prompt optimization methods, as illustrated in Figure [1(a)](https://arxiv.org/html/2605.17247#S1.F1.sf1), can achieve optimization through implicit task signals in training data without directly modifying the prompt text of LLMs Yang et al. ([2023a](https://arxiv.org/html/2605.17247#bib.bib38)); Liu et al. ([2023](https://arxiv.org/html/2605.17247#bib.bib22)). While this approach aims to emulate human-like abstraction of inferential rules Barwise ([1993](https://arxiv.org/html/2605.17247#bib.bib3)) from observations, existing methods often suffer from critical limitations. Specifically, this approach refines the initial criteria in a gradient-free, iterative manner, which lacks performance guarantees and may be overly sensitive to noisy or unrepresentative samples. To address these challenges, we propose criteria optimizing with TrIal and DEbate (TIDE), a novel framework that leverages Randomized Trial Selection and Debate to improve the understanding of argumentative essays. An overview of TIDE can be found in Figure [1(b)](https://arxiv.org/html/2605.17247#S1.F1.sf2). First, to mitigate the adverse influence of noisy or unrepresentative training data, we introduce a Debate process that allows the current criteria to "defend" themselves. Furthermore, to enhance the quality of each refinement step, we introduce a Randomized Trial Selection mechanism, which explores multiple candidate updates and selects the most promising one, thereby improving both the stability and the convergence of the optimization process.

![Refer to caption](https://arxiv.org/html/2605.17247v1/x1.png) (a) The process of criteria-based prompt optimization.
![Refer to caption](https://arxiv.org/html/2605.17247v1/x2.png) (b) Our proposed framework TIDE.

Figure 1: Overview of criteria-based prompt optimization (Figure [1(a)](https://arxiv.org/html/2605.17247#S1.F1.sf1)) and our proposed framework TIDE (Figure [1(b)](https://arxiv.org/html/2605.17247#S1.F1.sf2)), where Debate and Randomized Trial Selection are employed to enhance the optimization process.

We evaluate TIDE across three representative tasks: Automated Essay Scoring (AES), Argument Component Detection (ACD), and Argument Relation Identification (ARI). The results demonstrate the effectiveness of our framework. We find that task complexity should guide the configuration of these components: simpler tasks like AES benefit from minimal refinement but need more debate, while more complex tasks like ARI require deeper iterative processing for optimal outcomes. Furthermore, by comparing DeepSeek-R1 and DeepSeek-V3 DeepSeek-AI et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib9)), we show that the reasoning ability of the base model is essential for fully realizing the potential of TIDE with DeepSeek-series models. Substituting these models with weaker alternatives results in substantial performance degradation across all tasks. This work aims to provide a comprehensive understanding of the framework's capabilities and its implications for improving the understanding and evaluation of argumentative essays. In summary, our main contributions are as follows:

- We conduct a systematic study of o1-like models, specifically DeepSeek-R1, in understanding human argumentative thinking, grounded in three representative tasks: AES, ACD, and ARI.
- We propose a novel prompt optimization framework, TIDE, which integrates Randomized Trial Selection and Debate to enhance model performance in argument understanding and evaluation across the three tasks.
- We empirically analyze the role of reasoning within TIDE and demonstrate that reasoning ability is essential to fully unlocking the argumentative potential of DeepSeek-series models.

## 2 Related Works

### 2.1 Argument Mining

In the domain of argumentation-related tasks, prior research has predominantly focused on pretrained language models such as BERT Devlin et al. ([2019](https://arxiv.org/html/2605.17247#bib.bib10)) and its variants Sazid and Mercer ([2022](https://arxiv.org/html/2605.17247#bib.bib30)); Cheng et al. ([2020](https://arxiv.org/html/2605.17247#bib.bib5)). More recently, there has been an emerging trend of examining the performance of LLMs on these tasks Favero et al. ([2025](https://arxiv.org/html/2605.17247#bib.bib14)); Gorur et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib15)). For instance, Favero et al. ([2025](https://arxiv.org/html/2605.17247#bib.bib14)) investigated the application of open-source models such as Qwen and LLaMA for argument segmentation and argument type classification within educational settings. On the other hand, Gorur et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib15)) explored the performance of LLaMA and Mistral models with varying parameter scales on the task of relation prediction.

To the best of our knowledge, most existing studies have concentrated on either fine\-tuning pretrained language models or developing task\-specific adaptations of LLMs, leaving a gap in developing new forms of prompting strategies for argumentation\-related tasks\.

### 2.2 Debate by LLMs

There has been growing interest in exploring the interaction between LLMs and the concept of debate Liang et al. ([2023](https://arxiv.org/html/2605.17247#bib.bib20)); Irving et al. ([2018](https://arxiv.org/html/2605.17247#bib.bib17)). Some researchers treat debate as a scenario to probe and evaluate relevant capabilities of LLMs. For example, He et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib16)) employed LLMs to compose argumentative essays. In addition, Arnesen et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib1)) assigned them roles as participants and trained them to win debates, while Liang et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib19)) positioned them as judges to assess the quality of arguments along three dimensions.

Other researchers incorporate debate as a functional module within broader system architectures, treating LLMs as argumentative collaborators rather than mere answer providers Musi et al. ([2025](https://arxiv.org/html/2605.17247#bib.bib26)). For instance, Wu et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib37)) employed debate between different psychologist-agents to enhance the empathetic responses of their system in the domain of psychological diagnostics. Similarly, in the context of chain-of-thought (CoT) prompting, debate mechanisms have been introduced to make mathematical reasoning more robust Wan et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib35)).

### 2.3 Prompt Optimizing Strategies

Rather than relying on manually crafted prompts, which can be time-consuming, suboptimal, and difficult to generalize across tasks, an alternative approach is to leverage the LLM itself to generate or refine prompts. For example, APE Zhou et al. ([2022](https://arxiv.org/html/2605.17247#bib.bib42)) uses LLMs to propose candidate prompts, which are then evaluated and selected based on their performance on a given task. This approach achieves human-level performance on various tasks with minimal human input. Similarly, Yang et al. ([2023b](https://arxiv.org/html/2605.17247#bib.bib39)) introduce Optimization by PROmpting (OPRO), a method leveraging LLMs as optimizers. OPRO describes optimization tasks in natural language and iteratively generates new solutions based on previously evaluated ones. This approach is demonstrated in prompt optimization, where prompts are optimized to maximize task accuracy. Rather than optimizing the prompt itself, Liu et al. ([2023](https://arxiv.org/html/2605.17247#bib.bib22)) employed the in-context learning capability of LLMs to capture criteria from annotated examples; these criteria are then assessed and refined through self-correction, ultimately yielding a calibrated version of the criteria.

## 3 Method

### 3.1 Criteria-based prompt optimizing

Criteria are central to prompts for guiding LLMs toward better performance. For instance, users may include directives such as "…carefully analyze coherence, structure, and argumentation before giving the final score…" to improve alignment in scoring argumentative essays. However, how should the LLM understand coherence, and how much weight should be placed on it? Given the typical absence of a detailed rubric in user-provided prompts or annotation guidelines, criteria-based prompt optimization is used to generate and refine well-specified criteria, thereby enhancing the guidance for LLM predictions.

An overview of criteria-based prompt optimization is illustrated in Figure [1(a)](https://arxiv.org/html/2605.17247#S1.F1.sf1). Initially, Guider produces an initial version of the criteria, denoted as $c_0$, which in our context serves as a guideline for understanding and assessing argumentative essays. Then Solver is employed to perform the corresponding task with $c_0$, producing a predicted output $\hat{y}$ given an input $x \in \mathcal{X}$. By comparing the predicted output $\hat{y}$ with the ground truth $y$, Guider is guided to update the criteria from $c_0$ to $c_1$. The updated criteria are then evaluated on the training data again, and the refinement process continues.

However, several limitations exist in this vanilla process. First, since the pipeline is gradient-free, there is no guarantee that the updated criteria $c_{i+1}$ will lead to improved performance of Solver compared to the previous version $c_i$. Moreover, forcing Guider to update the criteria solely based on observed discrepancies may be problematic, as it could be influenced by noisy or unrepresentative data, potentially degrading overall performance.
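The vanilla loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the callables `draft`, `solve`, and `revise` are hypothetical stand-ins for the LLM calls made by Guider and Solver, and the toy functions below replace them with deterministic rules so the example runs offline.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (essay text, gold score); an assumed AES-style format


def vanilla_criteria_optimization(
    draft: Callable[[], str],
    solve: Callable[[str, str], int],
    revise: Callable[[str, str, int, int], str],
    train: List[Example],
    n_iter: int = 1,
) -> str:
    """Gradient-free refinement: every mismatch triggers a criteria update."""
    criteria = draft()  # c_0
    for _ in range(n_iter):
        for x, y in train:
            y_hat = solve(criteria, x)
            if y_hat != y:
                # Nothing checks that the revision helps, and a possibly
                # noisy label y is trusted unconditionally.
                criteria = revise(criteria, x, y_hat, y)
    return criteria


# Toy stand-ins for the LLM calls (hypothetical, for illustration only).
def toy_draft() -> str:
    return "v0"


def toy_solve(criteria: str, essay: str) -> int:
    return min(5, 1 + len(essay.split()) // 2)


def toy_revise(criteria: str, essay: str, y_hat: int, y: int) -> str:
    return criteria + f"|fix({y_hat}->{y})"


final = vanilla_criteria_optimization(
    toy_draft, toy_solve, toy_revise, train=[("a b c d", 3), ("a b", 4)]
)
print(final)
```

Both limitations are visible in the loop body: the revision is accepted without being re-scored, and a single mislabeled example is enough to change the criteria.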

### 3.2 Overall Framework of TIDE

We propose criteria optimizing with TrIal and DEbate (TIDE), which leverages Debate and Randomized Trial Selection to improve criteria-based prompt optimization, as shown in Figure [1(b)](https://arxiv.org/html/2605.17247#S1.F1.sf2). TIDE begins with Guider initializing the criteria draft $c_0$ and then iterates over batches $B$ of the training dataset $D_{train}$. For each batch, Solver generates predictions, the discrepancy of each sample is computed (see Section [3.4](https://arxiv.org/html/2605.17247#S3.SS4)), and the sample with the largest discrepancy is identified. A debate is then conducted between the predicted value $\hat{y}$ and the ground truth $y$ to determine whether an update is necessary (see Section [3.3](https://arxiv.org/html/2605.17247#S3.SS3)). If $\hat{y}$ wins the debate, the process proceeds to the next iteration without an update. Otherwise, the algorithm generates multiple candidate updates and selects the one with the minimal error on the same batch $B$ as the final update. Algorithm [1](https://arxiv.org/html/2605.17247#alg1) presents the detailed TIDE algorithm.

### 3.3 Debate module

**Algorithm 1** TIDE

- **Input:** training dataset $D_{train}$, max iteration $n_{Iter}$, batch size $bsz$, number of trials $n_{trial}$
- **Output:** refined criteria $c$
- Generate initial criteria $c_{current} \leftarrow c_0$
- **for** $i = 0$ **to** $n_{Iter}$ **do**
  - **for** batch $B \in D_{train}$ **do**
    - Employ $c_{current}$ to generate predictions $\hat{y} = \hat{y}_1, \cdots, \hat{y}_{bsz}$
    - Compute discrepancy for batch $b = b_1, \cdots, b_{bsz}$
    - Pick the sample with the largest discrepancy: $i_{max}, b_{i_{max}} = \max(b)$
    - Debate between $\hat{y}_{i_{max}}$ and $y$
    - **if** $y$ wins **then**
      - Generate $n_{trial}$ candidate updates $\hat{c}_{i+1} = \hat{c}_{i+1}^{1}, \cdots, \hat{c}_{i+1}^{n_{trial}}$ with $B[i_{max}]$
      - The minimal discrepancy $b_{min} \leftarrow \infty$
      - The final update $c_{final} \leftarrow \textbf{None}$
      - **for** $\hat{c} \in \hat{c}_{i+1}$ **do**
        - Compute discrepancy of $B$ using $\hat{c}$ to get $b^{\hat{c}}$
        - **if** $\max(b^{\hat{c}}) \leq b_{min}$ **then**
          - $b_{min} \leftarrow \max(b^{\hat{c}})$
          - $c_{final} \leftarrow \hat{c}$
        - **end if**
      - **end for**
      - $c_{current} \leftarrow c_{final}$
    - **end if**
  - **end for**
- **end for**
- **return** $c_{current}$
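Algorithm 1 translates fairly directly into Python. The following is a minimal sketch under stated assumptions rather than the authors' code: `solve`, `gold_wins_debate`, and `propose` are hypothetical callables standing in for LLM calls, criteria are plain strings, and the early `continue` for a batch with no error is an added assumption based on the statement in Section 3.3 that a debate is initiated for an incorrect prediction.

```python
import itertools
import math
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]  # (essay, gold score); an assumed AES-style toy format


def tide(
    c0: str,
    batches: Sequence[List[Example]],
    solve: Callable[[str, str], int],
    discrepancy: Callable[[int, int], float],
    gold_wins_debate: Callable[[str, Example, int], bool],
    propose: Callable[[str, Example, int], str],
    n_iter: int = 1,
    n_trial: int = 2,
) -> str:
    current = c0
    for _ in range(n_iter):
        for batch in batches:
            preds = [solve(current, x) for x, _ in batch]
            b = [discrepancy(p, y) for p, (_, y) in zip(preds, batch)]
            i_max = max(range(len(b)), key=b.__getitem__)
            if b[i_max] == 0:
                continue  # assumption: nothing to debate when the batch is error-free
            if not gold_wins_debate(current, batch[i_max], preds[i_max]):
                continue  # the prediction defended itself: keep the criteria
            # Randomized Trial Selection: sample n_trial candidate updates.
            candidates = [propose(current, batch[i_max], preds[i_max]) for _ in range(n_trial)]
            b_min, final = math.inf, current
            for cand in candidates:
                worst = max(discrepancy(solve(cand, x), y) for x, y in batch)
                if worst <= b_min:
                    b_min, final = worst, cand
            current = final
    return current


# Deterministic toy stand-ins (hypothetical): the "criteria" is an integer offset.
def toy_solve(criteria: str, essay: str) -> int:
    return max(1, min(5, len(essay.split()) + int(criteria)))


def toy_discrepancy(pred: int, gold: int) -> float:
    return abs(pred - gold)


_deltas = itertools.cycle([+1, -1])


def toy_propose(criteria: str, example: Example, pred: int) -> str:
    return str(int(criteria) + next(_deltas))


toy_batches = [[("a b", 3), ("a b c", 4)]]
refined = tide("0", toy_batches, toy_solve, toy_discrepancy,
               lambda c, ex, p: True, toy_propose)
print(refined)
```

As in Algorithm 1, each candidate is scored by its worst per-sample discrepancy on the same batch, and ties go to the later candidate because the comparison is non-strict.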

Many studies have demonstrated that debate serves as an effective mechanism for enhancing the truthfulness of system-generated responses, primarily because LLMs have been shown to struggle when attempting to defend false or inaccurate claims Michael et al. ([2023](https://arxiv.org/html/2605.17247#bib.bib24)); Khan et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib18)); Du et al. ([2023](https://arxiv.org/html/2605.17247#bib.bib12)). Leveraging this characteristic, and recognizing the natural compatibility between debate structures and argumentative tasks, we introduce a debate-based process aimed at mitigating the impact of noisy data in argumentative contexts. We further extend the evaluation process by incorporating a simple internal debate mechanism, which compares the predicted output with the gold reference. This simulated debate in turn controls the iteration condition, allowing the model to refine its outputs more selectively and robustly.

Concretely, when Solver produces an incorrect prediction $\hat{y}$ based on a given criterion $c_i$, a debate is initiated between the predicted output $\hat{y}$ and the correct label $y$. In this setting, the "speech" for $\hat{y}$ is the explanation originally generated by Solver at prediction time, while the "speech" for $y$ is constructed by prompting the LLM to generate a plausible explanation supporting the correct label. An LLM-based judge is then employed to assess which explanation better aligns with the annotation criterion (which differs by dataset). Consistent with Algorithm [1](https://arxiv.org/html/2605.17247#alg1), the update is triggered only when $y$ wins the debate; if $\hat{y}$ wins, the process proceeds to the next iteration without an update. This debate mechanism allows Guider to reduce the influence of potentially noisy annotations present in the training data, as the explanations supporting them would be less convincing, thereby enhancing the overall robustness of the framework and resulting in the training dynamics illustrated in Figure [2](https://arxiv.org/html/2605.17247#S5.F2).
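The debate gate can be sketched as a single function that builds the two "speeches" and asks a judge to choose. This is a hedged illustration only: the `llm` callable, the prompt wording, and the A/B answer format are all assumptions introduced here, not the prompts used in the paper.

```python
from typing import Callable


def gold_wins_debate(
    llm: Callable[[str], str],
    essay: str,
    y_hat: str,
    solver_explanation: str,
    y: str,
    annotation_criterion: str,
) -> bool:
    """Return True when the judge sides with the gold label y."""
    # Speech for the gold label: ask the model for a plausible justification.
    gold_speech = llm(
        f"Explain why the label '{y}' is correct for this essay:\n{essay}"
    )
    # Speech for the prediction is the explanation Solver already produced.
    verdict = llm(
        "You are a judge. Annotation criterion:\n"
        f"{annotation_criterion}\n\n"
        f"Essay:\n{essay}\n\n"
        f"Side A argues for '{y_hat}': {solver_explanation}\n"
        f"Side B argues for '{y}': {gold_speech}\n"
        "Answer with a single letter, A or B."
    )
    return verdict.strip().upper().startswith("B")


# Offline stand-in for an LLM client (hypothetical), so the sketch runs as is.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("You are a judge"):
        return "B"
    return "The essay states a clear position and supports it."


update_needed = gold_wins_debate(
    fake_llm,
    essay="School uniforms should be optional because ...",
    y_hat="2",
    solver_explanation="The structure is weak.",
    y="4",
    annotation_criterion="Score from 1 to 5 by argument quality.",
)
print(update_needed)
```

Returning a boolean keeps the gate easy to plug into the optimization loop: `True` means the gold label prevailed and the criteria should be updated.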

### 3.4 Randomized Trial Selection

We introduce Randomized Trial Selection when updating the criteria. Specifically, given the previous criteria $c_i$, multiple candidate updates, denoted as $\hat{c}_{i+1}^{0}, \hat{c}_{i+1}^{1}, \cdots$, are generated, where each candidate serves as an estimate of the potentially optimal update from $c_i$. More formally, an update is generated via $\hat{c}_{i+1} = \pi_{\theta}(\mathcal{T}, c_i)$, where $\mathcal{T}$ is the prompt template and $\pi_{\theta}$ is Guider with parameters $\theta$. We repeat this generation $n_{trial}$ times to obtain different candidates.

Among the generated candidates, the discrepancy between the prediction and the latent criteria embedded in the data is computed to identify the most promising update. We measure the discrepancy via prediction errors, whose definition differs by task. Specifically, for ACD we use the number of predicted labels that do not match the ground-truth labels, while for AES we use the absolute difference between the predicted and labeled scores. For ARI, which is more complicated, we consider both the error rate in identifying related pairs and the error rate in predicting the relation categories between them. Additionally, we introduce a penalty mechanism that imposes a higher penalty when the model fails to recognize the existence of any relationship between a pair of sequences, prior to predicting the exact type of relationship. Details of how the discrepancy is computed for each task can be found in Appendix [D](https://arxiv.org/html/2605.17247#A4). The candidate with the lowest discrepancy is then selected as the final update for $c_{i+1}$.
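The three discrepancy measures can be illustrated as follows. The AES and ACD functions follow the definitions in the text directly. The ARI function is only a guess at the general shape: the exact formula is in Appendix D, which is not reproduced here, so the weight `miss_penalty=2.0`, the normalization, and the one-relation-per-pair simplification are all assumptions.

```python
from typing import Dict, Sequence, Tuple

Pair = Tuple[int, int]  # (i_from, i_to) chunk indices


def aes_discrepancy(pred: int, gold: int) -> int:
    """Absolute difference between predicted and labeled scores."""
    return abs(pred - gold)


def acd_discrepancy(pred: Sequence[str], gold: Sequence[str]) -> int:
    """Number of predicted labels that differ from the gold labels.

    Assumes both sequences cover the same discourse units in the same order.
    """
    return sum(p != g for p, g in zip(pred, gold))


def ari_discrepancy(
    pred: Dict[Pair, str], gold: Dict[Pair, str], miss_penalty: float = 2.0
) -> float:
    """Assumed form: average cost over all pairs mentioned by either side."""
    pairs = set(pred) | set(gold)
    if not pairs:
        return 0.0
    cost = 0.0
    for pair in pairs:
        if pair in gold and pair not in pred:
            cost += miss_penalty  # failed to recognize an existing relation
        elif pair in pred and pair not in gold:
            cost += 1.0  # spurious relation
        elif pred[pair] != gold[pair]:
            cost += 1.0  # relation found, wrong category
    return cost / len(pairs)


gold_rel = {(1, 2): "support", (3, 2): "attack"}
pred_rel = {(1, 2): "attack", (4, 2): "support"}
print(aes_discrepancy(2, 4))
print(acd_discrepancy(["claim", "premise"], ["claim", "claim"]))
print(ari_discrepancy(pred_rel, gold_rel))
```

In the ARI toy case, one pair has the wrong category, one gold relation is missed (penalized more heavily), and one predicted relation is spurious.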

## 4 Experiment Setup

Table 1: Performance comparison with different baselines on ACD and ARI, both from the CEAMC dataset. All results in this table are presented in percent (%).

### 4.1 Task Format

In this work, we mainly focus on three representative argument\-related tasks:

Automated Essay Scoring (AES): This task involves assigning an overall score ranging from 1 to 5 to an input argumentative essay, based on the quality, coherence, and other relevant aspects of the argumentation.

Argument Component Detection (ACD): Given an argumentative essay $D$ consisting of $n$ discourse units, i.e., $D = [s^{1}, \ldots, s^{n}]$, the task requires predicting the fine-grained type of each sentence, resulting in a label sequence $\mathcal{AC} = \hat{y}^{1}, \ldots, \hat{y}^{n}$.

Argument Relation Identification (ARI): In this task, the input is an essay $D_c$ segmented into $m$ discourse chunks, i.e., $D_c = [c^{1}, \ldots, c^{m}]$. Given the argument component type of each chunk, the model is required to identify and classify all possible argument relations between discourse chunks, $\mathcal{AR} = \{(i_{from}, i_{to}, r)\}$ with $i_{from}, i_{to} \in [1, m]$. It is worth noting that relational instances are sparse within individual essays, leading to a high ratio of negative to positive samples, which makes this task more complex.
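To make the sparsity point concrete, the short sketch below counts candidate pairs for a toy essay. It assumes that candidates are ordered pairs of distinct chunks, which the task definition does not state explicitly, and the relation triples are made up for illustration.

```python
from itertools import permutations
from typing import List, Tuple

Relation = Tuple[int, int, str]  # (i_from, i_to, r)


def negative_to_positive_ratio(m: int, relations: List[Relation]) -> float:
    """Ratio of unrelated ordered chunk pairs to related ones in one essay."""
    positives = {(i, j) for i, j, _ in relations}
    if not positives:
        raise ValueError("essay has no annotated relations")
    # Assumption: every ordered pair of distinct chunks is a candidate.
    candidates = set(permutations(range(1, m + 1), 2))
    return len(candidates - positives) / len(positives)


# A toy essay with m = 5 chunks and three annotated relations.
toy_relations = [(2, 1, "support"), (3, 1, "support"), (4, 3, "attack")]
ratio = negative_to_positive_ratio(5, toy_relations)
print(round(ratio, 2))
```

Because the number of candidate pairs grows quadratically with $m$ while annotated relations grow roughly with the number of chunks, the imbalance becomes more pronounced for longer essays.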

The three tasks are arranged in ascending order of difficulty. The AES task is rather simple, while the ACD task requires more domain knowledge yet remains manageable and relatively easy in general. The ARI task, however, combines complexity and domain-specific expertise, posing a significant challenge to the capabilities of LLMs.

### 4.2 Dataset

- AEE Stab and Gurevych ([2017](https://arxiv.org/html/2605.17247#bib.bib31)) was annotated on student-written essays. A stance on a controversial theme is expressed by a major claim component as well as claim components, and premise components justify or refute the claims. Attack and support labels are defined as relations.
- CEAMC Ren et al. ([2025](https://arxiv.org/html/2605.17247#bib.bib28)) also contains argumentative essays written by students. It defines 10 fine-grained categories of argument components and 14 types of argument relations, posing additional challenges in the comprehension and evaluation of argumentative essays. Moreover, in CEAMC a single sequence pair may be associated with multiple relation types, which makes the task harder still.
- ASAP 2.0 Crossley et al. ([2025](https://arxiv.org/html/2605.17247#bib.bib7)) is a representative essay-scoring dataset containing a large number of student-written argumentative essays. Given its large scale, we sampled 1000 essays to conduct experiments.
- ArGPT Rocha et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib29)) includes arguments generated by ChatGPT together with annotated quality labels. We incorporate this dataset into our experiments with the aim of providing insights for applications involving LLM-based scoring or evaluation.

Additional details about these datasets are provided in Appendix [E](https://arxiv.org/html/2605.17247#A5).

### 4.3 Baselines and Evaluation metrics

We employ DeepSeek-R1 as the base model (for details, see Appendix [A](https://arxiv.org/html/2605.17247#A1)) and report the performance of several existing methods as baselines, including In-Context Learning (ICL), Chain of Thought (CoT), and Calibrate Liu et al. ([2023](https://arxiv.org/html/2605.17247#bib.bib22)), a prompt optimization framework that follows a three-stage process of drafting, filtering, and refining to generate the final criteria.

For the AES task, we additionally compare our approach with the methods proposed by Stahl et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib32)), who investigated various prompt strategies for automated scoring of argumentative essays. Specifically, we select the Feedback_dCoT $\rightarrow$ Score and Explanation $\rightarrow$ Score strategies as baselines due to their relatively strong performance as reported in the original study. Different subtasks are evaluated using task-specific metrics:

1. AES: For this scoring task, we use Quadratic Weighted Kappa (QWK) as the primary evaluation metric, providing a nuanced measure of inter-rater reliability by considering the relative difference between scores.
2. ACD and ARI: For these classification tasks, we report both Micro F1 and Macro F1 scores (denoted as Micro and Macro in tables) to provide a comprehensive evaluation.
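For readers unfamiliar with these metrics, the following self-contained sketch computes QWK and Micro/Macro F1 using their standard textbook definitions. It is not the evaluation script used in the paper, and the single-label F1 shown here is a simplification: ARI on CEAMC allows several relation types per pair, which would need a multi-label variant.

```python
from typing import List, Sequence, Tuple


def quadratic_weighted_kappa(
    gold: Sequence[int], pred: Sequence[int], min_score: int = 1, max_score: int = 5
) -> float:
    """QWK = 1 - sum(w * O) / sum(w * E), with w_ij = (i - j)^2 / (k - 1)^2."""
    k = max_score - min_score + 1
    n = len(gold)
    observed: List[List[int]] = [[0] * k for _ in range(k)]
    for g, p in zip(gold, pred):
        observed[g - min_score][p - min_score] += 1
    hist_g = [sum(row) for row in observed]
    hist_p = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_g[i] * hist_p[j] / n  # expected count under independence
    return 1.0 if den == 0 else 1.0 - num / den


def micro_macro_f1(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float]:
    labels = sorted(set(gold) | set(pred))
    tp = dict.fromkeys(labels, 0)
    fp = dict.fromkeys(labels, 0)
    fn = dict.fromkeys(labels, 0)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


qwk_reversed = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], min_score=1, max_score=3)
micro, macro = micro_macro_f1(
    ["claim", "claim", "claim", "premise"], ["claim", "claim", "claim", "claim"]
)
print(qwk_reversed, micro, round(macro, 4))
```

The F1 example shows why both averages are reported: a predictor that always outputs the frequent label keeps a high Micro F1 while its Macro F1 drops sharply, because the rare label contributes an F1 of zero.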

## 5 Main Results

The main results of our experiments are shown in Table [1](https://arxiv.org/html/2605.17247#S4.T1) and Table [2](https://arxiv.org/html/2605.17247#S5.T2), where we report the performance of our framework in all three subtasks on different datasets.

In the AES task, the Criteria-based approach yielded relatively high performance in terms of QWK. This pattern suggests that Guider is effective at capturing the relative ranking or category of the scores. TIDE achieved an even higher QWK, which demonstrates more consistent and robust performance and lower sensitivity to individual samples that could otherwise compromise overall model effectiveness. As illustrated in Figure [2](https://arxiv.org/html/2605.17247#S5.F2), the reduced error magnitude further substantiates the efficacy of our proposed design.

Table 2: Performance comparison with different baselines on AES. In this table, 'Feed' and 'Exp' represent Feedback_dCoT and Explanation from Stahl et al. ([2024](https://arxiv.org/html/2605.17247#bib.bib32)), while 'Criteria' represents criteria-based prompt optimization. All results in this table are presented in percent (%).

For the ACD task, the Criteria-based method attained the second-highest Micro F1 score but performed worst in terms of Macro F1. This gap indicates that Guider is particularly biased towards frequent labels. Conversely, TIDE maintains balanced performance, demonstrating the ability to correctly classify both frequent and rare labels and thus remaining robust across the label distribution.

Figure 2: The error dynamics during the optimization process for AES (x-axis: Iteration; y-axis: Error; series: Criteria-based and TIDE), where error is computed as the absolute difference between predicted and labeled scores.

Table 3: Predictions of whether there is a relation between two chunks of discourse in the ARI task on CEAMC. All results in this table are presented in percent (%).

![Refer to caption](https://arxiv.org/html/2605.17247v1/x3.png) (a) $n_{trial}$ = 2
![Refer to caption](https://arxiv.org/html/2605.17247v1/x4.png) (b) $n_{trial}$ = 4

Figure 3: Debate wins for AES on CEAMC in different settings.

In the ARI task, TIDE surpassed all other methods in both Micro and Macro F1 scores. To gain deeper insights, we analyzed whether the models were capable of correctly identifying the presence of relations between chunks, i.e., telling positive samples from negative ones. As presented in Table [3](https://arxiv.org/html/2605.17247#S5.T3), the Criteria-based approach improved the detection of discourse relations relative to the ICL baseline, and also enhanced relation-type classification (Table [1](https://arxiv.org/html/2605.17247#S4.T1)). Notably, although Calibrate and TIDE exhibited comparable performance in general metrics, TIDE significantly outperformed Calibrate on ARI evaluations. These results further underscore the effectiveness and generalizability of our proposed framework. The output criteria can be found in Appendix [F](https://arxiv.org/html/2605.17247#A6).

### 5.1 Reasoning Models Lead to Better Performance

Table [4](https://arxiv.org/html/2605.17247#S5.T4) presents an evaluation of different models employed as Guider and Solver. Our primary focus is on assessing the impact of the reasoning capability of DeepSeek-R1, which proves to be essential for generating and optimizing criteria, as shown in the table.

Table 4: Different combinations of (Guider+Solver). For example, R1+V3 represents employing DeepSeek-R1 as Guider and DeepSeek-V3 as Solver. QwQ+QwQ represents employing QwQ-32B for both Guider and Solver. All results in this table are presented in percent (%).

The results demonstrate that utilizing DeepSeek-R1 as both Guider and Solver leads to significant performance improvements across all tasks. More specifically, when the reasoning ability is removed from Guider (i.e., the V3+R1 setting in Table [4](https://arxiv.org/html/2605.17247#S5.T4)), TIDE suffers a substantial decline in performance across AES, ACD, and ARI. This finding highlights the critical role of a reasoning-capable Guider in driving the framework's success. Furthermore, although the R1+V3 setting performs better than V3+R1, it still underperforms the R1+R1 configuration. This indicates that a strong Solver is also necessary to fully leverage the benefits of the framework. We also include the results of o1-mini and QwQ in the table; DeepSeek-R1 consistently outperforms both across all tasks. This is possibly due to their pretraining focus on mathematical and code reasoning tasks rather than argumentation El-Kishky ([2024](https://arxiv.org/html/2605.17247#bib.bib13)); Team ([2025](https://arxiv.org/html/2605.17247#bib.bib33)).

In addition, to further study the generalization of TIDE, we tested a smaller-scale model, Qwen2.5-14B, using criteria derived from DeepSeek-R1's TIDE runs (the Qwen-R1 row in the table). The results show clear performance improvements across all three tasks compared to the native Qwen-TIDE setup, even outperforming o1-mini and QwQ, and coming reasonably close to R1+V3 in Table [4](https://arxiv.org/html/2605.17247#S5.T4). These findings demonstrate the generalization capability of TIDE and show that it can reduce costs in the inference phase.

### 5\.2Complex Tasks need More Trial

In this section we discuss different settings of TIDE, which mainly includes Trial and batch size, which is presented in Table[5](https://arxiv.org/html/2605.17247#S5.T5)\. It is evident that, for the ARI task—which is comparatively more complex and requires a higher degree of domain knowledge—a larger number of Trial is more suitable\. This is becauseGuiderbenefits from multiple refinement steps to achieve more effective model updates\. The impact of Trial is further accentuated when considering the effect of batch size: smaller batch sizes consistently yield better performance, likely due to improved granularity in learning signal and reduced averaging effects\. In contrast, for the AES task—which is relatively simpler in nature—increasing the number of Trial or using smaller batch sizes tends to degrade performance\. Specifically, such configurations result in reduced QWK scores, thereby mimicking the shortcomings observed in the Criteria\-based approach, as discussed in Table[1](https://arxiv.org/html/2605.17247#S4.T1)\. As for ACD, which lies between AES and ARI in terms of task complexity, optimal performance is achieved with moderate settings\. This suggests that both the number of Trial iterations and batch size should be chosen to reflect the intermediate difficulty of the task, balancing the trade\-off between refinement and overfitting\.

Table 5: Different settings of Trial and batch size in TIDE\. When varying Trial, we set the batch size to 2\. When varying the batch size, we set Trial to 2\. All results in this table are reported as percentages \(%\)\.

### 5\.3 Simple Tasks need More Debate

The results in Table [6](https://arxiv.org/html/2605.17247#S5.T6) clearly demonstrate that incorporating the Debate mechanism significantly improves both Pearson and QWK scores, which indicates its effectiveness in reducing noise and enabling more accurate and stable predictions during training\.
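For readers less familiar with the metric, the quadratic weighted kappa \(QWK\) reported for AES can be sketched as below\. This follows the standard textbook definition and is not the authors' evaluation code; the function name `quadratic_weighted_kappa` and the `min_rating`/`max_rating` arguments are illustrative assumptions, with the default 1 to 5 range mirroring the five CEAMC score levels\.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    min_rating: int = 1,   # assumption: CEAMC-style five-level scores
    max_rating: int = 5,
) -> float:
    """Standard QWK between gold and predicted integer ratings."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n_cat = max_rating - min_rating + 1
    if n_cat < 2:
        raise ValueError("need at least two rating categories")
    for r in list(y_true) + list(y_pred):
        if not min_rating <= r <= max_rating:
            raise ValueError(f"rating {r} is outside the allowed range")

    n = len(y_true)
    # Observed confusion matrix: rows are gold ratings, columns predictions.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(row[j] for row in observed) for j in range(n_cat)]

    disagreement = 0.0  # weighted observed disagreement
    expected = 0.0      # weighted disagreement expected by chance
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            disagreement += weight * observed[i][j]
            expected += weight * hist_true[i] * hist_pred[j] / n
    if expected == 0.0:
        # Both raters used one identical category; treat as full agreement.
        return 1.0
    return 1.0 - disagreement / expected
```

Because the weights grow quadratically with the distance between ratings, a prediction that is off by two levels is penalized four times as heavily as one that is off by a single level\.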

Table 6: Ablation study of Debate in AES on CEAMC\. We additionally report the Pearson coefficient to provide more insight\.

The win rate of Solver in debate during AES execution is illustrated in Figure [3](https://arxiv.org/html/2605.17247#S5.F3)\. Notably, under the $n_{trial}=4$ setting, Solver achieves a substantially higher win rate than the Gold Data\. This may, to some extent, hinder the system’s ability to continuously improve through training\. In contrast, with $n_{trial}=2$, Solver exhibits a considerably lower win rate, suggesting that the system continues to optimize, which corresponds to the superior performance reported in Table [1](https://arxiv.org/html/2605.17247#S4.T1)\. Further discussion and details on the win rate are provided in Appendix [C](https://arxiv.org/html/2605.17247#A3)\.

## 6 Conclusion

In this work, we propose TIDE, a novel framework designed to improve the understanding and evaluation of argumentative essays through a combination of iterative refinement and structured interaction between two main roles, Guider and Solver\. Through our studies, we show that key components such as the Trial mechanism and the Debate module play pivotal roles in balancing robustness and performance\. Specifically, we find that task complexity should guide the configuration of these components: simpler tasks like AES benefit from minimal refinement but need more debate, whereas more complex tasks like ARI require deeper iterative processing to achieve optimal outcomes\.

Moreover, our analysis reveals that, for DeepSeek series models, the reasoning ability of the underlying models is crucial: both the Guider and the Solver must possess strong reasoning capabilities to fully realize the potential of TIDE\. In addition, weaker, small\-scale models can also benefit from the criteria produced by these powerful ones, which shows the generalizability of the framework\.

## Limitations

While TIDE demonstrates promising performance across a range of educational NLP tasks, several limitations remain that warrant further exploration\.

- Debate Protocol Variants\. In the current implementation of TIDE, we adopt a conventional debate protocol in which two parties represent opposing stances and take turns to defend their respective positions\. Prior work \(Khan et al\., [2024](https://arxiv.org/html/2605.17247#bib.bib18)\) has proposed alternative debate paradigms, including consultancy\-style discussions, structured debates, and interactive debates that allow for dynamic exchanges\. Exploring how these different debate protocols influence the quality and stability of prompt optimization within TIDE may lead to further performance gains and a deeper understanding of model reasoning dynamics\.
- Cross\-domain Generalizability\. Although our experiments demonstrate the effectiveness of TIDE across three argumentation\-related tasks in the educational domain, its applicability to other fields remains uncertain\. Domains such as law, medicine, and policy analysis present unique linguistic structures, reasoning requirements, and data distributions\. Investigating the adaptability of TIDE to these high\-stakes or domain\-specific applications, particularly in terms of prompt robustness, criteria abstraction, and reasoning fidelity, constitutes an important direction for future research\.

## References

- Arnesen et al\. \(2024\)Samuel Arnesen, David Rein, and Julian Michael\. 2024\.[Training language models to win debates with self\-play improves judge accuracy](https://api.semanticscholar.org/CorpusID:272881215)\.*ArXiv*, abs/2409\.16636\.
- Bai et al\. \(2023\)Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenhang Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, K\. Lu, and 31 others\. 2023\.[Qwen technical report](https://api.semanticscholar.org/CorpusID:263134555)\.*ArXiv*, abs/2309\.16609\.
- Barwise \(1993\)Jon Barwise\. 1993\.Everyday reasoning and logical inference\.*Behavioral and Brain Sciences*, 16\(2\):337–338\.
- Beltagy et al\. \(2020\)Iz Beltagy, Matthew E\. Peters, and Arman Cohan\. 2020\.[Longformer: The long\-document transformer](https://api.semanticscholar.org/CorpusID:215737171)\.*ArXiv*, abs/2004\.05150\.
- Cheng et al\. \(2020\)Liying Cheng, Lidong Bing, Qian Yu, Wei Lu, and Luo Si\. 2020\.[Argument pair extraction from peer review and rebuttal via multi\-task learning](https://api.semanticscholar.org/CorpusID:227035335)\.In*Conference on Empirical Methods in Natural Language Processing*\.
- Contributors et al\. \(2024\)Foundational Contributors, Ahmed El\-Kishky, Daniel Selsam, Francis Song, Giambattista Parascandolo, Hongyu Ren, Hunter Lightman, Hyung Won, Ilge Akkaya, Ilya Sutskever, Jason Wei, Jonathan Gordon, Karl Cobbe, Kevin Yu, Lukasz Kondraciuk, Max Schwarzer, Mostafa Rohaninejad, Noam Brown, Shengjia Zhao, and 189 others\. 2024\.[Openai o1 system card](https://api.semanticscholar.org/CorpusID:274611667)\.*ArXiv*, abs/2412\.16720\.
- Crossley et al\. \(2025\)Scott A\. Crossley, Perpetual Baffour, L\. Burleigh, and Jules King\. 2025\.[A large\-scale corpus for assessing source\-based writing quality: Asap 2\.0](https://doi.org/10.1016/j.asw.2025.100954)\.*Assessing Writing*, 65:100954\.
- DeepSeek\-AI et al\. \(2025\)DeepSeek\-AI, Daya Guo, Dejian Yang, Haowei Zhang, Jun\-Mei Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaoling Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z\. F\. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and 179 others\. 2025\.[Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning](https://api.semanticscholar.org/CorpusID:275789950)\.*ArXiv*, abs/2501\.12948\.
- DeepSeek\-AI et al\. \(2024\)DeepSeek\-AI, Aixin Liu, Bei Feng, Bing Xue, Bing\-Li Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dong\-Li Ji, Erhang Li, Fangyun Lin, Fucong Dai, and 179 others\. 2024\.[Deepseek\-v3 technical report](https://api.semanticscholar.org/CorpusID:275118643)\.*ArXiv*, abs/2412\.19437\.
- Devlin et al\. \(2019\)Jacob Devlin, Ming\-Wei Chang, Kenton Lee, and Kristina Toutanova\. 2019\.[Bert: Pre\-training of deep bidirectional transformers for language understanding](https://api.semanticscholar.org/CorpusID:52967399)\.In*North American Chapter of the Association for Computational Linguistics*\.
- Drury et al\. \(2019\)Jeffrey P\. Mehltretter Drury, Nicholas S\. Paliewicz, and Sara A\. Mehltretter Drury\. 2019\.[Argument pedagogy for everyday life](https://api.semanticscholar.org/CorpusID:146011477)\.*Journal of Communication Pedagogy*\.
- Du et al\. \(2023\)Yilun Du, Shuang Li, Antonio Torralba, Joshua B\. Tenenbaum, and Igor Mordatch\. 2023\.[Improving factuality and reasoning in language models through multiagent debate](https://api.semanticscholar.org/CorpusID:258841118)\.*ArXiv*, abs/2305\.14325\.
- El\-Kishky \(2024\)Ahmed El\-Kishky\. 2024\.[Openai o1 system card](https://api.semanticscholar.org/CorpusID:272648256)\.*ArXiv*, abs/2412\.16720\.
- Favero et al\. \(2025\)Lucile Favero, Juan Antonio Pérez\-Ortiz, Tanja Käser, and Nuria Oliver\. 2025\.[Leveraging small llms for argument mining in education: Argument component identification, classification, and assessment](https://api.semanticscholar.org/CorpusID:276482778)\.*ArXiv*, abs/2502\.14389\.
- Gorur et al\. \(2024\)Deniz Gorur, Antonio Rago, and Francesca Toni\. 2024\.[Can large language models perform relation\-based argument mining?](https://api.semanticscholar.org/CorpusID:267750218)*ArXiv*, abs/2402\.11243\.
- He et al\. \(2024\)Yuhang He, Jianzhu Bao, Yang Sun, Bin Liang, Min Yang, Bing Qin, and Ruifeng Xu\. 2024\.[Decomposing argumentative essay generation via dialectical planning of complex reasoning](https://api.semanticscholar.org/CorpusID:271860879)\.In*Annual Meeting of the Association for Computational Linguistics*\.
- Irving et al\. \(2018\)Geoffrey Irving, Paul Francis Christiano, and Dario Amodei\. 2018\.[Ai safety via debate](https://api.semanticscholar.org/CorpusID:22050710)\.*ArXiv*, abs/1805\.00899\.
- Khan et al\. \(2024\)Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R\. Bowman, Tim Rocktäschel, and Ethan Perez\. 2024\.[Debating with more persuasive llms leads to more truthful answers](https://api.semanticscholar.org/CorpusID:267627652)\.*ArXiv*, abs/2402\.06782\.
- Liang et al\. \(2024\)Jingcong Liang, Rong Ye, Meng Han, Ruofei Lai, Xinyu Zhang, Xuanjing Huang, and Zhongyu Wei\. 2024\.[Debatrix: Multi\-dimensional debate judge with iterative chronological analysis based on llm](https://api.semanticscholar.org/CorpusID:268379278)\.In*Annual Meeting of the Association for Computational Linguistics*\.
- Liang et al\. \(2023\)Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi\. 2023\.[Encouraging divergent thinking in large language models through multi\-agent debate](https://api.semanticscholar.org/CorpusID:258967540)\.*ArXiv*, abs/2305\.19118\.
- Liu et al\. \(2019\)Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov\. 2019\.[Roberta: A robustly optimized bert pretraining approach](https://api.semanticscholar.org/CorpusID:198953378)\.*ArXiv*, abs/1907\.11692\.
- Liu et al\. \(2023\)Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang\. 2023\.[Calibrating llm\-based evaluator](https://api.semanticscholar.org/CorpusID:262464745)\.In*International Conference on Language Resources and Evaluation*\.
- Lu \(2021\)Chunxia Lu\. 2021\.[Infusing critical thinking skills into argumentative writing: A study of chinese college college learners](https://api.semanticscholar.org/CorpusID:239046459)\.*English Language and Literature Studies*\.
- Michael et al\. \(2023\)Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, and Samuel R\. Bowman\. 2023\.[Debate helps supervise unreliable experts](https://api.semanticscholar.org/CorpusID:265213107)\.*ArXiv*, abs/2311\.08702\.
- Mombaers et al\. \(2024\)Tine Mombaers, Roos Van Gasse, and Sven De Maeyer\. 2024\.[Learning from compa\(i\)ring exemplars: Enhancing genre knowledge of argumentative texts](https://api.semanticscholar.org/CorpusID:270467114)\.*Journal of Writing Research*\.
- Musi et al\. \(2025\)Elena Musi, Nadin Kokciyan, Khalid Al\-Khatib, Davide Ceolin, Emmanuelle Dietz, Klara Gutekunst, Annette Hautli\-Janisz, Cristian Manuel Santibañez Yañez, Jodi Schneider, Jonas Scholz, and 1 others\. 2025\.Toward reasonable parrots: Why large language models should argue with us by design\.*arXiv preprint arXiv:2505\.05298*\.
- Ren et al\. \(2024\)Yupei Ren, Hongyi Wu, Zhaoguang Long, Shangqing Zhao, Xinyi Zhou, Zheqin Yin, Xinlin Zhuang, Xiaopeng Bai, and Man Lan\. 2024\.[Ceamc: Corpus and empirical study of argument analysis in education via llms](https://api.semanticscholar.org/CorpusID:274060298)\.In*Conference on Empirical Methods in Natural Language Processing*\.
- Ren et al\. \(2025\)Yupei Ren, Xinyi Zhou, Ning Zhang, Shangqing Zhao, Man Lan, and Xiaopeng Bai\. 2025\.[Towards comprehensive argument analysis in education: Dataset, tasks, and method](https://arxiv.org/abs/2505.12028)\.*Preprint*, arXiv:2505\.12028\.
- Rocha et al\. \(2024\)Victor Hugo Nascimento Rocha, Igor Cataneo Silveira, Paulo Pirozelli, Denis Deratani Mauá, and Fábio Gagliardi Cozman\. 2024\.[Assessing good, bad and ugly arguments generated by chatgpt: a new dataset, its methodology and associated tasks](https://api.semanticscholar.org/CorpusID:266757593)\.In*Portuguese Conference on Artificial Intelligence*\.
- Sazid and Mercer \(2022\)Muhammad Tawsif Sazid and Robert E\. Mercer\. 2022\.[A unified representation and a decoupled deep learning architecture for argumentation mining of students’ persuasive essays](https://api.semanticscholar.org/CorpusID:252819316)\.In*Workshop on Argument Mining*\.
- Stab and Gurevych \(2017\)Christian Stab and Iryna Gurevych\. 2017\.[Parsing argumentation structures in persuasive essays](https://doi.org/10.1162/COLI_a_00295)\.*Computational Linguistics*, 43\(3\):619–659\.
- Stahl et al\. \(2024\)Maja Stahl, Leon Biermann, Andreas Nehring, and Henning Wachsmuth\. 2024\.[Exploring llm prompting strategies for joint essay scoring and feedback generation](https://api.semanticscholar.org/CorpusID:269362090)\.*ArXiv*, abs/2404\.15845\.
- Team \(2025\)Qwen Team\. 2025\.[Qwq\-32b: Embracing the power of reinforcement learning](https://qwenlm.github.io/blog/qwq-32b/)\.
- Ulfa and Purwati \(2023\)Siti Maria Ulfa and Oikurema Purwati\. 2023\.[Argumentative essay patterns produced by university students](https://api.semanticscholar.org/CorpusID:265027617)\.*Journal of English Education and Teaching*\.
- Wan et al\. \(2024\)Guangya Wan, Yuqi Wu, Jie Chen, and Sheng Li\. 2024\.Cot rerailer: Enhancing the reliability of large language models in complex reasoning tasks through error detection and correction\.*arXiv preprint arXiv:2408\.13940*\.
- Wang et al\. \(2020\)Hao Wang, Zhen Huang, Yong Dou, and Yu Hong\. 2020\.[Argumentation mining on essays at multi scales](https://api.semanticscholar.org/CorpusID:227230893)\.In*International Conference on Computational Linguistics*\.
- Wu et al\. \(2024\)Yijie Wu, Shi Feng, Ming Wang, Daling Wang, and Yifei Zhang\. 2024\.[Llm\-based empathetic response through psychologist\-agent debate](https://api.semanticscholar.org/CorpusID:272511985)\.In*APWeb/WAIM*\.
- Yang et al\. \(2023a\)Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen\. 2023a\.Large language models as optimizers\.*arXiv preprint arXiv:2309\.03409*\.
- Yang et al\. \(2023b\)Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V\. Le, Denny Zhou, and Xinyun Chen\. 2023b\.[Large language models as optimizers](https://api.semanticscholar.org/CorpusID:261582296)\.*ArXiv*, abs/2309\.03409\.
- Yang et al\. \(2024\)Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, and 25 others\. 2024\.[Qwen2\.5 technical report](https://api.semanticscholar.org/CorpusID:274859421)\.*ArXiv*, abs/2412\.15115\.
- Zeng et al\. \(2024\)Team Glm Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, and 36 others\. 2024\.[Chatglm: A family of large language models from glm\-130b to glm\-4 all tools](https://api.semanticscholar.org/CorpusID:270562306)\.*ArXiv*, abs/2406\.12793\.
- Zhou et al\. \(2022\)Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba\. 2022\.[Large language models are human\-level prompt engineers](https://api.semanticscholar.org/CorpusID:253265328)\.*ArXiv*, abs/2211\.01910\.

## Appendix A Implementation Details

In this work, we mainly employ DeepSeek\-R1 as the backbone model, owing to its strong long\-CoT reasoning capability and low cost\. We also compare it with o1\-mini Contributors et al\. \([2024](https://arxiv.org/html/2605.17247#bib.bib6)\), QwQ Team \([2025](https://arxiv.org/html/2605.17247#bib.bib33)\), the small model Qwen2\.5\-14B Yang et al\. \([2024](https://arxiv.org/html/2605.17247#bib.bib40)\), and DeepSeek\-V3 DeepSeek\-AI et al\. \([2024](https://arxiv.org/html/2605.17247#bib.bib9)\), which is the base model of DeepSeek\-R1 but does not feature o1\-like reasoning, in order to investigate the influence of reasoning ability in our framework\.

To better guide Guider in generating more aligned initial criteria, we incorporate several in\-context demonstrations during this stage\. In addition, unless explicitly stated, the batch size is set to 2 in our experiments\. Furthermore, to minimize the influence from the Solver model, we set its temperature to 0\.7\.

For the baselines, we include two demonstrations in the context of both ICL and CoT\. To mitigate prompt\-induced discrepancy, we adopt the same prompt template used for Solver when generating predictions in ICL, and append “Let’s think step by step…” for CoT\. Similarly, for Calibrate, we follow the same setup, but replace the updating criteria with the four atomic editing operations proposed in the original paper\.

## Appendix B Token Budgets

Given the iterative nature of TIDE, it inherently entails a relatively high token budget\. This aspect constitutes one of the primary motivations for selecting DeepSeek\-R1 as the base model for TIDE, owing to its favorable cost\-efficiency and competitive performance\. The token consumption of both the Criteria\-based and TIDE methods during the prompt optimization process is documented in Table [7](https://arxiv.org/html/2605.17247#A2.T7)\.

Table 7: Token budget for the Criteria\-based method and TIDE, where DeepSeek\-R1 was adopted as the backbone model\.

Intuitively, the introduction of the Debate and Trial modules should increase TIDE’s token consumption compared to the Criteria\-based method, as the price of improved performance\. However, as shown in the table, we observe the opposite trend on AES\. One possible reason is that the Guider achieves a certain number of victories during the Debate module, thus reducing the number of updates\.

We also incorporate insights aimed at reducing the overall cost, as discussed in Section[5\.1](https://arxiv.org/html/2605.17247#S5.SS1)\. During inference, an alternative approach is to employ lightweight models rather than powerful but costly ones, leveraging the optimized criteria to predict labels\. Although this substitution may incur a slight performance degradation, it can significantly reduce API invocation costs or GPU resource consumption\.

## Appendix C Debate Win Rate

To further examine the win rate of Solver, we evaluate the performance at iteration 180, the point where Solver’s win rate surpasses that of the Gold Data under Trial=4\. The results, presented in Table [8](https://arxiv.org/html/2605.17247#A3.T8), reveal a noticeable performance jump, particularly in the Pearson and QWK metrics, between iteration 180 and iteration 300\. Further investigation into controlling Solver’s win rate remains an open direction for future work\.

Additionally, we observe that Solver rarely wins debates during training in ACD and ARI\. For instance, with a batch size of 2, only five wins are observed in ACD and two in ARI over 240 iterations\. We argue that this happens because these tasks are relatively complex compared with the scoring task, where the model itself is already capable \(though not aligned with the data\)\.

Table 8: Performance under different settings in the AES task, where Trial=4 and the batch size is set to 2\.

## Appendix D Discrepancy Computation

For different tasks, we employ different methods for computing the discrepancy, all at the document level\. For AES, we use the absolute difference between $y$ and $\hat{y}$ as the discrepancy; for ACD, we use the number of labels across the discourse units of a document that mismatch $y$\.

$$d_{AES}=\mathrm{abs}(y-\hat{y})$$

$$d_{ACD}=|\{\hat{y}_{i} \mid \hat{y}_{i}\neq y_{i}\}|$$
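
The two document\-level discrepancies above can be sketched in a few lines\. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, AES scores are assumed to be integers, and ACD labels are assumed to be given as two aligned sequences with one label per discourse unit\.

```python
from typing import Sequence


def aes_discrepancy(y: int, y_hat: int) -> int:
    """d_AES: absolute difference between gold and predicted score."""
    return abs(y - y_hat)


def acd_discrepancy(y: Sequence[str], y_hat: Sequence[str]) -> int:
    """d_ACD: number of discourse units whose predicted label differs."""
    if len(y) != len(y_hat):
        # Assumption: predictions are aligned one-to-one with gold labels.
        raise ValueError("gold and predicted label sequences must align")
    return sum(1 for gold, pred in zip(y, y_hat) if gold != pred)
```

For example, a gold score of 4 with a predicted score of 2 gives a discrepancy of 2, and a document with one mislabeled discourse unit gives an ACD discrepancy of 1\.
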
For ARI, in addition to mismatched labels between individual pairs of chunks \(i\.e\., predicted labels not appearing in the ground truth $y$, and ground truth labels not present in the predictions $\hat{y}$\), we also consider the accuracy of pairwise identification\. To further enhance the precision of index extraction, we impose a penalty on cases with incorrect index predictions, as illustrated in Algorithm [2](https://arxiv.org/html/2605.17247#alg2)\. Specifically, we count the number of predicted pairs that do not exist in $y$, as well as the number of ground truth pairs in $y$ that are not predicted\. These errors are penalized with a weight of 2 in our experiments\. This design encourages TIDE to more accurately identify chunk pairs with relational links, as demonstrated in Table [3](https://arxiv.org/html/2605.17247#S5.T3)\.

Figure 4: Error dynamic during training for ACD \(x\-axis: Iteration; y\-axis: Error; series: Criteria\-based and TIDE\)\.

Figure 5: Error dynamic during training for ARI \(x\-axis: Iteration; y\-axis: Error; series: Criteria\-based and TIDE\)\.

Algorithm 2: Computation for $Discrepancy_{ari}$

1. Input: ground truth pairs $Y$, predicted pairs $\hat{Y}$, penalty $p$
2. Output: discrepancy for $\hat{Y}$
3. $d_{ARI}\leftarrow 0$
4. for $\hat{y}\in\hat{Y}$ do
5. $y \leftarrow$ the ground truth pair that matches in index, i\.e\., $y\in Y$ with $y_{from}=\hat{y}_{from}$ and $y_{to}=\hat{y}_{to}$
6. if $y$ is not null then
7. $d_{ARI}\leftarrow d_{ARI}+$ mismatched labels between $y$ and $\hat{y}$
8. else
9. $d_{ARI}\leftarrow d_{ARI}+$ number of labels in $\hat{y}$ $+\,2\times p$
10. end if
11. end for
12. for $y\in Y$ do
13. if $y$ was not predicted in the previous loop then
14. $d_{ARI}\leftarrow d_{ARI}+$ number of labels in $y$ $+\,2\times p$
15. end if
16. end for
17. return $d_{ARI}$
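
A direct Python transcription of Algorithm 2 might look like the sketch below\. Several details are assumptions made for illustration rather than facts from the paper: each relation is keyed by its \(from, to\) chunk indices and appears at most once, the labels of a pair are stored as a set, "mismatched labels" is read as the symmetric difference of the two label sets, and the function name `ari_discrepancy` is hypothetical\.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # (from_index, to_index) of a chunk pair


def ari_discrepancy(
    gold: Dict[Pair, Set[str]],
    pred: Dict[Pair, Set[str]],
    penalty: float,
) -> float:
    """Discrepancy between gold and predicted relation pairs."""
    d = 0.0
    # First loop: walk over predicted pairs.
    for key, pred_labels in pred.items():
        gold_labels = gold.get(key)
        if gold_labels is not None:
            # Indices match: count labels present on only one side.
            d += len(gold_labels ^ pred_labels)
        else:
            # Predicted pair does not exist in the ground truth.
            d += len(pred_labels) + 2 * penalty
    # Second loop: gold pairs that were never predicted.
    for key, gold_labels in gold.items():
        if key not in pred:
            d += len(gold_labels) + 2 * penalty
    return d
```

With this formulation, a wrong index costs more than a wrong label on a correctly identified pair, which matches the stated intent of pushing the model toward accurate pair identification\.
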

The discrepancy dynamics during training for AES are presented in Figure [2](https://arxiv.org/html/2605.17247#S5.F2), while those for ACD and ARI are illustrated in Figure [4](https://arxiv.org/html/2605.17247#A4.F4) and Figure [5](https://arxiv.org/html/2605.17247#A4.F5)\.

## Appendix E Details of Dataset

In this work, we utilize CEAMC, ArGPT, AEE, and ASAP 2\.0 to evaluate the performance of TIDE\. For ASAP 2\.0, given its large scale, we randomly shuffle the data and select 10% of the samples for our experiments\. For AEE, we follow the train\-test split of the original paper\. For the other datasets, we randomly shuffle the data using a fixed seed of 42, and then split each dataset into 60% for training and 40% for evaluation to ensure sufficient evaluation samples\.
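
The seeded shuffle and 60/40 split described above could be reproduced along these lines\. This is only a sketch: the paper does not state which random number generator or rounding rule was used, so the use of Python's `random.Random` and rounding the cut point down are assumptions, and `shuffle_and_split` is a hypothetical helper name\.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffle_and_split(
    samples: Sequence[T],
    train_ratio: float = 0.6,  # 60% training, 40% evaluation
    seed: int = 42,            # fixed seed reported in the paper
) -> Tuple[List[T], List[T]]:
    """Shuffle with a fixed seed, then cut into train and evaluation parts."""
    items = list(samples)               # copy so the input is not mutated
    random.Random(seed).shuffle(items)  # local RNG keeps global state intact
    cut = int(len(items) * train_ratio)  # assumption: round down
    return items[:cut], items[cut:]
```

Using a local `random.Random` instance means the split stays the same across runs regardless of any other random calls made elsewhere in a pipeline\.
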

### E\.1 CEAMC

CEAMC Ren et al\. \([2025](https://arxiv.org/html/2605.17247#bib.bib28)\) includes 226 Chinese argumentative essays written by high school students\. These essays range from 557 to 1,101 tokens, with an average of 829\.82 tokens\. There are 4,726 discourse units in total, each of which has an argument component category among MajorClaim, Claim, Restated Claim, Fact, Anecdote, Quotation, Proverb, Axiom, Elaboration, and Others; this is the label to be predicted in the ACD task\. For ARI, the dataset defines 14 fine\-grained categories: Positive, Negative, Comparative, Example, Citation, Metaphorical, Hypothetical, Restatement, Detail, Background, Coherence, Progression, Contrast, and Concession\. A total of 4,837 relations appear among the chunks\. Following Ren et al\. \([2024](https://arxiv.org/html/2605.17247#bib.bib27)\), we categorize the original score data into five levels, corresponding to scores from 1 to 5\.

The original paper mainly reported the performance of pretrained models such as RoBERTa Liu et al\. \([2019](https://arxiv.org/html/2605.17247#bib.bib21)\) and Longformer Beltagy et al\. \([2020](https://arxiv.org/html/2605.17247#bib.bib4)\), and of fine\-tuned LLMs such as ChatGLM Zeng et al\. \([2024](https://arxiv.org/html/2605.17247#bib.bib41)\) and Qwen Bai et al\. \([2023](https://arxiv.org/html/2605.17247#bib.bib2)\)\. In this work, we mainly discuss how TIDE helps improve performance via prompt optimization\.

### E\.2 AEE

The AEE dataset focuses on analyzing argumentative essays by annotating the argument components and relations within them\. The dataset contains over 400 student\-written argumentative essays\. We include this dataset mainly because its division of components \(Major Claim, Claim and Premise\) and relations \(Attack and Support\) is closer to conventional argument mining\.

### E\.3 ASAP 2\.0

ASAP 2\.0 is a newly updated successor to ASAP, the representative dataset for automated essay scoring\. It contains about 24,000 student\-written argumentative essays, aligned to the latest standards for student\-appropriate assessments\. It also includes samples across economic and geographic populations to mitigate potential algorithmic bias\.

### E\.4 ArGPT

The ArGPT dataset primarily targets the evaluation of argument quality in texts generated by ChatGPT\. It consists of 168 argumentative essays, carefully constructed through simulated student\-professor dialogues to elicit diverse argument structures\. We included this dataset with the aim of providing insights for applications that utilize LLM\-based evaluation to give a score\. However, in comparison to CEAMC, the definitions of argument components \(Major Claim and Premise\) and their relations \(Attack and Support\) are relatively simplistic\. Therefore, we primarily utilize the AES task from this dataset to evaluate the effectiveness of TIDE\.

Figure 6: Length dynamic during iteration for AES \(x\-axis: Iteration; y\-axis: Length\)\.

Figure 7: Length dynamic during iteration for ACD \(x\-axis: Iteration; y\-axis: Length\)\.

Figure 8: Length dynamic during iteration for ARI \(x\-axis: Iteration; y\-axis: Length\)\.

## Appendix F Output Samples

In this section, we present the final outputs from TIDE in Table [10](https://arxiv.org/html/2605.17247#A6.T10), Table [11](https://arxiv.org/html/2605.17247#A6.T11), and Table [12](https://arxiv.org/html/2605.17247#A6.T12)\. After iterations of refinement, the criteria grow longer and include more details for each category than the original ones, mainly because different features are learned through training, as shown in Figure [6](https://arxiv.org/html/2605.17247#A5.F6), Figure [7](https://arxiv.org/html/2605.17247#A5.F7), and Figure [8](https://arxiv.org/html/2605.17247#A5.F8)\. In addition, as shown in Table [12](https://arxiv.org/html/2605.17247#A6.T12), Guider learns to develop a quantification principle during the iterative process in ARI, which enables the model to better adapt to the task\. This is further demonstrated in Table [9](https://arxiv.org/html/2605.17247#A6.T9)\. This capability allows Solver to more effectively distinguish between different categories, thereby enhancing overall performance in ARI\.

Table 9: Case study of ARI, where Guider learned to develop a principle of quantification during iteration\.

Five\-Dimensional and Five\-Level Scoring System for High School Argumentative Essays \(1\-5 Points\)

- Innovation of Thesis \(Weight: 8%\)
  - 5 Points: Dual\-dimension comparison including cultural symbol comparison \(cross\-regional/generational\) \+ implicit opposition ≥ 2 groups \(including at least 1 group of philosophical/cultural conflict\) \+ at least 1 example from after 2015 \(philosophical examples may be used as substitutes\)\.
  - 4 Points: Single\-dimension analysis with implicit opposition ≥ 1 group \(clearly specifying the type of opposition\) \+ allowing classical philosophical examples to replace modern examples\.
  - 3 Points: Single\-case argumentation \+ implicit opposition ≥ 1 group \(philosophical attributes must be clearly specified\)\.
- Effectiveness of Argumentation \(Weight: 38%\)
  - 5 Points: Three\-tiered progressive structure \(phenomenon\-essence\-value\) \+ cross\-temporal and cross\-spatial case corroboration ≥ 2 groups \(spanning at least 2 different fields or time periods\) \+ argumentation layers ≥ 4 \(including at least 1 layer of counter\-proof\)\.
  - 4 Points: Two\-tiered progression \+ cross\-era case comparison \(time span ≥ 5 years\) \+ argumentation layers ≥ 3 \(supported by data or philosophical reasoning\)\.
- Quality of Evidence \(Weight: 20%\)
  - 5 Points: Case analysis ≥ 100 words \(including deconstruction of contradictions\) \+ positive\-to\-negative ratio ≥ 1\.2:1 \+ at least 1 citation from academic literature or philosophical classics\.
  - 4 Points: Case analysis ≥ 80 words \(including deconstruction of symbols\) \+ positive\-to\-negative ratio ≥ 1:1 \+ allowing philosophical examples to replace literature citations\.
- Precision of Expression \(Weight: 14%\)
  - 5 Points: Composite rhetorical density ≥ 0\.7 types per 100 words \(including at least 1 type of rhetorical nesting\) \+ logical density ≥ 7% \(including at least 3 types of logical connectors\)\.
  - 4 Points: Composite rhetorical density ≥ 0\.5 types per 100 words \+ logical density ≥ 5% \(including comparative or progressive structures\)\.
- Dialectical Power of Values \(Weight: 20%\)
  - 5 Points: Three\-dimensional value model \(individual\-group\-civilization\) \+ at least 2 specific measures in the feasibility plan \(including the executing body\)\.
  - 4 Points: Two\-way value model \+ theoretical implementation framework \(must include the path of value transformation\)\.
- Advanced Standards:
  - Argumentation on philosophical/cultural conflicts can be counted as 1\.5 groups of ordinary oppositions \(complete analysis of the nature of the conflict is required\)\.
  - Citations from classical philosophical works can replace one group of case analysis \(text source must be indicated\)\.
- Review Mechanism:
  - In\-depth analysis of philosophical cases \(≥ 150 words\) can exempt the requirement for literature citations\.
  - A complete construction of the value model \(reaching 4 points\) can compensate for deficiencies in the feasibility plan\.

Table 10: The refined criteria via TIDE for AES\.

The major claim is the overarching core judgment of the entire essay\. It must be unique and global in scope, and must appear in the introductory or concluding paragraph \(if it appears in a body paragraph, it must meet the following conditions: the judgment must run through all supporting arguments, must not be overturned by subsequent discussion, and must be explicitly echoed in the introduction or conclusion\)\. It must be a complete, independent evaluative judgment sentence \(it must include an explicit or implicit assertive term such as ’should/must/is/better than/indispensable’\)\. If in the introduction it appears as a compound sentence formed through a concessive\-turning structure that negates the opposing viewpoint but contains multiple predicate cores without forming a unifying assertion, it shall be downgraded to a supporting argument\. If the conclusion uses extended metaphors, the core vocabulary must directly correspond semantically with that in the introduction, and the core predicate structure must allow for semantic equivalence without introducing new\-dimensional predicates\. Exclusions: The concessive clause at the beginning of a compound sentence in the introduction; Transitional sentences that only define or describe a phenomenon without forming a complete value judgment\. Addition: If new concepts appear in the conclusion, the core predicate structure must remain strictly consistent with or semantically aligned with the introduction, and any appended solution path must constitute a synonymous transformation of the core assertion\. If the core predicate undergoes an equivalent transformation \(e\.g\., triple negation, implicit assertive terms like ‘indispensable’\) and maintains core concept correspondence, it is still considered a major claim\. However, if the operational path description does not form an independent judgment sentence, it is classified as elaboration\. 
Supplement: If the conclusion introduces a new predicate dimension \(e\.g\., ’facilitate’\) or fails to form strict correspondence with the introduction’s core predicate, it is downgraded to elaboration\. Reinforcing the original assertion with adverbs of degree \(e\.g\., ’especially should’\) is allowed as long as the core predicate remains unchanged\. Supporting Argument Supplement: Must propose an alternative judgment through causal analysis and directly support the major claim\. It must be an independent judgment sentence with clear assertive vocabulary \(including implicit terms like ’can/need to’ and negative assertion sentences following rhetorical questions, e\.g\., ’Heightened tension can compel focus’\)\. Excluded: Negative judgments that merely describe harmful phenomena without offering alternative assertions\. Valid forms include: Negative comparison judgments that establish new assertions \(the main clause must contain assertive terms like ’should/need to,’ directly support the major claim, and form a complete causal chain\)\. New dimensional assertions introduced through definition \(e\.g\., ’Tension is the lock of the soul’\) must contain assertive predicates or form an explicit logical link to the major claim\. Compound sentences led by concessive conjunctions \(e\.g\., ’Indeed…can demonstrate benefits’\) that substantively support the major claim are allowed\. Exclusions: Summary compound sentences used only for paragraph transition; Mechanism descriptions that do not establish an alternative assertion\. New: Rhetorical\-question\-led negative assertions that directly support the major claim and establish a causal chain must include explicit assertive terms or a complete causal chain to qualify as supporting arguments\. Conclusions drawn from cited research data that directly establish a causal chain and support the major claim are still supporting arguments\. 
Compound sentences that negate extreme interpretations, redefine concepts, and directly support the major claim \(e\.g\., ’Tension does not mean everything must be done hastily’\) are valid\. Causal analysis must contain alternative assertions, not merely explain mechanisms; use of implicit assertive terms \(e\.g\., ’rely on’\) is allowed if a complete causal chain is formed\. Elaboration Additions: Background descriptions introducing opposing viewpoints in the introduction; Transitional harm descriptions\. Supplement: New dimensions proposed through definitions without assertive predicates; Phenomenon\-based causal analyses \(e\.g\., ’the root cause’\) and mechanism explanations; Natural analogies used to illustrate group\-level universal patterns supporting the argument are classified as fact; Transitional rhetorical questions not providing background or structural explanation fall under “Other”; Operational path descriptions that do not form independent judgment sentences are elaboration; Group\-level social background phenomenon\-based causal analysis is classified as fact; Rhetorically asked causal analyses that only explain a phenomenon without forming alternative assertions are elaboration; Sentences merely describing research data sources without forming causal chains are elaboration; Compound sentences listing group phenomena without forming complete causal chains are elaboration\.

Table 11: The refined criteria via TIDE for ACD\.

The classification is based on semantic functions and logical structure, combined with the categorical attributes of argument components\. Features are as follows:

- Example\-based Argumentation:
  - Must include <Historical Example\> or <Famous Quotation\> component markers
  - Historical examples must fully present the cause\-action\-effect causal chain \(Cross\-sentence combinations require stage completeness markers \+ ≥5 logical inference words\)
  - Newly added exclusion criteria:
    - e\) Case listings that do not directly correspond to the core elements of the sub\-argument \(element mapping degree <85% and inference words <5\)
    - f\) Quotations that fail to form a complete argumentative chain \(must meet element mapping degree ≥90% and include ≥5 inference words\)
    - g\) Historical examples missing any stage of cause\-action\-effect
- Positive Argumentation:
  - Must meet dual conditions:
    - a\) Explicit transition words \(e\.g\., “therefore,” “thus”\) \+ ≥7 logical inference words
    - b\) Implicit logical inference words ≥8 and sub\-argument element coverage ≥95%
  - Logical leap degree must be ≥7\.0 and agent/patient matching weight ≥45% \(Conclusion statements are exempt from inference word count limits\)
  - Newly added exclusion criteria:
    - e\) Only contains explicit transition words but inference words <7
    - f\) Argumentative paragraphs with contrastive or progressive markers
- Negative Argumentation:
  - Must meet at least five antagonistic dimensions \(semantic/structural/agent/emotional/logical/contextual\) and include ≥4 negation markers
  - Comparative sentence structures must contain ≥5 negation markers or contrastive words, and antagonistic dimension matching degree ≥85%
  - Newly added exclusion criteria:
    - c\) Surface\-level negation lacking a contradictory focus \(antagonistic dimension match <85%\)
    - d\) Contains only a single comparison dimension and no negation markers
- Refinement Relationship:
  - Must simultaneously meet:
    - a\) Structural differentiation ≥85% and added information ≥70%
    - b\) Explicit/implicit transition words \(including “thereby,” etc\.\) \+ core element expansions ≥5 items
    - c\) Semantic role matching with prior argument ≥85% \(Excludes pure repetition or supplementary explanation\)
  - Newly added exclusion criteria:
    - d\) Inter\-sentence core element repetition ≥20%
- Restatement Relationship:
  - Core element repetition ≥98% \+ summary marker \+ structural differentiation ≤3%
  - Newly added exclusion criteria:
    - c\) Summary sentences containing progressive or contrastive markers
    - d\) Injection of new argument components \(≥1 new dimension added\)
    - e\) Sub\-argument element coverage <99%
- Progressive Relationship:
  - Explicit progression must have strength level ≥4 \(e\.g\., “even more,” “especially,” “furthermore,” “particularly necessary”\)
  - Implicit progression must meet:
    - a\) Logical chain element retention ≥95% \+ ≥5 new dimensions
    - b\) Hierarchical progression must be adjusted by semantic shift weight \(Interrogative sentences counted as \+2\.0 intervals\)
- Parallel Relationship:
  - Core element repetition ≥90% and structural similarity ≥95%
  - Sub\-argument element coverage ≥98%
  - Newly added information ≤10%
- Compound Relationship Processing Rules:
  - Refinement relationship takes precedence over progression if ≥5 expansion dimensions are met
  - Quotation\-based argumentation must verify that famous quotations match sub\-argument mapping ≥90% and contain ≥5 inference words
  - Dual labels are allowed only if all core conditions of both types are met and contradiction dimensions ≤2
- General Adjustments:
  - Interval calculation increases semantic shift weight \(progressive markers count as \-1\.0 interval\)
  - Element mapping requires agent/patient matching weight ≥40% \+ semantic role matching degree ≥75%

Table 12: The refined criteria via TIDE for ARI\.
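The refined ARI criteria are written as threshold rules that the LLM applies in natural language. As a reading aid only, the Restatement Relationship rule can be expressed as a small predicate. This is a minimal sketch under stated assumptions: the paper does not describe computing these quantities programmatically, and the `PairFeatures` container, its field names, and the `is_restatement` function are hypothetical. Only the thresholds come from the criteria above.

```python
from dataclasses import dataclass


@dataclass
class PairFeatures:
    """Hypothetical, pre-computed features for a pair of argument units.

    All ratios are fractions in [0, 1]. How they would be measured is not
    specified in the paper; here they are simply taken as inputs.
    """
    core_repetition: float             # share of core elements repeated
    has_summary_marker: bool           # a summary marker is present
    structural_differentiation: float  # structural difference between units
    has_progressive_or_contrastive_marker: bool
    new_dimensions: int                # count of newly injected dimensions
    sub_argument_coverage: float       # coverage of sub-argument elements


def is_restatement(f: PairFeatures) -> bool:
    """Apply the Restatement Relationship thresholds from the ARI criteria."""
    # Core condition: repetition >= 98%, summary marker, differentiation <= 3%.
    meets_core = (
        f.core_repetition >= 0.98
        and f.has_summary_marker
        and f.structural_differentiation <= 0.03
    )
    # Exclusion criteria c), d), e) from the same entry.
    excluded = (
        f.has_progressive_or_contrastive_marker
        or f.new_dimensions >= 1
        or f.sub_argument_coverage < 0.99
    )
    return meets_core and not excluded
```

Writing the rule this way makes the interaction between the core condition and the exclusion criteria explicit: a pair that satisfies every threshold is still rejected as soon as one exclusion applies.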
## Appendix G Prompt Templates

In Table [13](https://arxiv.org/html/2605.17247#A7.T13) we present the prompt templates used in AES, specifically when conducting prediction via the Solver and updating criteria via the Guider, while Table [14](https://arxiv.org/html/2605.17247#A7.T14) presents the templates used in ACD and ARI\.

Table [15](https://arxiv.org/html/2605.17247#A7.T15) shows the prompt template used for both debating and explanation generation, where the standards are defined according to the specific task\. In particular, we align the arguments with the annotation standards provided in CEAMC\.

For each task, we begin by inserting the task name and description into the template, along with task\-specific labels \(for ACD and ARI\)\. During prediction, the input essay from the training batch is filled in\. For updating criteria, we provide the current criteria, predictions, and ground truth\. For the debate stage, we assign the explanation generated by the LLM to the gold side, and the explanation from predictions to the opposing side\.
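The filling procedure described above can be sketched as follows. This is an illustrative sketch only: the template wording, the placeholder names, and the `build_*` helper functions are assumptions made for the example and are not the templates from Tables 13 to 15. It uses Python's standard `string.Template` for substitution.

```python
from string import Template

# Hypothetical templates; the real wording is given in the paper's tables.
PREDICT = Template(
    "Task: $task_name\n"
    "Description: $task_description\n"
    "${label_line}"
    "Criteria:\n$criteria\n"
    "Essay:\n$essay\n"
)
UPDATE = Template(
    "Task: $task_name\n"
    "Current criteria:\n$criteria\n"
    "Predictions: $predictions\n"
    "Ground truth: $ground_truth\n"
)
DEBATE = Template(
    "Standards:\n$standards\n"
    "Gold side: $gold_explanation\n"
    "Opposing side: $predicted_explanation\n"
)


def build_predict_prompt(task_name, task_description, criteria, essay, labels=None):
    """Fill the prediction template; labels are only supplied for ACD and ARI."""
    label_line = f"Labels: {', '.join(labels)}\n" if labels else ""
    return PREDICT.substitute(
        task_name=task_name,
        task_description=task_description,
        label_line=label_line,
        criteria=criteria,
        essay=essay,
    )


def build_update_prompt(task_name, criteria, predictions, ground_truth):
    """Fill the criteria-update template with current criteria and outcomes."""
    return UPDATE.substitute(
        task_name=task_name,
        criteria=criteria,
        predictions=predictions,
        ground_truth=ground_truth,
    )


def build_debate_prompt(standards, gold_explanation, predicted_explanation):
    """Place the gold-label explanation on the gold side and the
    prediction's explanation on the opposing side."""
    return DEBATE.substitute(
        standards=standards,
        gold_explanation=gold_explanation,
        predicted_explanation=predicted_explanation,
    )
```

For AES the `labels` argument is simply omitted, so the label line disappears from the prompt; for ACD and ARI the task-specific labels are passed as a list.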

Table 13: Prompt templates for updating criteria and conducting prediction accordingly for AES\.

Table 14: Prompt templates for updating criteria and conducting prediction for ACD and ARI\.

Table 15: Prompt templates for debating\.