Counterargument for Critical Thinking as Judged by AI and Humans

arXiv cs.CL Papers

Summary

This study investigates the use of student-written counterarguments to AI-generated content to foster critical thinking in an educational context, and finds that frontier LLMs can evaluate such submissions with moderate agreement to human assessors.

arXiv:2605.05353v1 Announce Type: new Abstract: This intervention study investigates the use of counterarguments in writing for critical thinking by students in the context of Generative AI (GenAI). This is especially relevant because risks of cheating and cognitive offloading exist with the use of GenAI. We presented 36 students in a particular university course with 4 carefully selected thesis statements (from a set of popular debates) and asked them to write about any one of them. We used six established rubrics (focus, logic, content, style, correctness and reference) to conduct three human assessments (two student peer-reviews and one experienced teacher) per writeup on a 5-point Likert scale for all the qualified samples (n) of 35 submissions (after disqualifying one for irregularity). Using the same rubrics and guidelines, we also assessed the submissions using six frontier LLMs as judges. Our mixed-method design included qualitative open-ended feedback per assessment and quantitative methods. The results reveal that (1) the students' self-written counterarguments to AI-generated content contain logic, among other things, which is a key component of critical thinking, and (2) GenAI can be successfully used at scale to assess students' written work, based on clear rubrics, and these assessments generally align with human assessments, as shown by Gwet's AC2 inter-rater reliability values of 0.33 for all the models except one.

Cached at: 05/08/26, 06:25 AM

# Counterargument for Critical Thinking as Judged by AI and Humans
Source: [https://arxiv.org/html/2605.05353](https://arxiv.org/html/2605.05353)
Tosin Adewumi\*, Marcus Liwicki, Foteini Simistira Liwicki, Lama Alkhaled, Hamam Mokayed, Esra Sümer-Arpak

Machine Learning Group, EISLAB, Luleå University of Technology, Sweden. firstname.lastname@ltu.se

###### Abstract

This intervention study investigates the use of counterarguments in writing for critical thinking by students in the context of Generative AI (GenAI). This is especially relevant because risks of cheating and cognitive offloading exist with the use of GenAI. We presented 36 students in a particular university course with 4 carefully selected thesis statements (from a set of popular debates) and asked them to write about any one of them. We used six established rubrics (focus, logic, content, style, correctness and reference) to conduct three human assessments (two student peer-reviews and one experienced teacher) per writeup on a 5-point Likert scale for all the qualified samples (n) of 35 submissions (after disqualifying one for irregularity). Using the same rubrics and guidelines, we also assessed the submissions using six frontier LLMs as judges. Our mixed-method design included qualitative open-ended feedback per assessment and quantitative methods. The results reveal that (1) the students’ self-written counterarguments to AI-generated content contain logic, among other things, which is a key component of critical thinking, and (2) GenAI can be successfully used at scale to assess students’ written work, based on clear rubrics, and these assessments generally align with human assessments, as shown by Gwet’s AC2 inter-rater reliability values of 0.33 for all the models except one.

###### keywords:

Counterargument, Argument-Based Learning, Critical Thinking, AI

Journal: Open

## 1 Introduction

Education is evolving rapidly, particularly its pedagogy (Zou et al., [2025](https://arxiv.org/html/2605.05353#bib.bib50)), the engagement and evaluation of students (Ennis and Weir, [1985](https://arxiv.org/html/2605.05353#bib.bib14); Halpern, [2006](https://arxiv.org/html/2605.05353#bib.bib20)), and the end-to-end planning of learning (Romiszowski, [2016](https://arxiv.org/html/2605.05353#bib.bib43)). This is thanks to Generative AI (GenAI), particularly large language models (LLMs), which have many benefits (Adewumi et al., [2025d](https://arxiv.org/html/2605.05353#bib.bib5); Mulaudzi and Hamilton, [2025](https://arxiv.org/html/2605.05353#bib.bib38)). Recent studies, however, have shown that people who used ChatGPT or similar tools to write essays demonstrated less brain activity in areas linked with cognitive processing or reported a negative impact on their skills (Kosmyna et al., [2025a](https://arxiv.org/html/2605.05353#bib.bib27); Helal et al., [2025](https://arxiv.org/html/2605.05353#bib.bib21)). This behaviour of shifting cognitive tasks to GenAI is termed cognitive offloading and may lead to cognitive atrophy, where one’s abilities become worse (Kosmyna et al., [2025a](https://arxiv.org/html/2605.05353#bib.bib27); Gerlich, [2025a](https://arxiv.org/html/2605.05353#bib.bib17)). Given that GenAI is most likely here to stay, it is essential to find ways for students to engage with it that encourage critical thinking, thereby boosting their skills and learning.

Critical thinking is a purposeful, reflective meta-cognitive process that increases the possibility of forming a robust and logical conclusion to an argument or finding a solution to a problem (Facione, [1990](https://arxiv.org/html/2605.05353#bib.bib15); Ku, [2009](https://arxiv.org/html/2605.05353#bib.bib29); Lai, [2011](https://arxiv.org/html/2605.05353#bib.bib32); Dwyer et al., [2014](https://arxiv.org/html/2605.05353#bib.bib12)). It emphasizes moving beyond passive information intake to examining evidence, identifying assumptions, comparing perspectives, and using logic to determine the strength of a position (Kuhn, [1991](https://arxiv.org/html/2605.05353#bib.bib30); Sinfield and Burns, [2023](https://arxiv.org/html/2605.05353#bib.bib44)). Meanwhile, a counterargument is a logical argument that offers a different viewpoint to an existing one, thereby constituting a debate (Fulan et al., [2025](https://arxiv.org/html/2605.05353#bib.bib16)). While there are differences of opinion on what the components of critical thinking are (Dwyer et al., [2014](https://arxiv.org/html/2605.05353#bib.bib12)), logic is widely seen as a fundamental ingredient (Adewumi et al., [2025a](https://arxiv.org/html/2605.05353#bib.bib1); Dwyer et al., [2014](https://arxiv.org/html/2605.05353#bib.bib12)).

Assessing critical thinking through open-ended formats, such as written counterarguments, is favored by many researchers because the format requires evaluating the logic of existing content. One example of an open-ended format is the Ennis–Weir Critical Thinking Essay Test (EWCTET) (Ennis and Weir, [1985](https://arxiv.org/html/2605.05353#bib.bib14); Lai, [2011](https://arxiv.org/html/2605.05353#bib.bib32)). However, objectively assessing students’ writing for critical thinking can be a challenging task for many reasons, including the differences of opinion on the details of critical thinking (Dwyer et al., [2014](https://arxiv.org/html/2605.05353#bib.bib12)). The Likert scale offers a possible solution and has been used by other researchers (Joshi et al., [2015](https://arxiv.org/html/2605.05353#bib.bib24); Alkharusi, [2022](https://arxiv.org/html/2605.05353#bib.bib6); Gerlich, [2025b](https://arxiv.org/html/2605.05353#bib.bib18)).

In this work, our motivation is to answer two research questions: (1) Do counterarguments to AI-generated arguments promote students’ critical thinking in writing, in terms of logic and relevant rubrics? and (2) How similar (or dissimilar) are GenAI systems to humans in judging counterarguments, given the same rubrics? These questions matter because the way student-written counterarguments to AI-generated arguments across diverse topics promote critical thinking is understudied in the literature, since argument-based learning appears under-utilized in pedagogy compared to traditional transmissive teaching methods (González et al., [2026](https://arxiv.org/html/2605.05353#bib.bib19)). We used a mixed research design involving qualitative and quantitative methods, with a sample size (n) of 35 submissions and the 5-point Likert scale for the rubrics. We publicly release the data artifacts (footnote 1: github.com/LTU-Machine-Learning/counterargument_ai). Our key contributions are:

1. We show that students’ self-written counterarguments to AI-generated content contain logic, among other things, which is a key component of critical thinking.
2. We show that GenAI can be successfully used to assess students’ written work (particularly counterarguments) based on clear rubrics and that its assessments generally align with human assessments, including expert review and students’ peer review.

The rest of this paper is organised as follows. In Section [2](https://arxiv.org/html/2605.05353#S2), we describe the theoretical framework in the literature. In Section [3](https://arxiv.org/html/2605.05353#S3), we fully describe the method of this study, including the research design, participants and the checks for AI-generated counterarguments. In Section [4](https://arxiv.org/html/2605.05353#S4), we present our findings using multiple charts. In Section [5](https://arxiv.org/html/2605.05353#S5), we discuss the implications of this study in connection to theory. Finally, in Section [6](https://arxiv.org/html/2605.05353#S6), we provide concluding remarks.

## 2 Theoretical Framework

We review the literature on some of the theoretical underpinnings of critical thinking, counterarguments, and GenAI assessment.

### 2.1 Models of Critical Thinking

Bloom et al. ([1956](https://arxiv.org/html/2605.05353#bib.bib8))’s taxonomy of educational objectives is one of many models of thinking applicable to critical thinking because it contains the analysis of knowledge (Dwyer et al., [2014](https://arxiv.org/html/2605.05353#bib.bib12)). The model has influenced many others, including Anderson and Krathwohl ([2001](https://arxiv.org/html/2605.05353#bib.bib7))’s revised taxonomy for learning, teaching, and assessing, Duron et al. ([2006](https://arxiv.org/html/2605.05353#bib.bib11))’s 5-step framework, and Romiszowski ([2016](https://arxiv.org/html/2605.05353#bib.bib43))’s course design using a systems approach.

The first layer of the model by Bloom et al. ([1956](https://arxiv.org/html/2605.05353#bib.bib8)) is knowledge, which relates not only to the specifics and terminologies of a content but also to knowledge about the ways of dealing with those specifics. The second layer is comprehension, involving explaining and summarizing learned information, while the third layer is application. The fourth layer is analysis of elements and how they relate to one another. The fifth layer is synthesis, which involves the production of a plan or new communication, while the sixth and final layer is evaluation. The Delphi panel (of 46 experts in critical thinking) agreed that analysis, evaluation and inference are core skills for critical thinking (Facione, [1990](https://arxiv.org/html/2605.05353#bib.bib15)) and they positively correlate (Dwyer et al., [2015](https://arxiv.org/html/2605.05353#bib.bib13)). The three skills form crucial components of the integrative critical thinking framework by Dwyer et al. ([2014](https://arxiv.org/html/2605.05353#bib.bib12)).

### 2.2 Counterargument as a Tool of Critical Thinking

Beyond their structural role in argumentation, counterarguments function as a critical tool in argument-based learning (e.g. argumentative writing and debate), serving to strengthen a primary argument by acknowledging, analyzing, and refuting opposing viewpoints (González et al., [2026](https://arxiv.org/html/2605.05353#bib.bib19)). Indeed, the definition of critical thinking is fully entrenched in the making of logical arguments (or counterarguments) (Dwyer et al., [2014](https://arxiv.org/html/2605.05353#bib.bib12)). In educational settings, exposure to contrasting viewpoints has been shown to stimulate conceptual change and promote integrative understanding, particularly when students are required to respond to those alternatives. This is the case either with the concept of argue to learn (i.e. facilitating the learning of field knowledge) or learn to argue (i.e. pedagogical tool for developing critical thinking skills) (Chi and Wylie, [2014](https://arxiv.org/html/2605.05353#bib.bib10); Nussbaum and Sinatra, [2003](https://arxiv.org/html/2605.05353#bib.bib39); González et al., [2026](https://arxiv.org/html/2605.05353#bib.bib19)).

In the context of GenAI, emerging research suggests that critical interaction with LLM outputs, such as questioning, revising, or challenging generated text, can support higher-order thinking when learners remain cognitively active (Kasneci et al., [2023](https://arxiv.org/html/2605.05353#bib.bib25)). Rather than accepting AI-generated responses at face value, students who formulate counterarguments must evaluate the adequacy, relevance, and coherence of the presented reasoning. The use of counterarguments, according to González et al. ([2026](https://arxiv.org/html/2605.05353#bib.bib19)), therefore offers significant benefits, pedagogically or otherwise, including:

1. It provides evidence of substantive engagement with the source material.
2. It signals that learning has taken place.
3. It develops students’ socio-emotional skills as they cultivate an attitude of mutual respect, collaboration and dialogical empathy.

For these reasons, the use of counterargument offers a measurable and theoretically grounded construct for assessing critical thinking in AI-mediated writing contexts.

### 2.3 Thinking Routines

Thinking routines, which are ways or procedures of thinking systematically, have been shown to develop critical thinking (Pinedo et al., [2018](https://arxiv.org/html/2605.05353#bib.bib40); Manurung et al., [2022](https://arxiv.org/html/2605.05353#bib.bib37)). Many thinking routines exist, e.g. See-Think-Wonder, the 4 C’s, and circle of viewpoints (Ritchhart et al., [2011](https://arxiv.org/html/2605.05353#bib.bib42)). Comparing them requires distinguishing between the cognitive processes underlying human reasoning and the procedural mechanisms used to evaluate it. In the context of argumentative writing, they refer to systematic practices through which individuals construct claims, evaluate evidence, engage opposing views, and justify conclusions. Examples of thinking routines relevant to argument-based learning (categorized into 3 parts) include the explanation game (for introducing and exploring ideas), connect-extend-challenge (for synthesizing and organising ideas), and what makes you say that (for digging deeper into ideas) (Ritchhart et al., [2011](https://arxiv.org/html/2605.05353#bib.bib42)). Ritchhart et al. ([2011](https://arxiv.org/html/2605.05353#bib.bib42)) advocated for flexibility in using the routines, as some examples may cut across multiple categories. In the era of GenAI, thinking routines may be instantiated differently across human and artificial agents. While students engage in argumentative writing through cognitive and metacognitive processes, LLMs generate structured arguments through probabilistic pattern recognition trained on large corpora.

### 2.4 Assessment and Rubrics

Ku ([2009](https://arxiv.org/html/2605.05353#bib.bib29)) argues that a multiple-choice response format alone is inadequate for revealing students’ underlying reasoning for an answer or their ability to think critically in unprompted situations, and therefore advocates for assessment that allows both multiple-choice and open-ended formats. Different critical thinking assessment tools exist, e.g. Halpern Critical Thinking Assessment Using Everyday Situations (HCTAES) (Halpern, [2006](https://arxiv.org/html/2605.05353#bib.bib20)) and Watson-Glaser Critical Thinking Appraisal (WGCTA) (Watson, [1980](https://arxiv.org/html/2605.05353#bib.bib47)), but they differ in their formats and contexts (Ku, [2009](https://arxiv.org/html/2605.05353#bib.bib29)).

Rubrics, as scoring guides for assessing specific components of a task (Yavuz et al., [2025](https://arxiv.org/html/2605.05353#bib.bib48); Ling, [2025](https://arxiv.org/html/2605.05353#bib.bib34)), play a central role in operationalizing abstract constructs such as critical thinking into observable and scorable criteria. By defining performance dimensions and scale descriptors, rubrics aim to enhance reliability, transparency, and alignment between learning objectives and assessment practices (Brookhart, [2013](https://arxiv.org/html/2605.05353#bib.bib9); Jonsson and Svingby, [2007](https://arxiv.org/html/2605.05353#bib.bib23)). However, the validity of a rubric depends not only on the selected dimensions but also on how clearly performance levels map onto those dimensions. Misalignment raises concerns about construct validity and interpretability, particularly when scales are applied uniformly across heterogeneous rubrics. Such challenges are especially relevant when comparing human and AI-based assessments. Therefore, careful rubric design becomes essential when the goal is to assess counterarguments and critical thinking rather than general writing proficiency.

### 2.5 AI Evaluation

The use of LLMs as evaluators has recently emerged as a promising approach for scalable and cost-efficient assessment. Rather than serving solely as text generators, frontier models have been shown to approximate human judgments in structured evaluation tasks when provided with explicit criteria and grading rubrics (Kocmi and Federmann, [2023](https://arxiv.org/html/2605.05353#bib.bib26); Zheng et al., [2023](https://arxiv.org/html/2605.05353#bib.bib49)). This paradigm, often referred to as “LLM-as-a-judge,” relies on prompting models to assess outputs according to predefined dimensions, sometimes achieving substantial agreement with expert raters. Such findings suggest that LLMs can operationalize assessment constructs when evaluation standards are clearly specified. AI is particularly useful for scaling assessment when the number of students is so large that human assessment becomes impractical or its quality suffers.

Regardless, there are still concerns related to validity, consistency, and sensitivity to deeper cognitive dimensions. For example, GenAI-based evaluation can be sensitive to prompt phrasing and rubric formulation, raising questions about robustness and replicability (Zheng et al., [2023](https://arxiv.org/html/2605.05353#bib.bib49)). It is also well-established that LLMs suffer from hallucinations and other misalignments (Adewumi et al., [2025b](https://arxiv.org/html/2605.05353#bib.bib3), [c](https://arxiv.org/html/2605.05353#bib.bib4), [2024](https://arxiv.org/html/2605.05353#bib.bib2)), despite GenAI alignment methods for ensuring that they adhere to human intentions and values (Li et al., [2026](https://arxiv.org/html/2605.05353#bib.bib33)). These limitations are especially important when evaluating counterarguments. A text may include a clearly written opposing view, but this does not necessarily mean that the writer has genuinely engaged with the issue or demonstrated deep critical thinking. For this reason, assessing the performance of an LLM-as-a-judge should go beyond simply comparing scores to examining whether the model’s chain-of-thought (CoT) reasoning is logical or reflects established theoretical definitions of critical thinking.

## 3 Materials and Methods

We combine both qualitative and quantitative research designs for a more insightful intervention study, given that each one presents a unique perspective to the study\. We perform Spearman’s correlation analysis for investigating correlation between key rubrics\. We provide additional details of the materials and mixed research design in the following subsections\.
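As an illustration of the correlation analysis mentioned above, the following is a minimal sketch of Spearman's rank correlation using only the Python standard library. It is not the authors' analysis code; the function names `average_ranks` and `spearman_rho` and the sample values are hypothetical, and tied values are assumed to receive the mean of their rank positions.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-assessor average scores for two rubrics (not the paper's data).
logic = [4.0, 3.0, 4.8, 4.5, 4.9]
correctness = [3.0, 3.2, 4.9, 4.6, 4.7]
print(round(spearman_rho(logic, correctness), 3))
```

Ranking first is what makes the measure suitable for ordinal Likert data: only the ordering of the scores matters, not the distance between them.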

### 3.1 Participants

The participants were 36 Master’s students of the Text Mining course of Luleå University of Technology, Sweden, for the 2025/26 calendar period (footnote 2: course code D7058E). The students are nationals of different countries and are mostly in the same age bracket, in their early twenties. The course is based on hybrid onsite-online delivery. Enrollment in the course meant that students became participants automatically because the counterargument task was designed as one of the assignments of the course. The students were awarded credits if the task was successfully completed. Completing the task involved submitting the LLM argument of their choice, their self-written counterargument, and 2 peer-reviews that were randomly assigned on the Canvas learning management system (LMS).

### 3.2 Instruments

Four topics \(or thesis statements\) of debate were selected after searching online for some modern scientific debates\. The selection was based on the first 4 results from reputable venues across diverse categories, including statistics, linguistics, population genetics, and education\. They include:

1. Statistically non-significant results indicate ‘no difference’. (Footnote 3: www.nature.com/articles/d41586-019-00857-9)
2. Humans are born with an innate capacity for language. (Footnote 4: www.simplypsychology.org/naturevsnurture.html)
3. The fate of mutations, which occurs randomly, is singularly governed by natural selection. (Footnote 5: www.tandfonline.com/doi/epdf/10.1080/00219266.2009.9656163?needAccess=true)
4. Pedagogy relates only to ways or methods of teaching. (Footnote 6: www.tandfonline.com/doi/full/10.1111/curi.12006)

The students were asked to prompt any LLM of their choice with one of the topics by starting with: “Write an argument for the following thesis statement”. Thereafter, without any use of GenAI, they were required to write a counterargument of at least 300 words to the AI-generated argument, with academic referencing, and to submit it after 6 days for assessment.

Six established rubrics were employed for assessment: Focus, Logic, Content, Style, Correctness, and References (Howard et al., [2012](https://arxiv.org/html/2605.05353#bib.bib22); Adewumi et al., [2025a](https://arxiv.org/html/2605.05353#bib.bib1)). Focus is important because Liu and Stapleton ([2020](https://arxiv.org/html/2605.05353#bib.bib35)) identified, in their study on argumentative writing, the problem of participants drifting from the topic. Each rubric was represented on a 5-point Likert scale, which provides ordinal data and is suitable for statistical analysis. Peer assessments were allocated randomly and anonymously on the Canvas learning management system, with 2 submissions assigned to each student. This helps to reduce potential bias in grading. In addition to the Likert scale, there was an optional free-text field for each student to provide open-ended feedback on the rationale for their scores. The experienced teacher, who is one of the authors, also provided expert assessment for each submission. The instruction for student pair assessment is given below.

> Score each writeup on a Likert scale: 1 (Strongly disagree), 2 (Disagree), 3 (Neither agree or disagree), 4 (Agree), and 5 (Strongly agree) for each of (1) focus, (2) logic, (3) content, (4) style, (5) correctness, and (6) evidence of peer-reviewed references.
>
> 1. Focus: that the text is focused on the given topic
> 2. Logic: that the statements therein are logically constructed
> 3. Valid Content: that there is sufficient content for an argument or position
> 4. Valid Style: that the style conforms with the academic or prescribed style of writing
> 5. Correct: that the argument or position is correct
> 6. Peer-Reviewed References: backed up by credible references

For cases in the results where averages of the scores are taken, we used the equivalents in Table[1](https://arxiv.org/html/2605.05353#S3.T1)\.

Table 1: Averaging Likert Values

### 3.3 AI Models and the System Prompt

Table [2](https://arxiv.org/html/2605.05353#S3.T2) identifies the LLMs used in the study. It includes state-of-the-art (SotA) closed commercial models and open models (footnote 7: https://huggingface.co/chat/). They were accessed through their user interfaces (UIs) with default hyperparameters. There were two cases (out of 35) in which DeepSeek predicted 0, which is not on the Likert scale, for Reference. In both cases, we adjusted the values to the lowest possible value on the Likert scale (1). The alternative would have been to drop all DeepSeek results from the calculation of the averages, but that would have lost valuable information; hence, we settled for adjusting the two anomalies. The prompt was engineered based on recommendations in the literature for getting the best outputs, including assigning a persona to the model, clarity, and specificity, among others (Tripathi et al., [2025](https://arxiv.org/html/2605.05353#bib.bib46)), and is publicly available (footnote 1: github.com/LTU-Machine-Learning/counterargument_ai). As a result, the instruction to the models has slightly more clarification compared to what humans would know by unstated assumption. The prompt is provided below.

Table 2: AI Models for Counterargument Assessment

> Assume the role of an expert educator, skilled in assessments. In the zipped file, each document has an argument (or thesis) and counterargument (or response). Score only each counterargument writeup on a Likert scale from 1 (Strongly disagree), 2 (Disagree), 3 (Neither agree or disagree), 4 (Agree), and 5 (Strongly agree) for each part of the rubric below. Optionally, give any comment about the counterargument writeup. Save the complete structured assessment in a downloadable Excel file.
>
> 1. Focus: that the counterargument is focused on the given topic (by comparing the counterargument with the original argument)
> 2. Logic: that the statements in the counterargument are logically constructed
> 3. Valid Content: that there is sufficient content (minimum of 300 words) for the counterargument
> 4. Valid Style: that the style conforms with the academic style of writing
> 5. Correct: that the counterargument is correct
> 6. Peer-Reviewed References: that the counterargument is backed up by credible references

For Gemini and the open models, the clause “In the zipped file” was omitted, and the last sentence was replaced with the following, because these models could not generate an Excel file as the final output: “Write out the results in a tabular form showing the provided filenames and the scores per file for all rubrics.”
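The adjustment of out-of-scale scores described in this section can be sketched as a small validation step applied to a judge's output before averaging. This is an illustrative sketch, not the authors' code: the row layout, the file names, and the function names `clamp_to_likert` and `clean_scores` are hypothetical. The study only reports raising two scores of 0 to 1; clamping values above 5 is an added assumption for symmetry.

```python
LIKERT_MIN, LIKERT_MAX = 1, 5


def clamp_to_likert(score):
    """Pull an out-of-scale score back to the nearest valid Likert value."""
    return max(LIKERT_MIN, min(LIKERT_MAX, score))


def clean_scores(rows):
    """Clamp every rubric score and report which (file, rubric) pairs changed."""
    cleaned, adjusted = [], []
    for row in rows:
        fixed = {"file": row["file"]}
        for rubric, score in row.items():
            if rubric == "file":
                continue
            fixed[rubric] = clamp_to_likert(score)
            if fixed[rubric] != score:
                adjusted.append((row["file"], rubric))
        cleaned.append(fixed)
    return cleaned, adjusted


# Hypothetical judge output; the 0 for "reference" mimics the anomaly described above.
raw = [
    {"file": "submission_01", "logic": 4, "reference": 0},
    {"file": "submission_02", "logic": 5, "reference": 3},
]
cleaned, adjusted = clean_scores(raw)
print(adjusted)  # [('submission_01', 'reference')]
```

Returning the list of adjusted entries alongside the cleaned rows keeps the correction auditable, so the anomalies can be reported rather than silently absorbed.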

### 3.4 AI Checks of Counterarguments

We used two popular AI-text-generation checkers on the students’ counterarguments: Grammarly’s AI detector and ZeroGPT (footnote 8: grammarly.com/ai-detector and zerogpt.com). We realized from the evaluations that these systems are not reliable or consistent for detecting AI-generated text (see Table [6](https://arxiv.org/html/2605.05353#A1.T6) in the appendix). For example, comparing each output in the two systems, the average, maximum, and minimum differences are 25.98%, 79.8% (or standard deviation of 39.9%), and 0, respectively. Furthermore, 16 cases have differences above 20%. As a result, we relied on the expert’s experience to determine which of the submissions were largely (or wholly) AI-generated, which turned out to be one of the 36 submissions. The student concerned confirmed this when asked. Hence, we left the submission out of this study.

### 3.5 Ethics Consideration

The students’ peer assessment was anonymized to reduce the possibility of bias in assessment among students. We also checked each counterargument submission for AI-generated content.
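The disagreement between the two detectors can be summarized with a few lines of Python. The sketch below is illustrative only: the percentages are hypothetical and do not reproduce Table 6, and the function name `detector_disagreement` and the 20-point threshold parameter are assumptions chosen to mirror the statistics quoted above.

```python
from statistics import mean


def detector_disagreement(scores_a, scores_b, threshold=20.0):
    """Summarize absolute differences between two detectors' AI-likelihood percentages."""
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    return {
        "mean": mean(diffs),
        "max": max(diffs),
        "min": min(diffs),
        "above_threshold": sum(d > threshold for d in diffs),
    }


# Hypothetical detector outputs (percent "AI-generated") for four submissions.
detector_a = [0.0, 12.0, 85.0, 40.0]
detector_b = [0.0, 30.0, 10.0, 45.0]
summary = detector_disagreement(detector_a, detector_b)
print(summary)
```

A large mean difference together with many cases above the threshold is the pattern that led the study to fall back on expert judgment.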

## 4 Results

We present the results of the study from multiple perspectives. For the AI-generated arguments obtained by students, ChatGPT 5.1 accounted for about 82% of the AI models used, while Copilot and ChatGPT 5.2 Thinking accounted for 12% and 6%, respectively. Figures [1](https://arxiv.org/html/2605.05353#S4.F1), [2](https://arxiv.org/html/2605.05353#S4.F2), and [3](https://arxiv.org/html/2605.05353#S4.F3) represent diverging stacked bar charts for the expert assessment, average student assessment, and the average AI assessment of the counterarguments, respectively. All three charts show that the rubrics for most counterarguments appear to the right of the scale, being Agree or Strongly Agree. There are strong similarities in the rubrics for the expert and the average student assessments, as well as the AI assessment, in some cases, while the similarities are weaker in others. The most important rubric, Logic, shows that the expert appears stricter in assessment (with only 10 as Strongly Agree) than both the average student and average AI.

![Refer to caption](https://arxiv.org/html/2605.05353v1/images/diverging.png)

Figure 1: Diverging stacked bar chart for Expert Assessment.

![Refer to caption](https://arxiv.org/html/2605.05353v1/images/diverging_pair.png)

Figure 2: Diverging stacked bar chart for Average Student Assessment.

![Refer to caption](https://arxiv.org/html/2605.05353v1/images/diverging_ai.png)

Figure 3: Diverging stacked bar chart for Average AI Assessment.

Figures [4](https://arxiv.org/html/2605.05353#S4.F4) and [5](https://arxiv.org/html/2605.05353#S4.F5) represent the bar charts of medians and modes, respectively, for the 3 types of assessments: expert, average student, and average AI. The median values for all rubrics are 4 and above, indicating Agree and Strongly Agree, while the modes are equally 4 and above, also indicating Agree and Strongly Agree.

![Refer to caption](https://arxiv.org/html/2605.05353v1/images/bar_medians.png)

Figure 4: Bar chart of medians.

![Refer to caption](https://arxiv.org/html/2605.05353v1/images/bar_mode.png)

Figure 5: Bar chart of modes.

Figures [6](https://arxiv.org/html/2605.05353#S4.F6), [7](https://arxiv.org/html/2605.05353#S4.F7), and [8](https://arxiv.org/html/2605.05353#S4.F8) represent the box and whisker plots for the expert, average student, and the average AI, respectively. The three box plots show more peculiarities per assessment or average assessment. For example, the plot for expert assessment shows that 4 of the rubrics have collapsed boxes, indicating the dominant value, and several outliers. The average student assessment has the most expanded boxes and fewer outliers compared to the other two. The plot for the average AI assessment has narrow boxes and a similar count of outliers to the plot for expert assessment.

![Refer to caption](https://arxiv.org/html/2605.05353v1/images/box_expert.png)

Figure 6: Box Plot of Expert Assessment.

![Refer to caption](https://arxiv.org/html/2605.05353v1/images/box_students.png)

Figure 7: Box Plot of Average Pair Assessment.

![Refer to caption](https://arxiv.org/html/2605.05353v1/images/box_ai.png)

Figure 8: Box Plot of Average AI Assessment.

### Additional Analysis

Tables [3](https://arxiv.org/html/2605.05353#S4.T3) and [4](https://arxiv.org/html/2605.05353#S4.T4) present average values of all rubrics for all submissions and for the 2 largest categories of submissions (linguistics and education), respectively. The Gwet’s AC2 inter-rater reliability values for all the models are 0.33, except 0.17 for LLaMA, which are all lower than students’ agreement values with the expert. We used Gwet’s AC2 because it handles concerns about the paradoxes of Cohen’s Kappa for measuring inter-rater agreement. Furthermore, using the average values from Table [3](https://arxiv.org/html/2605.05353#S4.T3) to perform Spearman’s correlation analysis of key rubrics, we observe a very strong positive monotonic correlation between logic and focus (0.849) and between logic and correctness (0.919). A similar observation is apparent when we consider only the six LLMs: we observe a very strong positive monotonic correlation between logic and focus (0.882) and between logic and correctness (0.893). The table also shows that the 2 ChatGPT 5.2 models have the largest average difference between the two correlated rubrics Logic and Correctness, though one would have expected a smaller difference, as with the others. ChatGPT 5.1 deviates most from the others (with an average of 3, Neither agree or disagree) on the most important rubrics.
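For readers unfamiliar with Gwet's AC2, the following is a minimal two-rater sketch. It assumes the commonly cited form AC2 = (p_a - p_e) / (1 - p_e), where p_a is the weighted observed agreement and p_e = T_w / (q(q - 1)) · Σ π_k(1 - π_k), with T_w the sum of all weights and π_k the average marginal proportion of category k. The paper does not state which weights or how many raters entered each comparison, so the linear weights, the function name `gwet_ac2`, and the sample ratings are all assumptions.

```python
def gwet_ac2(ratings_a, ratings_b, categories=(1, 2, 3, 4, 5)):
    """Two-rater Gwet's AC2 with linear weights (an assumed weighting scheme)."""
    q = len(categories)
    n = len(ratings_a)
    index = {c: i for i, c in enumerate(categories)}

    # Linear agreement weights: 1 on the diagonal, shrinking with category distance.
    w = [[1 - abs(k - l) / (q - 1) for l in range(q)] for k in range(q)]

    # Joint proportions: p[k][l] is the share of items rated k by A and l by B.
    p = [[0.0] * q for _ in range(q)]
    for a, b in zip(ratings_a, ratings_b):
        p[index[a]][index[b]] += 1 / n

    # Weighted observed agreement.
    p_a = sum(w[k][l] * p[k][l] for k in range(q) for l in range(q))

    # Average of the two raters' marginal proportions per category.
    pi = [(sum(p[k]) + sum(p[r][k] for r in range(q))) / 2 for k in range(q)]

    # Chance agreement under Gwet's model.
    t_w = sum(sum(row) for row in w)
    p_e = t_w / (q * (q - 1)) * sum(x * (1 - x) for x in pi)

    return (p_a - p_e) / (1 - p_e)


# Hypothetical Likert scores from an expert and one LLM judge (not the paper's data).
expert = [4, 4, 5, 3, 4, 5]
judge = [4, 5, 5, 4, 4, 3]
print(round(gwet_ac2(expert, judge), 3))
```

Because the chance-agreement term depends on how evenly ratings spread across categories rather than on the product of marginals, the coefficient stays stable when most scores cluster at Agree or Strongly Agree, which is the situation where Cohen's Kappa behaves paradoxically.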

Table 3: Average Values of All Rubrics Across AI and Humans

Table 4: Average Values of All Rubrics for 2 Topics Across AI and Humans

| No | Entity | Focus | Logic | Content | Style | Correctness | Reference |
|---|---|---|---|---|---|---|---|
| | **Linguistics \(n = 14\)** | | | | | | |
| 1 | ChatGPT 5\.2 Flagship | 3\.643 | 4 | 5 | 3 | 3 | 4\.143 |
| 2 | ChatGPT 5\.2 Instant | 4 | 4 | 5 | 4 | 3 | 2\.929 |
| 3 | ChatGPT 5\.1 Instant | 3 | 3 | 4\.714 | 3 | 3 | 3\.786 |
| 4 | Gemini 3 Thinking | 5 | 4\.786 | 4 | 4\.5 | 4\.857 | 4\.714 |
| 5 | DeepSeek V3\.1 | 4\.857 | 4\.571 | 4\.786 | 4\.5 | 4\.571 | 4\.071 |
| 6 | LLaMA4\-Maverick 17B | 4\.929 | 4\.929 | 5 | 4\.786 | 4\.643 | 4\.5 |
| 7 | Expert \(Teacher\) | 5 | 4\.357 | 3\.929 | 4\.214 | 4\.214 | 3\.857 |
| 8 | Student 1 | 4\.571 | 4\.143 | 4\.143 | 3\.929 | 4\.286 | 3\.714 |
| 9 | Student 2 | 4\.786 | 4\.286 | 4\.429 | 3\.929 | 4\.286 | 3\.857 |
| | **Education \(n = 13\)** | | | | | | |
| 1 | ChatGPT 5\.2 Flagship | 3\.154 | 4 | 5 | 3 | 3 | 4\.308 |
| 2 | ChatGPT 5\.2 Instant | 4 | 4 | 4\.769 | 4 | 3 | 2\.923 |
| 3 | ChatGPT 5\.1 Instant | 3 | 3 | 4\.077 | 3 | 3 | 3\.769 |
| 4 | Gemini 3 Thinking | 5 | 4\.692 | 4\.231 | 4\.615 | 4\.692 | 4\.615 |
| 5 | DeepSeek V3\.1 | 4\.846 | 4\.462 | 4\.769 | 4\.462 | 4\.538 | 4\.077 |
| 6 | LLaMA4\-Maverick 17B | 4\.769 | 4\.769 | 5 | 4\.769 | 4\.769 | 4\.154 |
| 7 | Expert \(Teacher\) | 4\.846 | 4\.231 | 3\.538 | 4 | 4 | 3\.846 |
| 8 | Student 1 | 4\.769 | 4\.385 | 4\.385 | 4\.154 | 4\.692 | 4\.308 |
| 9 | Student 2 | 4\.692 | 4\.231 | 4 | 3\.769 | 4 | 3\.923 |

Figure [9](https://arxiv.org/html/2605.05353#S4.F9) represents the Z\-score distribution chart for the expert assessment for all the 35 counterarguments\. The chart helps to determine how many standard deviations each value for all the rubrics is from the average\. We can observe from the chart that Focus, followed by Logic, is the rubric with the least distance from the mean while Reference is the one with the most \(with some values above 2\), though not extreme\.
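The Z-score described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: scores are standardized per rubric, the population standard deviation is used (the text does not say which variant was applied), and the function name `z_scores` is hypothetical.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Return how many standard deviations each score lies from the mean.

    Assumption: population standard deviation. A rubric on which every
    submission got the same score has zero spread, so all of its
    z-scores are reported as 0.0.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]
```

Applied to one rubric's 35 expert scores, the returned list has one value per counterargument. Values with an absolute magnitude above 2 correspond to the more distant points noted for Reference.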

![Refer to caption](https://arxiv.org/html/2605.05353v1/images/zdistribution_expert.png)

Figure 9: Z score distribution for expert reviews\.

Table 5: Reasoning examples for a Couple of Submissions

## 5 Discussion

While recent advances in GenAI have begun to demonstrate significant potential in education for autonomous learning \(Liu, [2025](https://arxiv.org/html/2605.05353#bib.bib36)\), enhancing student engagement \(Rahman and Watanobe, [2023](https://arxiv.org/html/2605.05353#bib.bib41)\), and supporting writing and idea generation \(Kasneci et al\., [2023](https://arxiv.org/html/2605.05353#bib.bib25)\), emerging research has highlighted potential negative effects on cognitive functions, including perception, learning, critical thinking, problem solving, and decision\-making abilities \(Gerlich, [2025b](https://arxiv.org/html/2605.05353#bib.bib18)\)\. There are growing concerns that extensive reliance on GenAI tools in academic tasks may have implications for cognitive development, particularly with respect to independent problem solving and critical thinking \(Kosmyna et al\., [2025b](https://arxiv.org/html/2605.05353#bib.bib28)\)\. The results of this study indicate that argument\-based learning with GenAI promotes critical thinking, given the strong agreement among assessors that statements in the counterarguments are, in general, logically constructed and focused on the given topic\. Building on Kuhn’s framework, comparing thinking routines in argument\-based learning, or any type of learning, involves examining how argumentative skills are enacted and recognized across different evaluative agents\. In human reasoning, the production of counterarguments and rebuttals reflects deliberate epistemic coordination and dialogic engagement \(Kuhn, [2018](https://arxiv.org/html/2605.05353#bib.bib31)\), consistent with models of argument structure that emphasize rebuttal as a marker of sophisticated reasoning \(Toulmin, [2003](https://arxiv.org/html/2605.05353#bib.bib45)\)\. These theoretical models position counterargument and rebuttal not as optional rhetorical devices but as core indicators of epistemic maturity\.

The transition from active information seeking to consumption of structured AI\-generated content has further reshaped how learners engage with new knowledge and reasoning processes\. Unlike conventional search engines, which provide users with multiple sources requiring comparison and evaluation, GenAI typically delivers a single synthesized response\. While such responses may discourage the cognitive engagement needed to analyse and evaluate information critically \(Kasneci et al\., [2023](https://arxiv.org/html/2605.05353#bib.bib25)\), in argument\-based learning they appear to challenge learners to construct logical responses\. Beyond this study, there is a need for continued research on the interaction between GenAI and the human brain in other learning paradigms, especially regarding the influence on cognitive skill development\. It is imperative to promote cognitive engagement across diverse tasks while ensuring that GenAI tools are used to support, rather than replace, core thinking processes\.

## 6 Conclusion

The ongoing transformation of education by GenAI looks set to continue\. It appears that every area will be affected, and learning needs to adapt to these changes\. In this work, we showed that, in argument\-based learning, students’ self\-written counterarguments to AI\-generated content promote critical thinking because they contain logic, in addition to other important components\. Furthermore, we showed that GenAI can be successfully used to assess students’ counterarguments based on clear rubrics, and that these assessments generally align with expert and students’ assessments\. Future work needs to evaluate the impact GenAI has on other types of learning and other areas of education\.

## Acknowledgments

The authors wish to thank the Department of Computer Science, Electrical & Space Engineering at Luleå University of Technology for the 2026 SRT pedagogy fund for this project\. We also thank all the participating students of the Text Mining course, 2025/26 session\. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program \(WASP\), funded by the Knut and Alice Wallenberg Foundation and counterpart funding from Luleå University of Technology \(LTU\)\.

## Appendix A

Table 6: Checks for AI\-generated Counterarguments

## References

- Adewumi et al\. \(2025a\)Adewumi, T\., Alkhaled, L\., Buck, C\., Hernandez, S\., Brilioth, S\., Kekung, M\., Ragimov, Y\., Barney, E\., 2025a\.Procot: Stimulating critical thinking and writing of students through engagement with large language models \(llms\)\.Journal of Pedagogical Sociology and Psychology 7\.doi:[10\.33902/jpsp\.202536789](http://dx.doi.org/10.33902/jpsp.202536789)\.
- Adewumi et al\. \(2024\)Adewumi, T\., Alkhaled, L\., Gurung, N\., van Boven, G\., Pagliai, I\., 2024\.Fairness and bias in multimodal ai: A survey\.arXiv preprint arXiv:2406\.19097 \.
- Adewumi et al\. \(2025b\)Adewumi, T\., Alkhaled, L\., Imbert, F\., Han, H\., Habib, N\., Löwenmark, K\., 2025b\.Ai must not be fully autonomous\.arXiv preprint arXiv:2507\.23330 \.
- Adewumi et al\. \(2025c\)Adewumi, T\., Habib, N\., Alkhaled, L\., Barney, E\., 2025c\.On the limitations of large language models \(LLMs\): False attribution, in: Angelova, G\., Kunilovskaya, M\., Escribe, M\., Mitkov, R\. \(Eds\.\), Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing \- Natural Language Processing in the Generative AI Era, INCOMA Ltd\., Shoumen, Bulgaria, Varna, Bulgaria\. pp\. 11–21\.URL:[https://aclanthology\.org/2025\.ranlp\-1\.2/](https://aclanthology.org/2025.ranlp-1.2/)\.
- Adewumi et al\. \(2025d\)Adewumi, T\., Liwicki, F\.S\., Liwicki, M\., Gardelli, V\., Alkhaled, L\., Mokayed, H\., 2025d\.Findings of mega: Math explanation with llms using the socratic method for active learning\.IEEE Signal Processing Magazine 42, 77–94\.doi:[10\.1109/MSP\.2025\.3590807](http://dx.doi.org/10.1109/MSP.2025.3590807)\.
- Alkharusi \(2022\)Alkharusi, H\., 2022\.A descriptive analysis and interpretation of data from likert scales in educational and psychological research\.Indian Journal of Psychology and Education 12, 13–16\.
- Anderson and Krathwohl \(2001\)Anderson, L\.W\., Krathwohl, D\.R\., 2001\.A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives: complete edition\.Addison Wesley Longman, Inc\.
- Bloom et al\. \(1956\)Bloom, B\.S\., Engelhart, M\.D\., Furst, E\.J\., Hill, W\.H\., Krathwohl, D\.R\., et al\., 1956\.Taxonomy of educational objectives: The classification of educational goals\. Handbook 1: Cognitive domain\.Longman New York\.
- Brookhart \(2013\)Brookhart, S\.M\., 2013\.How to create and use rubrics for formative assessment and grading\.Ascd\.
- Chi and Wylie \(2014\)Chi, M\.T\., Wylie, R\., 2014\.The icap framework: Linking cognitive engagement to active learning outcomes\.Educational psychologist 49, 219–243\.
- Duron et al\. \(2006\)Duron, R\., Limbach, B\., Waugh, W\., 2006\.Critical thinking framework for any discipline\.International Journal of teaching and learning in higher education 17, 160–166\.
- Dwyer et al\. \(2014\)Dwyer, C\.P\., Hogan, M\.J\., Stewart, I\., 2014\.An integrated critical thinking framework for the 21st century\.Thinking skills and Creativity 12, 43–52\.
- Dwyer et al\. \(2015\)Dwyer, C\.P\., Hogan, M\.J\., Stewart, I\., 2015\.The promotion of critical thinking skills through argument mapping \.
- Ennis and Weir \(1985\)Ennis, R\.H\., Weir, E\.E\., 1985\.The Ennis\-Weir critical thinking essay test: An instrument for teaching and testing\.Midwest Publications\.
- Facione \(1990\)Facione, P\.A\., 1990\.The delphi report: Committee on pre\-college philosophy, in: American Philosophical Association\.
- Fulan et al\. \(2025\)Fulan, L\., Mengchen, Z\., Wenyun, L\., 2025\.Corpus\-assisted counterargumentation instruction: cultivating critical thinking via argumentative writing\.Thinking Skills and Creativity , 102120\.
- Gerlich \(2025a\)Gerlich, M\., 2025a\.Ai tools in society: Impacts on cognitive offloading and the future of critical thinking\.Societies 15, 6\.
- Gerlich \(2025b\)Gerlich, M\., 2025b\.AI tools in society: Impacts on cognitive offloading and the future of critical thinking\.Societies 15, 6\.doi:[10\.3390/soc15010006](http://dx.doi.org/10.3390/soc15010006)\.
- González et al\. \(2026\)González, I\., Rapanta, C\., Larrain, A\., 2026\.Promoting argumentation skills among university students: A scoping review\.Higher Education Quarterly 80, e70080\.
- Halpern \(2006\)Halpern, R\., 2006\.Halpern critical thinking assessment using everyday situations: Background and scoring standards claremont ca: Claremont mckenna college \.
- Helal et al\. \(2025\)Helal, M\.Y\., Elgendy, I\.A\., Albashrawi, M\.A\., Dwivedi, Y\.K\., Al\-Ahmadi, M\.S\., Jeon, I\., 2025\.The impact of generative ai on critical thinking skills: a systematic review, conceptual framework and future research directions\.Information Discovery and Delivery \.
- Howard et al\. \(2012\)Howard, R\.D\., McLaughlin, G\.W\., Knight, W\.E\., 2012\.The handbook of institutional research\.John Wiley & Sons\.
- Jonsson and Svingby \(2007\)Jonsson, A\., Svingby, G\., 2007\.The use of scoring rubrics: Reliability, validity and educational consequences\.Educational research review 2, 130–144\.
- Joshi et al\. \(2015\)Joshi, A\., Kale, S\., Chandel, S\., Pal, D\.K\., 2015\.Likert scale: Explored and explained\.British journal of applied science & technology 7, 396\.
- Kasneci et al\. \(2023\)Kasneci, E\., Seßler, K\., Küchemann, S\., Bannert, M\., Dementieva, D\., Fischer, F\., Gasser, U\., Groh, G\., Günnemann, S\., Hüllermeier, E\., et al\., 2023\.Chatgpt for good? on opportunities and challenges of large language models for education\.Learning and individual differences 103, 102274\.
- Kocmi and Federmann \(2023\)Kocmi, T\., Federmann, C\., 2023\.Large language models are state\-of\-the\-art evaluators of translation quality, in: Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pp\. 193–203\.
- Kosmyna et al\. \(2025a\)Kosmyna, N\., Hauptmann, E\., Yuan, Y\.T\., Situ, J\., Liao, X\.H\., Beresnitzky, A\.V\., Braunstein, I\., Maes, P\., 2025a\.Your brain on chatgpt: Accumulation of cognitive debt when using an ai assistant for essay writing task\.arXiv preprint arXiv:2506\.08872 4\.
- Kosmyna et al\. \(2025b\)Kosmyna, N\., Hauptmann, E\., Yuan, Y\.T\., Situ, J\., Liao, X\.H\., Beresnitzky, A\.V\., Braunstein, I\., Maes, P\., 2025b\.Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task\.doi:[10\.48550/ARXIV\.2506\.08872](http://dx.doi.org/10.48550/ARXIV.2506.08872)\. version Number: 2\.
- Ku \(2009\)Ku, K\.Y\., 2009\.Assessing students’ critical thinking performance: Urging for measurements using multi\-response format\.Thinking skills and creativity 4, 70–76\.
- Kuhn \(1991\)Kuhn, D\., 1991\.The skills of argument\.Cambridge University Press\.
- Kuhn \(2018\)Kuhn, D\., 2018\.A role for reasoning in a dialogic approach to critical thinking\.Topoi 37, 121–128\.
- Lai \(2011\)Lai, E\.R\., 2011\.Critical thinking: A literature review\.Pearson’s research reports 6, 40–41\.
- Li et al\. \(2026\)Li, X\., Jiang, Q\., Jiang, L\., Zhang, S\., Hu, S\., 2026\.The landscape of ai alignment: A comprehensive review of theories and methods\.International Journal of Pattern Recognition and Artificial Intelligence 40, 2539001\.
- Ling \(2025\)Ling, J\.H\., 2025\.A review of rubrics in education: Potential and challenges\.Indonesian Journal of Innovative Teaching and Learning 2, 1–14\.
- Liu and Stapleton \(2020\)Liu, F\., Stapleton, P\., 2020\.Counterargumentation at the primary level: An intervention study investigating the argumentative writing of second language learners\.System 89, 102198\.
- Liu \(2025\)Liu, J\., 2025\.The role of generative AI in the process of autonomous learning of college students\.Journal of Education, Humanities and Social Sciences 53, 38–42\.doi:[10\.54097/brzv3w55](http://dx.doi.org/10.54097/brzv3w55)\.
- Manurung et al\. \(2022\)Manurung, M\.R\., Masitoh, S\., Arianto, F\., 2022\.How thinking routines enhance critical thinking of elementary students\.IJORER: International Journal of Recent Educational Research 3, 640–650\.
- Mulaudzi and Hamilton \(2025\)Mulaudzi, L\.V\., Hamilton, J\., 2025\.Lecturer’s perspective on the role of ai in personalized learning: Benefits, challenges, and ethical considerations in higher education\.Journal of Academic Ethics 23, 1571–1591\.
- Nussbaum and Sinatra \(2003\)Nussbaum, E\.M\., Sinatra, G\.M\., 2003\.Argument and conceptual engagement\.Contemporary Educational Psychology 28, 384–395\.
- Pinedo et al\. \(2018\)Pinedo, R\., García, N\., Cañas, M\., 2018\.Thinking routines across different subjects and educational levels, in: INTED2018 Proceedings, IATED\. pp\. 5577–5580\.
- Rahman and Watanobe \(2023\)Rahman, M\.M\., Watanobe, Y\., 2023\.ChatGPT for education and research: Opportunities, threats, and strategies\.Applied Sciences 13, 5783\.doi:[10\.3390/app13095783](http://dx.doi.org/10.3390/app13095783)\.
- Ritchhart et al\. \(2011\)Ritchhart, R\., Church, M\., Morrison, K\., 2011\.Making thinking visible: How to promote engagement, understanding, and independence for all learners\.John Wiley & Sons\.
- Romiszowski \(2016\)Romiszowski, A\.J\., 2016\.Designing instructional systems: Decision making in course planning and curriculum design\.Routledge\.
- Sinfield and Burns \(2023\)Sinfield, S\., Burns, T\., 2023\.Design thinking in education: Adding collaboration, uncertainty, phronesis and fairydust to curriculum design\.International Journal of Management and Applied Research 10, 263–269\.
- Toulmin \(2003\)Toulmin, S\.E\., 2003\.The uses of argument\.Cambridge university press\.
- Tripathi et al\. \(2025\)Tripathi, S\., Alkhulaifat, D\., Lyo, S\., Sukumaran, R\., Li, B\., Acharya, V\., McBeth, R\., Cook, T\.S\., 2025\.A hitchhiker’s guide to good prompting practices for large language models in radiology\.Journal of the American College of Radiology 22, 841–847\.
- Watson \(1980\)Watson, G\., 1980\.Watson\-Glaser critical thinking appraisal\. volume 3\.Psychological Corporation San Antonio, TX\.
- Yavuz et al\. \(2025\)Yavuz, F\., Çelik, Ö\., Yavaş Çelik, G\., 2025\.Utilizing large language models for efl essay grading: An examination of reliability and validity in rubric\-based assessments\.British Journal of Educational Technology 56, 150–166\.
- Zheng et al\. \(2023\)Zheng, L\., Chiang, W\.L\., Sheng, Y\., Zhuang, S\., Wu, Z\., Zhuang, Y\., Lin, Z\., Li, Z\., Li, D\., Xing, E\., et al\., 2023\.Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems 36, 46595–46623\.
- Zou et al\. \(2025\)Zou, D\., Xie, H\., Kohnke, L\., 2025\.Navigating the future: establishing a framework for educators’ pedagogic artificial intelligence competence\.European Journal of Education 60, e70117\.
