Findings of the Counter Turing Test: AI-Generated Text Detection

arXiv cs.CL Papers

Summary

This paper presents findings from the Counter Turing Test shared task on AI-generated text detection, with top systems achieving perfect binary classification but significantly lower performance in model attribution, highlighting the difficulty of distinguishing outputs from different large language models.

arXiv:2605.20761v1 Announce Type: new

# Findings of the Counter Turing Test: AI-Generated Text Detection
Source: [https://arxiv.org/html/2605.20761](https://arxiv.org/html/2605.20761)

Copyright for this paper by its authors\. Use permitted under Creative Commons License Attribution 4\.0 International \(CC BY 4\.0\)\.

Defactify 4\.0: Multimodal Fact\-Checking and AI\-Generated Text Detection, March 2025, Virtual Conference

Emails: royrajarshi0123@gmail\.com, gurpreet\.singh@email\.com, ashhar21137@iiitd\.ac\.in, f20213041@hyderabad\.bits\-pilani\.ac\.in, imanpour@email\.sc\.edu, shwetangshu\.biswas@email\.com, kapilw25@gmail\.com, parthpatwa@g\.ucla\.edu, subhankar\.ghosh@email\.com, shreyasrd31@gmail\.com, nileshpal530@gmail\.com, vipula\.rawte@email\.com, ritvik\.garimella@email\.com, amitava\.das2@wipro\.com, amit@sc\.edu, sharma\.vasu55@gmail\.com, aish\.nr@gmail\.com, hi@vinija\.ai, hi@amanchadha\.com

Authors: Gurpreet Singh, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

Affiliations: Kalyani Government Engineering College, India; IIIT Guwahati, India; IIIT Delhi, India; BITS Pilani Hyderabad Campus, India; AI Institute, University of South Carolina, USA; National Institute of Technology Silchar, India; San José State University, USA; University of California Los Angeles, USA; Washington State University, USA; Vishwakarma Institute of Information Technology, India; Meta AI, USA; Amazon AI, USA; BITS Pilani, Goa

\(2026\)

###### Abstract

The rapid proliferation of AI\-generated text has introduced significant challenges in maintaining the integrity of digital content\. Advanced generative models such as GPT\-4, Claude 3\.5, and Llama can produce highly coherent and human\-like text, making it increasingly difficult to differentiate between human\-written and AI\-generated content\. While these models have transformative applications, their misuse has raised concerns about misinformation, biased narratives, and security threats\.

This paper provides a comprehensive analysis of state\-of\-the\-art AI\-generated text detection techniques and evaluates their effectiveness through the Counter Turing Test \(CT2\) shared tasks\. Task A \(Binary Classification\) required participants to distinguish between human\-written and AI\-generated text, while Task B \(Model Attribution\) focused on identifying the specific language model responsible for generating a given text\. The results demonstrated high performance in binary classification, with the top system achieving an F1 score of 1\.0000, but significantly lower scores in model attribution, where the best system achieved 0\.9531, highlighting the increased complexity of this task\.

The top\-performing teams leveraged fine\-tuned transformer models, ensemble learning, and hybrid detection approaches, with DeBERTa\-based and BART\-based methods demonstrating strong results\. However, the lower scores in Task B underscore the challenges of distinguishing outputs from different LLMs, necessitating further research into adversarial robustness, feature extraction, and cross\-domain generalization\.

###### keywords:

AI\-Generated Text, Detection Techniques, Generative AI, Natural Language Processing

## 1 Introduction

Generative AI technologies such as GPT\-4 \[openai2023gpt4\], Claude \[anthropic2024claude\], and Llama \[touvron2023llama\] have revolutionized the creation of synthetic text content\. These tools leverage advanced neural architectures to produce highly coherent and contextually relevant text, enabling a wide range of applications in industries such as content creation, education, and customer service\. However, the increasing accessibility of these tools has introduced significant challenges related to their potential misuse for spreading disinformation, generating spam, and manipulating public opinion \[solaiman2019release, krishna2023deception\]\.

High\-profile incidents have demonstrated the societal and economic impact of AI\-generated text\. For instance, fabricated news articles and social media posts created by language models have manipulated political narratives, propagated false information, and influenced public sentiment\. During the 2020 U\.S\. presidential election, there were concerns about the use of generative AI to create fake political content, highlighting the vulnerabilities of digital ecosystems to AI\-driven disinformation campaigns\. Similarly, AI\-generated fake scientific papers have posed challenges to academic integrity, raising questions about the reliability of published research\.

The exponential growth in the capabilities of language models further exacerbates the problem\. Technologies like GPT\-4 and Claude 3\.5 have pushed the boundaries of linguistic fluency and contextual understanding, making it increasingly difficult to distinguish between AI\-generated and human\-written text using traditional detection methods\. The proliferation of synthetic text has raised concerns among policymakers and technologists alike\. A report by the European Union noted a decline in responsiveness to online disinformation notifications, reflecting the growing challenge of combating AI\-enabled misinformation\.

Moreover, the misuse of AI\-generated text extends beyond disinformation\. It includes hate speech, phishing attacks, and the generation of biased or harmful narratives\. For example, language models can be fine\-tuned or prompted to produce content that embeds subtle biases or promotes divisive ideologies\. These scenarios underscore the urgency of developing robust detection mechanisms to address the misuse of generative AI technologies\.

This paper focuses on advancing detection techniques for AI\-generated text by analyzing the state\-of\-the\-art methods, identifying gaps, and proposing a comprehensive framework for evaluation\. Building on the insights from the Defactify workshop series \[roy\-2025\-defactify\-overview\-text\], which has established itself as a leading forum for addressing challenges in multimodal fact\-checking, this study aims to bridge the gap between academic research and practical applications\. By addressing these challenges, we seek to contribute to the development of scalable, reliable detection systems that safeguard the integrity of digital ecosystems\.

## 2 Related Work

The evolution of Large Language Models \(LLMs\) has necessitated robust methods for distinguishing between human and AI\-generated content\. This section reviews existing datasets and detection methods\.

Statistical & Zero\-Shot Detectors: Researchers have recently explored methods that identify intrinsic signatures of AI text without requiring labeled training data\. DetectGPT \[mitchell2023detectgptzeroshotmachinegeneratedtext\] utilizes the negative curvature of an LLM’s log\-probability surface to identify machine\-generated text through probability discrepancies\. Similarly, GLTR \[gehrmann2019gltrstatisticaldetectionvisualization\] offers a suite of statistical tools, such as word\-rank and entropy analysis, to improve human detection rates\. More recent advancements include DNA\-GPT \[zhang2023dnagptgeneralizedpretrainedtool\], which analyzes N\-gram divergence during text "regeneration," and Binoculars \[hans2024spottingllmsbinocularszeroshot\], which calculates a score based on the contrast between two related language models to identify synthetic signals\.

Supervised Classifiers and Architectural Evolution: This approach involves training neural discriminators on curated data of human\-written and AI\-generated text\. Early work, such as Grover \[zellers2020defendingneuralfakenews\], demonstrated that the best generators often serve as the most effective discriminators for their own outputs\. Research by Ippolito et al\. \[ippolito2020automaticdetectiongeneratedtext\] explored how various decoding strategies \(e\.g\., nucleus sampling\) create machine\-detectable cues \(or statistical artifacts\) even when they successfully fool human evaluators\. Ghostbuster \[verma2024ghostbusterdetectingtextghostwritten\] utilizes a linear classifier trained on features extracted from weaker language models to perform black\-box detection\. Other studies have also utilized linguistic features \(e\.g\., Linguistic Inquiry and Word Count\) for authorship attribution \[uchendu\-etal\-2020\-authorship\]\.

Adversarial Evasion and Active Defense: As detectors evolve, so do methods to bypass them\. Raidar \[mao2024raidargenerativeaidetection\] explores the robustness of detectors against adversarial manipulations via rewriting\-based identification\. While paraphrasing remains a significant threat to detector accuracy, research suggests that semantic retrieval and caching API outputs can act as effective defenses \[krishna2023paraphrasingevadesdetectorsaigenerated\]\. To provide more definitive verification, watermarking frameworks, such as the "red list" and "green list" logit biasing proposed by Kirchenbauer et al\. \[kirchenbauer2024watermarklargelanguagemodels\], embed invisible signals into LLM outputs\. Conversely, frameworks like PIFE \[teja2025modelingattackdetectingaigenerated\] aim for perturbation\-invariant feature engineering to maintain detection accuracy despite character\-level or word\-level attacks\.
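To make the red/green\-list idea concrete, the sketch below shows how a detector could test a token sequence for such a watermark. It is a simplified illustration and not the implementation of Kirchenbauer et al.: the hash\-based `is_green` membership rule, the green\-list fraction `gamma`, and all function names are assumptions made for this example. The statistic itself is an ordinary one\-proportion z\-test that compares the observed number of green tokens with the `gamma * T` expected by chance in unwatermarked text.

```python
import hashlib
import math


def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Hypothetical green-list rule: hash the (previous token, token) pair to a
    number in [0, 1) and call the token green if that number is below gamma."""
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def watermark_z_score(tokens: list[str], gamma: float = 0.5) -> float:
    """z-score of the green-token count over the T tokens that have a predecessor."""
    scored = len(tokens) - 1
    if scored < 1:
        raise ValueError("need at least two tokens to score")
    green = sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))
```

A large positive z\-score suggests the text was sampled with a bias toward green tokens, while unwatermarked text should score near zero on average.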

## 3 Task Details

We use the dataset provided in \[roy2025comprehensivedatasethumanvs\] for AI\-generated text detection\. The dataset included 50,000 samples across various domains, ensuring diversity in style, topic, and complexity\.

### 3\.1 Data

The dataset \[roy2025comprehensivedatasethumanvs\] consisted of 50,000 samples, structured to include human\-written stories paired with parallel generations from six modern LLMs\. Each entry was enriched with annotated metadata, detailing the source model, input prompts, and relevant linguistic features\. This structure allowed for comprehensive analysis and model\-specific evaluations\. The LLMs are Gemma\-2\-9, Mistral\-7B, Qwen\-2\-72B, LLaMA\-8B, Yi\-Large, and GPT\-4o\. The dataset was divided into train, validation, and test sets with sizes 51,247 / 10,983 / 10,963, respectively\.

### 3\.2 Tasks

- Task A: Binary Classification\. Participants were tasked with determining whether a given text sample was generated by AI or written by a human\.
- Task B: Model Attribution\. Building on Task A, this task required participants to identify the specific language model responsible for generating a given text sample\. Participants were provided with AI\-generated samples and tasked with predicting which LLM generated the text\.

### 3\.3 Evaluation

Performance in the competition is assessed using the F1\-score\. For Task A, we report the weighted F1\-score, which accounts for label imbalance by averaging the F1\-scores of each class weighted by their support\. For Task B, we use the macro F1\-score, which treats all classes equally by computing the unweighted mean of the per\-class F1\-scores, thus emphasizing the ability to distinguish unique patterns across different model\-generated outputs\.
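The difference between the two averages can be sketched in a few lines of plain Python. This is an illustrative re\-implementation of the standard definitions, not the organizers' scoring script; the function names and the toy labels in the usage note are assumptions.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in the gold or predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (the Task B metric)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores weighted by class support (the Task A metric)."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)  # number of gold examples per class
    return sum(scores[c] * support[c] for c in scores) / len(y_true)
```

For example, with eight gold `"human"` and two gold `"ai"` labels where one `"ai"` sample is misclassified as `"human"`, `weighted_f1` is higher than `macro_f1`, because the error falls on the minority class and the macro average gives that class the same weight as the majority class.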

### 3\.4 Baseline

We implement a baseline inspired by the Raidar method \[mao2024raidargenerativeaidetection\], which detects machine\-generated content via rewriting\. The key idea is that LLMs tend to make fewer edits when rewriting AI\-generated text compared to human\-written text\. As illustrated in Figure [3\.4](https://arxiv.org/html/2605.20761#S3.SS4), a fixed rewriting model \(GPT\-3\.5\-Turbo\) is prompted to rewrite the input text, and the Levenshtein edit distance between the original and rewritten text is computed\. The model whose rewrite yields the minimum edit distance is predicted as the generator\. If all edit distances exceed a predefined threshold, chosen as the median of the maximum edit distance across training samples, the input is classified as human\-written\.
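The decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the baseline's actual code: the `rewriters` mapping (candidate label to a rewrite function) and the `predict` helper are hypothetical, and in the real baseline each rewrite would come from an LLM call instead of a local function.

```python
from typing import Callable, Dict


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def predict(text: str,
            rewriters: Dict[str, Callable[[str], str]],
            threshold: int) -> str:
    """Return 'human' if every rewrite is farther than `threshold` from the input,
    otherwise the label whose rewrite changed the text the least."""
    distances = {name: levenshtein(text, rewrite(text))
                 for name, rewrite in rewriters.items()}
    if all(d > threshold for d in distances.values()):
        return "human"
    return min(distances, key=distances.get)
```

The threshold is passed in as a parameter so that the same function can be evaluated under different threshold strategies.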

We evaluate the baseline under three threshold strategies, as shown in Table [1](https://arxiv.org/html/2605.20761#S3.T1)\. The median threshold, which matches the methodology described above, is used as the official baseline for the competition leaderboard, achieving a weighted F1\-score of 0\.5300 on Task A and a macro F1\-score of 0\.0504 on Task B\.

| Threshold Strategy | Task A \(weighted F1\) | Task B \(macro F1\) |
| --- | --- | --- |
| F1\-optimized | 0\.8400 | 0\.0863 |
| Max edit distance | 0\.8457 | 0\.0872 |
| Median edit distance | 0\.5300 | 0\.0504 |

Table 1: Baseline F1\-scores under different threshold strategies for the Raidar rewriting\-based method\.

![[Uncaptioned image]](https://arxiv.org/html/2605.20761v1/x1.png)

Figure: Illustration of the Raidar concept\. Given a news text and an LLM\-generated text, the same LLM is asked to rewrite the inputs while preserving meaning\. The rewriting of a human\-written text undergoes more character\-level edits \(highlighted in red/yellow\), while the rewriting of an LLM\-generated text remains largely unchanged\.

## 4 Participating Systems

With over 52 registrations on the competition web page, there were final leaderboard submissions from 11 teams, with 7 teams making paper submissions\.

The first participating team is Sarang \[trivedi2025sarang\]\. They present a fine\-tuned DeBERTa\-based \[he2021debertadecodingenhancedbertdisentangled\] approach that secured first place in both Task A and Task B\. Their method involves an ensemble of DeBERTa models trained on a noisy dataset, incorporating data augmentation techniques to enhance model robustness and generalization\.

The Dakiet \[duong2025scalableframeworkclassifyingaigenerated\] team introduces a scalable framework that integrates perceptual hashing, similarity measurement, and pseudo\-labeling\. Their approach leverages BART \[lewis2019bartdenoisingsequencetosequencepretraining\] Large as the backbone model, achieving second place in Task A and third place in Task B\.

The Tesla team employs a feature\-rich methodology, extracting style, language complexity, bias, subjectivity, and emotion\-based features, alongside TF\-IDF unigram and bigram representations\. They utilize XGBoost \[Chen\_2016\] models, leading to high F1 scores and securing third place in Task A and second place in Task B\.
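As a rough illustration of what TF\-IDF unigram and bigram representations look like, the sketch below builds sparse feature dictionaries in plain Python. It is not the Tesla team's pipeline: whitespace tokenization, the smoothed IDF variant `log((1 + N) / (1 + df)) + 1`, and the function names are assumptions, and the classifier that would consume these features is omitted.

```python
import math
from collections import Counter


def ngrams(text: str, n_max: int = 2) -> list[str]:
    """Lower-cased whitespace tokens joined into all n-grams for n = 1..n_max."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(corpus: list[str], n_max: int = 2) -> list[dict[str, float]]:
    """Return one {n-gram: weight} dictionary per document in the corpus."""
    docs = [Counter(ngrams(doc, n_max)) for doc in corpus]
    # Document frequency: iterating a Counter yields each distinct n-gram once.
    df = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    return [
        {term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
         for term, count in doc.items()}
        for doc in docs
    ]
```

N\-grams that occur in every document receive the minimum IDF of 1, so terms that are specific to a few documents carry more weight in the resulting feature vectors.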

The SKDU \[malviya2025skdu\] team explores a pipelined approach leveraging RAIDAR\-inspired rewriting features and NELA toolkit content\-based features for feature extraction\. Their experiments highlight that NELA features outperform RAIDAR \[mao2024raidargenerativeaidetection\] features, with XGBoost proving to be the most effective classifier\.

The Drocks \[abburi2025ai\] team develops two neural architectures per task: an Optimized Model and a Simpler Variant\. Their optimized model ranks 5th in Task A, while the simpler version secures 5th in Task B\. The approach enhances a generalizable neural model with RoBERTa, BiLSTM, and E5 embeddings \[wang2024textembeddingsweaklysupervisedcontrastive\] \(Full Architecture\)\. To reduce complexity, the Optimized Architecture replaces BiLSTM token\-level features with stylometry\. For multiclass classification, the Simple Architecture combines E5 embeddings and 11 stylometric features with a gradient boosting classifier\.

The AI\_Blues \[guggilla2025ai\] team adopts a fine\-tuning–based strategy using both large language models and transformer encoders\. They fine\-tune GPT\-4o\-mini, LLaMA\-3 8B \[grattafiori2024llama3herdmodels\], and BERT \[devlin2019bertpretrainingdeepbidirectional\] for Task A and Task B, employing task\-specific prompting for LLMs and supervised training for BERT\. Their results demonstrate strong performance in human–AI discrimination, with GPT\-4o\-mini excelling in Task A, while BERT shows better performance in Task B\.

The Osint \[agrahari2025tracing\] team proposes COT\_Finetuned, a dual\-task framework that integrates Chain\-of\-Thought \(CoT\) \[wei2023chainofthoughtpromptingelicitsreasoning\] reasoning into supervised text classification\. The approach jointly addresses AI\-generated text detection and LLM identification, while producing interpretable reasoning traces\. By incorporating CoT into models such as BERT, their method improves performance over standard fine\-tuning and emphasizes explainability alongside accuracy\.

## 5 Results

| S\.No | Team Name | Scores |
| --- | --- | --- |
| 1 | Sarang | 1\.0000 |
| 2 | Dakiet | 0\.9999 |
| 3 | Tesla | 0\.9962 |
| 4 | SKDU | 0\.9945 |
| 5 | Drocks | 0\.9941 |
| 6 | Llama\_Mamba | 0\.9880 |
| 7 | AI\_Blues | 0\.9547 |
| 8 | NLP\_great | 0\.9157 |
| 9 | Osint | 0\.8982 |
| 10 | Xiaoyu | 0\.8030 |
| 11 | Rohan | 0\.7546 |
| \- | BASELINE | 0\.5300 |

Table 2: Leaderboard for Task A: Classify each text document as either AI\-generated or human\-written\.

Table [2](https://arxiv.org/html/2605.20761#S5.T2) showcases the leaderboard for Task A, where participants are ranked based on their scores\. The highest score of 1\.0000 is achieved by Team Sarang, followed closely by Team Dakiet with a score of 0\.9999\. The top five participants have scores above 0\.99, demonstrating strong performance in this task\.

| S\.No | Team Name | Scores |
| --- | --- | --- |
| 1 | Sarang | 0\.9531 |
| 2 | Tesla | 0\.9218 |
| 3 | Dakiet | 0\.9082 |
| 4 | SKDU | 0\.7615 |
| 5 | Drocks | 0\.6270 |
| 6 | Xiaoyu | 0\.5696 |
| 7 | AI\_Blues | 0\.4698 |
| 8 | Llama\_Mamba | 0\.4551 |
| 9 | Rohan | 0\.4053 |
| 10 | Osint | 0\.3072 |
| 11 | NLP\_great | 0\.1874 |
| \- | BASELINE | 0\.0504 |

Table 3: Leaderboard for Task B: Given an AI\-generated text, determine which specific LLM produced it\.

Table [3](https://arxiv.org/html/2605.20761#S5.T3) presents the leaderboard for Task B, which has a different ranking compared to Task A\. Team Sarang secures the top position with a score of 0\.9531\. The overall scores for Task B are lower than those in Task A, confirming that Task B is more challenging\. The top three participants get comparable scores whereas the lower\-ranked participants have considerably lower scores\.

## 6 Conclusion

The Defactify 4\.0 workshop has highlighted the urgent need for advanced methodologies to detect AI\-generated text\. The shared tasks demonstrated that while binary classification of AI\-generated text has seen strong performance, model attribution remains a significant challenge, with every participating team scoring a lower F1 on Task B than on Task A\. The top\-performing systems leveraged fine\-tuned transformer models, ensemble learning, and hybrid techniques, underscoring the importance of combining linguistic and feature\-based detection methods\. By building on the shared tasks, datasets, and evaluation frameworks discussed in this paper, we aim to inspire further innovation in this critical area\. Future research should focus on improving detection robustness, enhancing adversarial resistance, and developing scalable solutions for real\-world deployment\. Robust detection systems are essential for safeguarding the integrity of digital ecosystems and mitigating the risks associated with generative AI technologies\.

## References
