# MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text
Source: [https://arxiv.org/html/2605.06903](https://arxiv.org/html/2605.06903)
Chenjun Li¹·², Cheng Wan¹·², Johannes C. Paetzold¹·²·³
¹Cornell University, Ithaca, NY 14853, USA · ²Weill Cornell Medicine, New York, NY 10021, USA · ³Cornell Tech, New York, NY 10044, USA
###### Abstract
Large language models are deeply embedded in everyday writing workflows, making reliable AI-generated text detection important for academic integrity, content moderation, and provenance tracking. In practice, however, a detector must do more than achieve high aggregate AUROC on clean, in-distribution human and AI text: it should remain robust to attacks and adversarial rewrites, transfer to new and unseen generators and writing domains, and operate at low false-positive rates (FPR). Most existing detectors optimize a single AI/Human objective, which gives the representation little incentive to learn generator, attack, or domain structure once the binary task becomes saturated. We introduce MELD (Multi-Task Equilibrated Learning Detector), a deployable detector for AI-generated text that enriches binary detection with auxiliary supervision. MELD attaches generator-family, attack-type, and source-domain heads to a shared encoder backbone, and balances the four losses with learned homoscedastic uncertainty weights. To improve robustness, an exponential moving average (EMA) teacher predicts on clean inputs while an attack-augmented student is distilled toward the teacher. MELD further uses a hard-negative pairwise ranking loss that enforces a larger score margin between AI-generated texts and the human texts the detector finds most confusable. At inference, all auxiliary heads are discarded, so MELD has the same interface and cost as a standard detector. On the public RAID benchmark leaderboard, MELD is the strongest open-source detector and is competitive with leading commercial models, especially when inputs are under attack and false-positive rates must remain low. Across standard held-out benchmarks, MELD matches or outperforms supervised baselines. We further introduce MELD-eval, a held-out evaluation pool built from recent chat models released by four major LLM providers. Without additional finetuning, MELD achieves 99.9% TPR at 1% FPR on MELD-eval, while many baselines degrade sharply.
[Code](https://anonymous.4open.science/r/MELD-4D74) · [Model & Data](https://huggingface.co/anon-review-meld-2026/meld)
Figure 1: Overview of MELD. A shared encoder (Student) is trained with a main classification head and three auxiliary heads for generator family, attack type, and source domain. During training, clean inputs are passed through an EMA teacher, while the student is trained on clean or attack-augmented inputs. The objective combines (i) uncertainty-weighted multi-task classification, (ii) main-head teacher–student distillation between clean and attacked views, and (iii) a hard-negative pairwise ranking loss that improves separation near low-FPR decision thresholds. The auxiliary heads and teacher are discarded at inference, leaving only the student encoder and main AI/Human head.

## 1 Introduction
Large language models are embedded in everyday writing, from student homework and legal filings to scientific manuscripts and online communication. Reliable detectors for AI-generated text are therefore becoming important tools for academic-integrity software, content-moderation pipelines, and provenance workflows. In deployment, low accuracy is not the only failure mode: false positives can carry serious consequences for human authors, including accusations of academic misconduct and unfair penalties for non-native English writers [[22](https://arxiv.org/html/2605.06903#bib.bib18)]. Importantly, simple paraphrasing and rewriting strategies have also been shown to evade or destabilize existing detectors [[20](https://arxiv.org/html/2605.06903#bib.bib17), [17](https://arxiv.org/html/2605.06903#bib.bib14), [41](https://arxiv.org/html/2605.06903#bib.bib37)]. The current literature falls into three categories: 1) training-free detectors use token-rank, likelihood-curvature, or cross-perplexity signals from reference language models [[10](https://arxiv.org/html/2605.06903#bib.bib8), [24](https://arxiv.org/html/2605.06903#bib.bib21), [4](https://arxiv.org/html/2605.06903#bib.bib2), [14](https://arxiv.org/html/2605.06903#bib.bib12)]; 2) supervised encoder detectors learn binary classifiers from labeled examples through a single binary objective [[30](https://arxiv.org/html/2605.06903#bib.bib26), [12](https://arxiv.org/html/2605.06903#bib.bib10), [8](https://arxiv.org/html/2605.06903#bib.bib6)]; and 3) a newer line incorporates fine-grained authorship or generator structure through multi-task contrastive learning, easy-to-hard supervision, and disentangled or perturbation-invariant representations [[13](https://arxiv.org/html/2605.06903#bib.bib11), [32](https://arxiv.org/html/2605.06903#bib.bib28), [7](https://arxiv.org/html/2605.06903#bib.bib5), [43](https://arxiv.org/html/2605.06903#bib.bib38)]. These advances improve benchmark performance, but they leave three deployment axes unresolved: robustness under attacks, generalization across unseen generators and domains, and operation at the low false-positive rates required in real deployments [[9](https://arxiv.org/html/2605.06903#bib.bib7)].
In response to this gap we propose MELD (Multi-Task Equilibrated Learning Detector), a detector that uses richer supervision during training while retaining the same inference interface as a standard binary classifier. MELD augments the AI/Human head with three auxiliary heads for generator family, attack type, and source domain on a shared encoder. These heads expose structure that is usually discarded in binary detector training, and they are removed at inference, so the deployed model has the same cost and interface as a standard single-head classifier. MELD combines this auxiliary supervision with learned homoscedastic uncertainty weighting [[19](https://arxiv.org/html/2605.06903#bib.bib16)], aligns attack-augmented examples to a clean exponential moving average (EMA) teacher [[33](https://arxiv.org/html/2605.06903#bib.bib27)], and adds a lightweight pairwise ranking term [[6](https://arxiv.org/html/2605.06903#bib.bib3)] (Figure [1](https://arxiv.org/html/2605.06903#S0.F1)). Our main contributions are as follows:
- **Explicit auxiliary supervision for AI-text detection.** MELD jointly trains the AI/Human classification head with generator-family, attack-type, and source-domain heads on a shared backbone. To our knowledge, MELD is the first AI-text detector to combine this particular set of explicit auxiliary heads with learned uncertainty-based loss balancing.
- **A training objective for robust representations.** MELD combines uncertainty-weighted multi-task learning, EMA teacher–student distillation between clean and attacked views, and a pairwise ranking term. The auxiliary heads are used only during training.
- **MELD-eval, a controlled evaluation pool built using current-generation models.** We introduce MELD-eval, a held-out test pool built from four current-generation chat models and paired with RAID-style English domains and attacks. MELD-eval tests zero-shot transfer with respect to these generators, while keeping the domain and attack protocol controlled. Results show that MELD-eval is one of the hardest evaluation settings we study.
- **Strong system-level results.** On RAID [[9](https://arxiv.org/html/2605.06903#bib.bib7)], the largest and most comprehensive public benchmark for AI-generated text detection, MELD ranks first among the open-source systems and is competitive with leading commercial models. It also matches or outperforms training-free and supervised baselines on other widely used benchmarks.
Table 1: RAID public leaderboard ([https://raid-bench.xyz/leaderboard](https://raid-bench.xyz/leaderboard), accessed on 2026-05-03). AUROC and TPR at 5%/1% FPR (×100) on the official RAID test set. "All settings" includes RAID's attack suite; "No attack" is the clean subset. Commercial rows are public product submissions. Open-source rows are leaderboard submissions with a paper and public model or code. MELD is the strongest open-source detector and matches or exceeds commercial systems. Best/second-best entries per column.
## 2 Related work
#### Training-free detectors.
Training-free methods usually score text under one or more reference language models (LMs) and use token statistics, likelihood geometry, or cross-model discrepancies as evidence of generation. GLTR [[10](https://arxiv.org/html/2605.06903#bib.bib8)] uses token-rank statistics. DetectGPT [[24](https://arxiv.org/html/2605.06903#bib.bib21)] and Fast-DetectGPT [[4](https://arxiv.org/html/2605.06903#bib.bib2)] rely on likelihood curvature. Binoculars [[14](https://arxiv.org/html/2605.06903#bib.bib12)] compares cross-perplexities from two LMs. These detectors are easy to deploy because they do not require detector-specific training, but their behavior is tied to the coverage and calibration of the reference models, making them sensitive to paraphrase and surface perturbations [[20](https://arxiv.org/html/2605.06903#bib.bib17), [9](https://arxiv.org/html/2605.06903#bib.bib7)].
#### Supervised encoder detectors.
Supervised methods train discriminative models from labeled human and AI text. Early studies fine-tuned RoBERTa-style encoders [[30](https://arxiv.org/html/2605.06903#bib.bib26)]. Subsequent work improved this recipe with structured features [[36](https://arxiv.org/html/2605.06903#bib.bib31)], adversarial paraphrasing [[17](https://arxiv.org/html/2605.06903#bib.bib14)], stronger encoder backbones [[38](https://arxiv.org/html/2605.06903#bib.bib34), [8](https://arxiv.org/html/2605.06903#bib.bib6)], representation-based detection [[7](https://arxiv.org/html/2605.06903#bib.bib5)], and one-class objectives [[43](https://arxiv.org/html/2605.06903#bib.bib38)]. While these methods can perform well on in-distribution benchmarks, they are typically trained with a single binary head. This gives the encoder limited incentive to preserve generator, attack, or domain information beyond what is needed for the training split. Such information is often useful when the detector is evaluated on unseen generators, domains, or attacks.
#### Auxiliary supervision beyond the binary label.
Recent work has moved beyond a pure AI-versus-human target. DeTeCtive [[13](https://arxiv.org/html/2605.06903#bib.bib11)] and FAID [[32](https://arxiv.org/html/2605.06903#bib.bib28)] use generator-aware contrastive supervision, while other approaches study easy-to-hard training [[40](https://arxiv.org/html/2605.06903#bib.bib36)], disentangled representations [[27](https://arxiv.org/html/2605.06903#bib.bib23)], surprisal-variance features [[5](https://arxiv.org/html/2605.06903#bib.bib4)], and perturbation-based features [[34](https://arxiv.org/html/2605.06903#bib.bib25)]. These methods share the idea that detector failures are often driven by factors not exposed by a single binary label. MELD follows this direction, but makes these factors explicit: generator family, attack type, and source domain are trained as prediction tasks on a shared encoder. Relative to concurrent multi-task detectors [[13](https://arxiv.org/html/2605.06903#bib.bib11), [32](https://arxiv.org/html/2605.06903#bib.bib28)], we differ in pairing explicit auxiliary heads with learned homoscedastic uncertainty balancing rather than fixed contrastive weights, and in combining this with EMA clean/attacked distillation and a low-FPR hard-negative ranking term.
#### Multi-task weighting and robust training.
MELD uses homoscedastic uncertainty weighting [[19](https://arxiv.org/html/2605.06903#bib.bib16)] to balance the main and auxiliary losses. This approach is standard in multi-task vision and has also been used in natural language processing [[23](https://arxiv.org/html/2605.06903#bib.bib20)], but has not been explored for AI-text detection. In our setting, it reduces manual loss tuning and adaptively balances auxiliary signals, helping the shared encoder retain generator, attack, and domain structure after the binary task begins to saturate (Appendix [B](https://arxiv.org/html/2605.06903#A2)).
## 3 MELD

### 3.1 Architecture
Let $\Phi:\mathcal{X}\to\mathbb{R}^{L\times H}$ be a bidirectional encoder that maps an input text $x$ to token-level hidden states (sequence length $L$, hidden size $H$), with attention mask $m(x)\in\{0,1\}^{L}$ indicating non-pad positions. We use masked mean pooling,

$$\bar{h}(x)=\Bigl(\sum_{\ell}m_{\ell}(x)\Bigr)^{-1}\sum_{\ell=1}^{L}m_{\ell}(x)\,\Phi(x)_{\ell}\in\mathbb{R}^{H},$$

and attach four heads $\hat{y}^{t}(x)=\mathrm{softmax}\bigl(f_{t}(\bar{h}(x))\bigr)$, one for each task $t$ in

$$\mathcal{T}=\{\text{main},\text{gen},\text{atk},\text{dom}\},$$

corresponding to the binary AI/Human label, generator family, attack type, and source domain. The three auxiliary heads are linear; the main AI/Human head is a two-layer MLP. At inference, only the main AI/Human head is used, so MELD has the same inference cost as a single-head encoder detector with the same backbone. We instantiate $\Phi$ with Ettin-400M [[39](https://arxiv.org/html/2605.06903#bib.bib35)], a ModernBERT-family encoder [[38](https://arxiv.org/html/2605.06903#bib.bib34)].
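To make the head layout concrete, the following is a minimal PyTorch sketch of the shared encoder with one MLP main head and three linear auxiliary heads; the masked mean pooling follows the formula above. Class and argument names are our own, and `backbone` stands in for any HF-style encoder (such as Ettin-400M) returning token-level hidden states; this is an illustration, not the released implementation.

```python
# Sketch of MELD's head layout (assumptions: PyTorch, an HF-style encoder;
# class counts G=104, A=17, D=59 follow Section 3.2).
import torch
import torch.nn as nn

class MELDHeads(nn.Module):
    def __init__(self, backbone: nn.Module, hidden: int,
                 n_gen: int = 104, n_atk: int = 17, n_dom: int = 59):
        super().__init__()
        self.backbone = backbone
        # Main AI/Human head: a two-layer MLP (Section 3.1).
        self.main = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                  nn.Linear(hidden, 2))
        # Auxiliary heads are linear and discarded at inference.
        self.gen = nn.Linear(hidden, n_gen)
        self.atk = nn.Linear(hidden, n_atk)
        self.dom = nn.Linear(hidden, n_dom)

    def pooled(self, input_ids, attention_mask):
        h = self.backbone(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
        m = attention_mask.unsqueeze(-1).float()          # (B, L, 1)
        return (h * m).sum(1) / m.sum(1).clamp(min=1.0)   # masked mean pool

    def forward(self, input_ids, attention_mask, train: bool = True):
        z = self.pooled(input_ids, attention_mask)
        if not train:                       # inference: main head only
            return {"main": self.main(z)}
        return {"main": self.main(z), "gen": self.gen(z),
                "atk": self.atk(z), "dom": self.dom(z)}
```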
### 3.2 Heterogeneous-label objective with per-task masking
The training corpora do not share the same annotations. RAID provides all four labels. Generator-tagged corpora such as MAGE [[21](https://arxiv.org/html/2605.06903#bib.bib19)] and M4GT [[37](https://arxiv.org/html/2605.06903#bib.bib33)] provide $\{\text{main},\text{gen},\text{dom}\}$. FineWeb [[26](https://arxiv.org/html/2605.06903#bib.bib22)] provides only $\{\text{main},\text{dom}\}$. The auxiliary label spaces are formed as the union of labels available across the training sources, yielding $G{=}104$ generator classes, $A{=}17$ attack classes, and $D{=}59$ domain or sub-corpus classes. We therefore compute each auxiliary loss only on examples where that label is observed. Let $\mu^{t}(x)$ indicate whether example $x$ has a label for task $t$. The loss for head $t$ is

$$\mathcal{L}_{t}=\frac{1}{|\mathcal{B}_{t}|}\sum_{x\in\mathcal{B}_{t}}\mathrm{CE}\bigl(\hat{y}^{t}(x),\,y^{t}(x)\bigr),\qquad\mathcal{B}_{t}=\{x:\mu^{t}(x)=1\},$$

so missing labels simply do not contribute to that head. Per-source label coverage is in Table [2](https://arxiv.org/html/2605.06903#S4.T2).
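A minimal sketch of this per-task masked loss, under the assumption that unobserved labels are encoded as `-100` (the encoding is our choice; the paper only specifies the mask $\mu^{t}$):

```python
# Sketch of L_t: cross-entropy over only the rows where task t is labeled.
import torch
import torch.nn.functional as F

def masked_task_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    observed = labels != -100                 # rows with mu^t(x) = 1
    if observed.sum() == 0:                   # no labels for this head in batch
        return logits.new_zeros(())
    return F.cross_entropy(logits[observed], labels[observed])
```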
### 3.3 Composite training objective
MELD combines three terms: an uncertainty-weighted multi-task classification loss, a teacher–student distillation loss between clean and attacked views, and a ranking loss on hard human/AI pairs. A compact pseudocode view of the full training step is provided in Appendix [A](https://arxiv.org/html/2605.06903#A1).
#### Homoscedastic uncertainty weighting.
Following Kendall et al. [[19](https://arxiv.org/html/2605.06903#bib.bib16)], each task has a learned scalar $s_{t}=\log\sigma_{t}^{2}$:

$$\mathcal{L}_{\text{cls}}=\sum_{t\in\mathcal{T}}\Bigl(e^{-s_{t}}\,\mathcal{L}_{t}+\tfrac{1}{2}s_{t}\Bigr).$$

The precision term $e^{-s_{t}}$ controls the weight of task $t$, while the additive $s_{t}$ term prevents the optimizer from driving $s_{t}\to\infty$. The $s_{t}$ values are optimized jointly with the encoder and provide a useful diagnostic of how the relative weighting of the tasks evolves over training (Appendix [B](https://arxiv.org/html/2605.06903#A2)).
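This weighting amounts to one learned scalar per task; a sketch of the formula above, with names of our own choosing:

```python
# Sketch of homoscedastic uncertainty weighting (Kendall et al. [19]):
# s_t = log sigma_t^2, learned jointly with the encoder.
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    def __init__(self, tasks=("main", "gen", "atk", "dom")):
        super().__init__()
        # One log-variance per task, initialized to 0 (equal initial weights).
        self.s = nn.ParameterDict({t: nn.Parameter(torch.zeros(())) for t in tasks})

    def forward(self, losses: dict) -> torch.Tensor:
        # L_cls = sum_t exp(-s_t) * L_t + s_t / 2
        return sum(torch.exp(-self.s[t]) * loss + 0.5 * self.s[t]
                   for t, loss in losses.items())
```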
#### Teacher–student distillation with clean and attacked views.
For each example $x$ in the minibatch, we form two views: a clean view $x^{c}$ (the original text) and a possibly attacked view $x^{a}$. With probability $p{=}0.5$ the attacked view is produced by one of the synthetic attacks (e.g., homoglyph substitution, whitespace perturbation, character typo, or synonym swap), sampled uniformly; otherwise we set $x^{a}{=}x^{c}$. The augmentation is label-blind: human and AI rows are sampled the same way. The EMA teacher $T_{\bar{\theta}}$ always takes the clean view $x^{c}$; the student $S_{\theta}$ always takes $x^{a}$. We match the student's main-head distribution to the teacher's by KL divergence on the binary main head [[16](https://arxiv.org/html/2605.06903#bib.bib49)]. Let $z_{\text{main}}^{T}(x^{c}),z_{\text{main}}^{S}(x^{a})\in\mathbb{R}^{2}$ denote the main-head logits for the teacher and student views, and let $p^{T}=\mathrm{softmax}(z_{\text{main}}^{T}(x^{c})/\tau_{\text{tea}})$ and $p^{S}=\mathrm{softmax}(z_{\text{main}}^{S}(x^{a})/\tau_{\text{stu}})$:

$$\mathcal{L}_{\text{ema}}=\mathrm{KL}\bigl(p^{T}\,\big\|\,p^{S}\bigr).$$

The asymmetric temperatures $\tau_{\text{tea}}{=}0.04<\tau_{\text{stu}}{=}0.10$ make the teacher distribution sharper than the student distribution. The teacher parameters follow the student by EMA, $\bar{\theta}\leftarrow\beta\bar{\theta}+(1-\beta)\theta$ with $\beta{=}0.999$, and gradients are stopped through the teacher. On the augmented half of the batch, the loss pulls the student's prediction on the perturbed text toward the teacher's prediction on the clean text, encouraging attack-invariance. On the unaugmented half ($x^{a}{=}x^{c}$), it reduces to temporal self-distillation between the EMA teacher and student.

All supervised classification losses are applied to the student view $x^{a}$. The attack head is supervised by each row's original attack label, since the synthetic augmentations (homoglyph, whitespace, typo, synonym) are light surface-level edits that do not change the underlying attack family. Rows without an attack label are skipped.
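A sketch of the EMA update and distillation term under the definitions above (PyTorch; we assume `teacher` is a parameter-for-parameter copy of the student, and stop gradients through the teacher with `detach`):

```python
# Sketch of the clean/attacked distillation term; temperatures follow Sec. 3.3.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, beta: float = 0.999):
    # theta_bar <- beta * theta_bar + (1 - beta) * theta
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(beta).add_(ps, alpha=1.0 - beta)

def distill_loss(z_teacher_clean, z_student_attacked,
                 tau_tea: float = 0.04, tau_stu: float = 0.10):
    # Sharper teacher (lower temperature); no gradient through the teacher.
    p_t = F.softmax(z_teacher_clean.detach() / tau_tea, dim=-1)
    log_p_s = F.log_softmax(z_student_attacked / tau_stu, dim=-1)
    # KL(p^T || p^S), batch-averaged.
    return F.kl_div(log_p_s, p_t, reduction="batchmean")
```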
#### Hard-negative pairwise ranking loss.
Binary cross-entropy does not explicitly shape the part of the score distribution near the decision boundary, where the hardest human samples sit. Let $y_{i}\in\{0,1\}$ denote the binary main label, with $y_{i}{=}1$ for AI and $y_{i}{=}0$ for human, and let $m_{i}=z_{i}^{\text{AI}}-z_{i}^{\text{Human}}$ be the main-head margin. For each minibatch, we mine hard human negatives by taking the top-$K$ highest-margin humans, where $K=\lceil\alpha N_{\text{Human}}\rceil$, $\alpha$ controls how narrowly the loss focuses on the hardest human tail, and $\mathrm{TopK}_{K}(m,\text{Human})$ denotes the index set of those $K$ highest-margin human examples. We set $\alpha=0.05$ as a stable default:

$$\mathcal{L}_{\text{rank}}=\frac{1}{N_{\text{AI}}}\sum_{i:\,y_{i}=1}\frac{1}{K}\sum_{j\in\mathrm{TopK}_{K}(m,\text{Human})}\log\bigl(1+e^{(m_{j}-m_{i})/\tau_{r}}\bigr),$$

with temperature $\tau_{r}{=}0.5$. The top-$K$ selection is a within-batch approximation of the upper-$\alpha$ quantile of the human score distribution, so each AI sample is pushed above the hardest negatives in its own batch rather than over an arbitrary mean. This formulation follows hard-negative mining in metric learning and retrieval [[29](https://arxiv.org/html/2605.06903#bib.bib45), [15](https://arxiv.org/html/2605.06903#bib.bib46)]. The total loss is

$$\mathcal{L}=\mathcal{L}_{\text{cls}}+\lambda_{\text{ema}}\,\mathcal{L}_{\text{ema}}+\lambda_{\text{rank}}\,\mathcal{L}_{\text{rank}},\qquad\lambda_{\text{ema}}=1.0,\ \lambda_{\text{rank}}=0.5.$$

Section [4.4](https://arxiv.org/html/2605.06903#S4.SS4) describes the ablation protocol used to isolate these terms.
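Since $\log(1+e^{x})$ is the softplus function, the ranking term reduces to a softplus over pairwise margin differences. A sketch of the formula above with illustrative names:

```python
# Sketch of the hard-negative pairwise ranking loss (Section 3.3); margins m_i
# are AI-minus-Human main-head logits; alpha and tau_r follow the paper.
import math
import torch
import torch.nn.functional as F

def hard_negative_rank_loss(margins: torch.Tensor, labels: torch.Tensor,
                            alpha: float = 0.05, tau_r: float = 0.5):
    m_ai = margins[labels == 1]                   # AI rows
    m_hum = margins[labels == 0]                  # human rows
    if m_ai.numel() == 0 or m_hum.numel() == 0:
        return margins.new_zeros(())
    k = max(1, math.ceil(alpha * m_hum.numel()))  # K = ceil(alpha * N_Human)
    hard = torch.topk(m_hum, k).values            # hardest human margins
    # log(1 + exp((m_j - m_i)/tau_r)) over every (AI i, hard human j) pair;
    # mean over both axes gives the (1/N_AI)(1/K) double sum.
    diff = (hard.unsqueeze(0) - m_ai.unsqueeze(1)) / tau_r   # (N_AI, K)
    return F.softplus(diff).mean()
```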
(a) Generator separability. UMAP embeddings colored by generator, with human texts shown in black. B/W denotes the ratio of between-generator to within-generator distance, so higher values indicate more compact generator-specific clusters and better separation across generators. MELD produces the clearest generator structure and substantially less human/AI overlap than the baselines.

(b) Attack invariance. Spokes connect clean-source centroids to their attacked variants. Lower W/B means that attacked variants stay closer to their clean source while different sources remain separated. MELD shows the shortest spokes, suggesting that attacks leave their embeddings closest to the corresponding clean source rather than pushing them toward unrelated sources or human text.

Figure 2: Backbone geometry. UMAP of ~112,000 embeddings per panel from the evaluated detectors. A robust detector should separate human and AI text, preserve generator-level structure, and keep attacked variants near their clean sources. MELD best matches this geometry, with the highest generator separability, the lowest attack displacement, and visibly less human/AI overlap than the baselines.
## 4 Experiments

### 4.1 Datasets

#### Training mixture.
Table 2: Training dataset mix. Per-source rows, sampling ratio, AI-to-human share (%), and label coverage (✓ = present, – = absent). Rows reflect the listed sampling ratio over one training epoch. The held-out MELD-eval pool is described in Appendix [C](https://arxiv.org/html/2605.06903#A3).

We train MELD on a 6.60M-row mixture of seven public sources (Table [2](https://arxiv.org/html/2605.06903#S4.T2)). Only RAID [[9](https://arxiv.org/html/2605.06903#bib.bib7)] carries all four labels. MAGE-train [[21](https://arxiv.org/html/2605.06903#bib.bib19)], M4GT-train [[37](https://arxiv.org/html/2605.06903#bib.bib33)], DetectRL-train [[41](https://arxiv.org/html/2605.06903#bib.bib37)], Ghostbuster-train [[36](https://arxiv.org/html/2605.06903#bib.bib31)], and WildChat [[44](https://arxiv.org/html/2605.06903#bib.bib39)] provide main, generator, and domain labels but no attack labels. FineWeb [[26](https://arxiv.org/html/2605.06903#bib.bib22)] is human-only and provides main and domain. We include FineWeb to balance the human/AI ratio and to expose the detector to a broader distribution of human web text. Sources are mixed at fixed per-batch ratios. Small sources are oversampled and RAID is downsampled so that every source contributes meaningfully to each batch. Missing auxiliary labels are masked, so each source feeds only the heads it can supervise. We restrict FineWeb to pre-CC-MAIN-2020 dumps to limit post-LLM contamination on the human side, and we deduplicate every training row by text hash against all evaluation pools.
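A minimal sketch of fixed per-batch source mixing as described above; the quota arithmetic and names are our illustration, not the released training code:

```python
# Sketch: yield batches with a fixed per-source quota. Small sources are
# effectively oversampled (their iterator restarts via cycling); large ones
# (e.g. RAID) are capped by their quota. Ratios here are placeholders.
import random
from itertools import cycle

def mixed_batches(sources: dict, ratios: dict, batch_size: int):
    # Shuffle each source once, then cycle through it indefinitely.
    iters = {name: cycle(random.sample(rows, len(rows)))
             for name, rows in sources.items()}
    quotas = {name: max(1, round(r * batch_size)) for name, r in ratios.items()}
    while True:
        batch = [next(iters[name]) for name, q in quotas.items()
                 for _ in range(q)]
        random.shuffle(batch)
        yield batch[:batch_size]
```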
#### MELD-eval.
To test transfer to selected current-generation chat models, we build a held-out pool from four generators: GPT 5.4 Mini (OpenAI), Gemini 3 Flash (Google), Claude Haiku 4.5 (Anthropic), and Qwen 3.6 Plus (Alibaba) [[25](https://arxiv.org/html/2605.06903#bib.bib40), [11](https://arxiv.org/html/2605.06903#bib.bib41), [3](https://arxiv.org/html/2605.06903#bib.bib42), [2](https://arxiv.org/html/2605.06903#bib.bib43)]. We sample up to 1,000 paired human prompts from each of eight RAID English domains (books, news, abstracts, recipes, reddit, reviews, wiki, poetry). We query each generator under a common no-preamble, no-markdown template, strip residual markdown uniformly from both AI and human text to remove formatting fingerprints, and apply RAID-style attacks. The pool contains 7,862 paired human texts, 31,448 clean AI rows, and 188,688 attacked AI rows. Full construction details are in Appendix [C](https://arxiv.org/html/2605.06903#A3).
MELD-eval is a controlled generator-shift test. It is held out with respect to the four generators, while reusing RAID-style English domains, human seeds, and attacks so that changes in detector behavior can be attributed primarily to generator shift. We therefore interpret it as evidence of transfer to these selected current-generation chat models under a controlled protocol, not as universal robustness to arbitrary domains.
### 4.2 Training setup
The encoder backbone is Ettin-400M [[39](https://arxiv.org/html/2605.06903#bib.bib35)], with 396M trainable parameters including three linear auxiliary heads and one MLP main head. We train for one epoch at sequence length 2048 on three NVIDIA H200 GPUs under DDP (effective batch size 384, ~6.7 h). Optimization uses AdamW (learning rate 4×10⁻⁵, 1,500 warmup steps then cosine decay, weight decay 0.01), bfloat16 mixed precision, dropout 0.1, and label smoothing 0.05 on the binary main head. Documents are truncated at training time and split into overlapping 2048-token chunks at evaluation time, with per-chunk scores mean-aggregated. The final checkpoint is a Stochastic Weight Averaging (SWA) [[18](https://arxiv.org/html/2605.06903#bib.bib15)] average over the top ten checkpoints by AUROC on a held-out 5K validation split (SWA window from step 2,000). We report paired-significance tests against the strongest baselines in Appendix [E](https://arxiv.org/html/2605.06903#A5).
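Evaluation-time chunking can be sketched as follows. The stride value, the dict-style model output (a `"main"` logits entry, as in the `MELDHeads` sketch above), and the convention that class index 1 is "AI" are our assumptions:

```python
# Sketch: split a long document into overlapping 2048-token windows and
# mean-aggregate the per-chunk P(AI) scores, as described in Section 4.2.
import torch

@torch.no_grad()
def score_document(model, tokenizer, text: str,
                   max_len: int = 2048, stride: int = 1024) -> float:
    ids = tokenizer(text, add_special_tokens=True, truncation=False)["input_ids"]
    if not ids:
        return 0.0
    scores = []
    for start in range(0, len(ids), stride):
        chunk = torch.tensor([ids[start:start + max_len]])
        mask = torch.ones_like(chunk)
        logits = model(input_ids=chunk, attention_mask=mask)["main"]
        scores.append(torch.softmax(logits, dim=-1)[0, 1].item())  # P(AI)
        if start + max_len >= len(ids):
            break
    return sum(scores) / len(scores)  # mean-aggregated document score
```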
### 4.3 Evaluation protocol
We evaluate four settings. First, we report the public RAID leaderboard metrics [[9](https://arxiv.org/html/2605.06903#bib.bib7)]: AUROC, TPR@5% FPR, and TPR@1% FPR. Second, we re-evaluate published detectors on five held-out benchmarks: HC3 [[12](https://arxiv.org/html/2605.06903#bib.bib10)], MAGE [[21](https://arxiv.org/html/2605.06903#bib.bib19)], M4GT [[37](https://arxiv.org/html/2605.06903#bib.bib33)], Ghostbuster [[36](https://arxiv.org/html/2605.06903#bib.bib31)], and DetectRL [[41](https://arxiv.org/html/2605.06903#bib.bib37)]. Third, we evaluate current-generation transfer on MELD-eval (Section [4.1](https://arxiv.org/html/2605.06903#S4.SS1)). Fourth, we run loss-component ablations and representation analyses to isolate which parts of the training objective matter.

For the baselines, we use each method's official inference code or public checkpoint when available. Unless a table states otherwise, scores are computed on the full held-out pool, with no subsampling for MELD-eval. Each detector's TPR is reported at its own pool-specific FPR threshold, computed from the human score distribution of that pool (per-pool thresholds, not a single global threshold). This protocol measures score separability at a target FPR under pool-specific calibration. It should not be read as evidence that a single fixed threshold transfers unchanged across domains, institutions, or deployment populations; fixed-threshold deployment requires a held-out calibration population matched to the intended use case, which we treat as a deployment-layer requirement rather than part of the evaluation-pool comparison. We emphasize low-FPR operating points because high AUROC is already saturated for many supervised detectors, and because low false-positive rates are critical at the volumes seen in academic integrity and content moderation. Every cell of Tables [3](https://arxiv.org/html/2605.06903#S5.T3) and [4](https://arxiv.org/html/2605.06903#S5.T4) is annotated with the half-width (±) of a 95% percentile bootstrap confidence interval (CI) on the cell's metric (B=5,000 resamples of the per-row scores).
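For concreteness, the per-pool calibration amounts to taking the $(1-\text{FPR})$-quantile of that pool's human scores as the threshold, and the CI annotation is a percentile bootstrap over per-row scores. A sketch with illustrative names (NumPy):

```python
# Sketch of the per-pool protocol in Section 4.3.
import numpy as np

def tpr_at_fpr(human_scores, ai_scores, fpr: float = 0.01) -> float:
    human, ai = np.asarray(human_scores), np.asarray(ai_scores)
    # Pool-specific threshold: at most `fpr` of human rows may exceed it.
    threshold = np.quantile(human, 1.0 - fpr)
    return float((ai > threshold).mean())

def bootstrap_ci_halfwidth(metric_fn, human, ai, n_boot=5000, seed=0):
    """Half-width of a 95% percentile bootstrap CI over per-row resamples."""
    rng = np.random.default_rng(seed)
    stats = [metric_fn(rng.choice(human, human.size, replace=True),
                       rng.choice(ai, ai.size, replace=True))
             for _ in range(n_boot)]
    lo, hi = np.percentile(stats, [2.5, 97.5])
    return (hi - lo) / 2

# Example: tpr_at_fpr(h, a, fpr=0.01) gives TPR@1%FPR for one pool.
```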
### 4.4 Ablation protocol
To isolate the training objective in Section [3.3](https://arxiv.org/html/2605.06903#S3.SS3), we retrain MELD while removing or replacing one component at a time: the auxiliary heads, the hard-negative ranking term, EMA distillation, or Kendall uncertainty weighting. All ablations use the same backbone, data mixture, training budget, and SWA selection rule as the full model. We report the ablation results on HC3 and TuringBench [[35](https://arxiv.org/html/2605.06903#bib.bib29)], two out-of-distribution pools where the low-FPR tail is not fully saturated. We also inspect the learned uncertainty schedule, per-attack robustness on RAID, and the geometry of the backbone representation. Appendix [E](https://arxiv.org/html/2605.06903#A5) also reports a same-data retraining control in which the strongest supervised baselines are retrained on MELD's data mixture, to separate the effect of the training corpus from the effect of the multi-task objective.
## 5 Results and discussion

### 5.1 RAID public leaderboard
Table [1](https://arxiv.org/html/2605.06903#S1.T1) reports MELD's performance on the public RAID leaderboard, the largest and most comprehensive benchmark for AI-text detectors. MELD is the strongest open-source system in the table and is competitive with leading commercial systems. Averaged over all three metrics in the attacked setting, the gap between MELD and the next-best open-source model is more than 10 times larger than the gap between MELD and the best commercial model. This is notable because commercial detectors can be trained with much larger datasets (for example, GPTZero reports using over 4× more training data than our mixture [[1](https://arxiv.org/html/2605.06903#bib.bib1)]).
### 5.2 Additional benchmarks and transfer on MELD-eval
Table 3: Held-out pool comparison. AUROC (×100) on five standard benchmarks and MELD-eval. All baseline detectors are re-evaluated on the same held-out pools using public checkpoints or official inference code. Best/second-best entries per column.

In Table [3](https://arxiv.org/html/2605.06903#S5.T3) we compare MELD with training-free reference-LM detectors, supervised encoders, and recent representation-based systems. MELD is the strongest supervised detector on four of the five standard held-out benchmarks. The main exception is HC3, where RoBERTa-ChatGPT is slightly stronger; notably, this benchmark is much closer to RoBERTa-ChatGPT's original training distribution. The broader pattern is that our multi-source, multi-task objective transfers well across datasets whose generator families, domains, and attack coverage differ from one another. The same-data retraining control in Appendix [E](https://arxiv.org/html/2605.06903#A5) suggests that these gains are not explained by the training mixture alone.

Table 4: MELD-eval results by generator. TPR@1% FPR (×100) on MELD-eval for each current-generation generator and overall, evaluated against the paired human texts. All detectors are evaluated zero-shot with respect to these four generators. Best/second-best entries per column.

In Table [4](https://arxiv.org/html/2605.06903#S5.T4) we evaluate whether detector behavior transfers beyond the generator families in public benchmarks. MELD remains strong across all four MELD-eval generators. Most previous supervised or zero-shot detectors have very low TPR@1% FPR on MELD-eval under the same per-pool calibration protocol. ModernBERT-Detect is the only non-MELD baseline that transfers reasonably, but MELD is more stable across generator families. Appendix [D](https://arxiv.org/html/2605.06903#A4) provides per-text examples from HC3, DetectRL, and MELD-eval.
### 5.3 Ablations and representation analysis
Table 5: Loss-component ablation. TPR@1% FPR and TPR@5% FPR (×100) on HC3 and TuringBench. Each ablation is trained from scratch with the same data mixture, backbone, and training budget after removing one component from MELD. The Dense row removes the auxiliary heads. Bold marks the best entry in each metric column.

Table [5](https://arxiv.org/html/2605.06903#S5.T5) shows that each component of the objective contributes to performance. The hard-negative ranking term is especially important on HC3, where AUROC can remain high even when the deployment threshold is poorly shaped. On the harder TuringBench pool, the auxiliary heads, ranking loss, and learned uncertainty weighting all carry substantial weight. The effect of EMA distillation is smaller but positive, consistent with its role as an attack-invariance regularizer rather than the only source of separation.

Figure 3: Distance-space geometry. Per-detector within-source vs. between-source cosine-distance distributions. A better representation keeps same-source variants close while separating different sources, leading to less overlap between the two distributions. MELD shows the clearest separation and reaches Cohen's d′ = 3.28, ~7× the strongest baseline (ModernBERT-Detect, d′ = 0.47).

Figure [2](https://arxiv.org/html/2605.06903#S3.F2) provides a representation-level view of the same effect. MELD separates generator structure more clearly than other supervised detectors while keeping attacked variants close to their clean sources. This is the intended geometry of the auxiliary heads and clean/attacked distillation: the representation should preserve source information without treating attacked texts as new classes. Figure [3](https://arxiv.org/html/2605.06903#S5.F3) further shows that on 600 RAID prompts × 12 generators × 8 attacks, the within-source (same prompt, different attack) and between-source (different prompt) cosine-distance distributions computed from each detector's frozen ℓ2-normalized backbone are visibly disjoint for MELD, while the corresponding distributions overlap heavily for the other detectors. Figure [4](https://arxiv.org/html/2605.06903#S5.F4) breaks the RAID attacked setting down by attack type. MELD is stable across the attacks scored by RAID, including character-level and paraphrase-style perturbations. The per-attack pattern matches the embedding analysis: the model learns to keep attacks close to the underlying clean source rather than overfitting to a narrow attack signature.

Figure 4: Per-attack robustness on RAID. TPR@5% FPR on the official RAID test set, aggregated over domain, generator, decoding, and repetition. We compare open-source detectors with public papers or models. Bold marks the best cell per attack. "–" denotes an attack not scored by that submission.
### 5.4 Limitations
The current evaluation focuses on English text, instruction-tuned chat-model outputs, and RAID-style domains and attacks. MELD is not evaluated on multilingual writing, heavily edited mixed-authorship human–AI text, or demographic variation among writers. The reported TPR-at-FPR numbers use per-pool calibration thresholds computed from each pool's human-score distribution; under a single fixed threshold transferred across domains, low-FPR performance is expected to degrade, so deployment requires calibration on a representative target population. We also do not report length-stratified results for short-text settings such as social media posts, exam short answers, or brief comments. The auxiliary heads are tied to the generator, attack, and domain distribution of the training mixture, so refreshing this label space as new generators and attacks appear is a natural extension. Prior work shows that AI-text detectors can exhibit systematic false-positive bias against non-native English writers [[22](https://arxiv.org/html/2605.06903#bib.bib18)], so deployment in critical scenarios (e.g., academic integrity) should require population-specific calibration.
## 6 Conclusion
MELD is an AI-text detector that achieves the strongest overall open-source performance across our evaluations. It is trained using richer supervision than the binary label alone: during training, explicit generator, attack, and domain heads shape the shared encoder. Learned uncertainty weighting balances these losses against the binary objective, while clean/attacked distillation and a hard-negative ranking term target robustness at low FPR. The results on RAID, the standard held-out benchmarks, MELD-eval, and the ablations show that this training-time structure improves the regimes that matter most in deployment: attacks, generator shift, and low false-positive thresholds. Future work should extend this idea to hybrid human–AI editing, multilingual detection, domain-specific calibration, and broader generator families beyond instruction-tuned chat models.
## References
- [1] G. A. Adam, A. Cui, E. Thomas, E. Napier, N. Shmatko, J. Schnell, J. J. Tian, A. Dronavalli, E. Tian, and D. Lee (2026). GPTZero: robust detection of LLM-generated texts. arXiv preprint arXiv:2602.13042.
- [2] (2026). Qwen 3.6 Plus. API model snapshot qwen3.6-plus-04-02.
- [3] Anthropic (2025). Claude Haiku 4.5. API model snapshot claude-haiku-4.5-20251001. [Link](https://www.anthropic.com/news/claude-haiku-4-5).
- [4] G. Bao, Y. Zhao, Z. Teng, L. Yang, and Y. Zhang (2024). Fast-DetectGPT: efficient zero-shot detection of machine-generated text via conditional probability curvature. In The Twelfth International Conference on Learning Representations.
- [5] A. R. Basani and P. Chen (2025). Diversity boosts AI-generated text detection. arXiv preprint arXiv:2509.18880.
- [6] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pp. 89–96.
- [7] X. Chen, J. Wu, S. Yang, R. Zhan, Z. Wu, Z. Luo, D. Wang, M. Yang, L. S. Chao, and D. F. Wong (2025). RepreGuard: detecting LLM-generated text by revealing hidden representation patterns. Transactions of the Association for Computational Linguistics, 13, pp. 1812–1831.
- [8] G. Drayson, E. Yilmaz, and V. Lampos (2025). Machine-generated text detection prevents language model collapse. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 29645–29661.
- [9] L. Dugan, A. Hwang, F. Trhlík, A. Zhu, J. M. Ludan, H. Xu, D. Ippolito, and C. Callison-Burch (2024). RAID: a shared benchmark for robust evaluation of machine-generated text detectors. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12463–12492.
- [10] S. Gehrmann, H. Strobelt, and A. M. Rush (2019). GLTR: statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 111–116.
- [11] Google DeepMind (2025). Gemini 3 Flash. API model snapshot gemini-3-flash-preview-20251217.
- [12] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu (2023). How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597.
- [13] X. Guo, S. Zhang, Y. He, T. Zhang, W. Feng, H. Huang, and C. Ma (2024). DeTeCtive: detecting AI-generated text via multi-level contrastive learning. In Proceedings of the 38th International Conference on Neural Information Processing Systems, pp. 88320–88347.
- [14] A. Hans, A. Schwarzschild, V. Cherepanova, H. Kazemi, A. Saha, M. Goldblum, J. Geiping, and T. Goldstein (2024). Spotting LLMs with binoculars: zero-shot detection of machine-generated text. In Proceedings of the 41st International Conference on Machine Learning, pp. 17519–17537.
- [15] A. Hermans, L. Beyer, and B. Leibe (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
- [16] G. Hinton, O. Vinyals, and J. Dean (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- [17] X. Hu, P. Chen, and T. Ho (2023). RADAR: robust AI-text detection via adversarial learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 15077–15095.
- [18] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. G. Wilson (2018). Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.
- [19] A. Kendall, Y. Gal, and R. Cipolla (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482–7491.
- [20] K. Krishna, Y. Song, M. Karpinska, J. Wieting, and M. Iyyer (2023). Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36, pp. 27469–27500.
- [21] Y. Li, Q. Li, L. Cui, W. Bi, Z. Wang, L. Wang, L. Yang, S. Shi, and Y. Zhang (2024). MAGE: machine-generated text detection in the wild. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 36–53.
- [22] W. Liang, M. Yuksekgonul, Y. Mao, E. Wu, and J. Zou (2023). GPT detectors are biased against non-native English writers. Patterns, 4(7).
- [23] K. Meshgi, M. S. Mirzaei, and S. Sekine (2022). Uncertainty regularized multi-task learning. In Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis, pp. 78–88.
- [24] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, and C. Finn (2023). DetectGPT: zero-shot machine-generated text detection using probability curvature. In Proceedings of the 40th International Conference on Machine Learning, pp. 24950–24962.
- [25] OpenAI (2026). GPT-5.4 Mini. API model snapshot gpt-5.4-mini-20260317.
- [26] G. Penedo, H. Kydlíček, A. Lozhkov, M. Mitchell, C. Raffel, L. Von Werra, T. Wolf, et al. (2024). The FineWeb datasets: decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37, pp. 30811–30849.
- [27] X. Pu, Z. Cheng, L. Yuan, Y. Wu, and X. Bi (2026). Breaking the generator barrier: disentangled representation for generalizable AI-text detection. arXiv preprint arXiv:2604.13692.
- [28] QuillBot, a Learneo, Inc. business (2025). QuillBot AI content detector. Commercial product; performance reported on the public RAID leaderboard at [https://raid-bench.xyz](https://raid-bench.xyz/).
- [29] F. Schroff, D. Kalenichenko, and J. Philbin (2015). FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
- [30] I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J. W. Kim, S. Kreps, et al. (2019). Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203.
- [31] Superhuman Platform Inc. (2026). Grammarly AI writing detector. Commercial product; performance reported on the public RAID leaderboard at [https://raid-bench.xyz](https://raid-bench.xyz/).
- [32] M. N. Ta, D. C. Van, D. Hoang, M. Le-Anh, T. Nguyen, M. A. T. Nguyen, Y. Wang, P. Nakov, and D. V. Sang (2026). FAID: fine-grained AI-generated text detection using multi-task auxiliary and multi-level contrastive learning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3275–3296.
- [33] A. Tarvainen and H. Valpola (2017). Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1195–1204.
- [34] L. S. Teja, A. Yadagiri, S. S. Anish, S. G. K. Nuthakki, and P. Pakray (2026). Modeling the attack: detecting AI-generated text by quantifying adversarial perturbations. In 2026 20th International Conference on Ubiquitous Information Management and Communication (IMCOM), pp. 1–8.
- [35] A. Uchendu, Z. Ma, T. Le, R. Zhang, and D. Lee (2021). TURINGBENCH: a benchmark environment for Turing test in the age of neural text generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2001–2016.
- [36] V. Verma, E. Fleisig, N. Tomlin, and D. Klein (2024). Ghostbuster: detecting text ghostwritten by large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1702–1717.
- [37] Y. Wang, J. Mansurov, P. Ivanov, J. Su, A. Shelmanov, A. Tsvigun, O. M. Afzal, T. Mahmoud, G. Puccetti, T. Arnold, et al. (2024). M4GT-Bench: evaluation benchmark for black-box machine-generated text detection. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3964–3992.
- [38] B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas, F. Ladhak, T. Aarsen, et al. (2025). Smarter, better, faster, longer: a modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2526–2547.
- [39] O. Weller, K. Ricci, M. Marone, A. Chaffin, D. Lawrie, and B. Van Durme (2026). Seq vs seq: an open suite of paired encoders and decoders. In The Fourteenth International Conference on Learning Representations.
- [40] C. Wu, Y. Cheung, B. Han, and D. Lian (2025). Advancing machine-generated text detection from an easy to hard supervision perspective. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [41] J. Wu, R. Zhan, D. F. Wong, S. Yang, X. Yang, Y. Yuan, and L. S. Chao (2024). DetectRL: benchmarking LLM-generated text detection in real-world scenarios. Advances in Neural Information Processing Systems, 37, pp. 100369–100401.
- [42] K. Wu, L. Pang, H. Shen, X. Cheng, and T. Chua (2023). LLMDet: a third party large language models generated text detection tool. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 2113–2133.
- [43] C. Zeng, S. Tang, Y. Chen, Z. Shen, W. Yu, X. Zhao, H. Chen, W. Cheng, et al. (2025). Human texts are outliers: detecting LLM-generated texts via out-of-distribution detection. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [44] W. Zhao, X. Ren, J. Hessel, C. Cardie, Y. Choi, and Y. Deng (2024). WildChat: 1M ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations.
## Appendix AMELD training pseudocode
Box 1: MELD training step Input:batchℬ=\{x,y,g,a,d\}\\mathcal\{B\}=\\\{x,y,g,a,d\\\}with masksμt\\mu^\{t\}; studentSθS\_\{\\theta\}; EMA teacherTθ¯T\_\{\\bar\{\\theta\}\} Hyperparams:p,α,τtea,τstu,τr,λema,λrank,βp,\\alpha,\\tau\_\{\\mathrm\{tea\}\},\\tau\_\{\\mathrm\{stu\}\},\\tau\_\{r\},\\lambda\_\{\\mathrm\{ema\}\},\\lambda\_\{\\mathrm\{rank\}\},\\betaforminibatchℬ\\mathcal\{B\}doxc←xx^\{c\}\\leftarrow xwith prob\.pp:xa←Augment\(x\)x^\{a\}\\leftarrow\\mathrm\{Augment\}\(x\); elsexa←xcx^\{a\}\\leftarrow x^\{c\}\(zmainS,zgenS,zatkS,zdomS\)←Sθ\(xa\)\(z\_\{\\mathrm\{main\}\}^\{S\},z\_\{\\mathrm\{gen\}\}^\{S\},z\_\{\\mathrm\{atk\}\}^\{S\},z\_\{\\mathrm\{dom\}\}^\{S\}\)\\leftarrow S\_\{\\theta\}\(x^\{a\}\)zmainT←stopgrad\(Tθ¯\(xc\)\)z\_\{\\mathrm\{main\}\}^\{T\}\\leftarrow\\mathrm\{stopgrad\}\(T\_\{\\bar\{\\theta\}\}\(x^\{c\}\)\)ℒt←MaskedCE\(ztS,yt;μt\)\\mathcal\{L\}\_\{t\}\\leftarrow\\mathrm\{MaskedCE\}\(z\_\{t\}^\{S\},y^\{t\};\\mu^\{t\}\)fort∈\{main,gen,atk,dom\}t\\in\\\{\\mathrm\{main\},\\mathrm\{gen\},\\mathrm\{atk\},\\mathrm\{dom\}\\\}ℒcls←∑t∈\{main,gen,atk,dom\}\(e−stℒt\+12st\)\\mathcal\{L\}\_\{\\mathrm\{cls\}\}\\leftarrow\\sum\_\{t\\in\\\{\\mathrm\{main\},\\mathrm\{gen\},\\mathrm\{atk\},\\mathrm\{dom\}\\\}\}\\bigl\(e^\{\-s\_\{t\}\}\\mathcal\{L\}\_\{t\}\+\\tfrac\{1\}\{2\}s\_\{t\}\\bigr\)pT←softmax\(zmainT/τtea\)p^\{T\}\\leftarrow\\mathrm\{softmax\}\(z\_\{\\mathrm\{main\}\}^\{T\}/\\tau\_\{\\mathrm\{tea\}\}\),pS←softmax\(zmainS/τstu\)p^\{S\}\\leftarrow\\mathrm\{softmax\}\(z\_\{\\mathrm\{main\}\}^\{S\}/\\tau\_\{\\mathrm\{stu\}\}\)ℒema←KL\(pT∥pS\)\\mathcal\{L\}\_\{\\mathrm\{ema\}\}\\leftarrow\\mathrm\{KL\}\\\!\\left\(p^\{T\}\\,\\\|\\,p^\{S\}\\right\)mi←zmain,iS,AI−zmain,iS,Humanm\_\{i\}\\leftarrow z\_\{\\mathrm\{main\},i\}^\{S,\\mathrm\{AI\}\}\-z\_\{\\mathrm\{main\},i\}^\{S,\\mathrm\{Human\}\}, withyi=1y\_\{i\}\{=\}1for AI andyi=0y\_\{i\}\{=\}0for humanK←⌈αNHuman⌉K\\leftarrow\\lceil\\alpha N\_\{\\mathrm\{Human\}\}\\rceilHK←TopKK\(m,Human\)H\_\{K\}\\leftarrow\\mathrm\{TopK\}\_\{K\}\(m,\\mathrm\{Human\}\)\(index set of top\-KKhuman margins\)ℒrank←Pairwise\(\{mi:yi=1\},HK;τr\)\\mathcal\{L\}\_\{\\mathrm\{rank\}\}\\leftarrow\\mathrm\{Pairwise\}\(\\\{m\_\{i\}:y\_\{i\}=1\\\},H\_\{K\};\\tau\_\{r\}\)ℒ←ℒcls\+λemaℒema\+λrankℒrank\\mathcal\{L\}\\leftarrow\\mathcal\{L\}\_\{\\mathrm\{cls\}\}\+\\lambda\_\{\\mathrm\{ema\}\}\\mathcal\{L\}\_\{\\mathrm\{ema\}\}\+\\lambda\_\{\\mathrm\{rank\}\}\\mathcal\{L\}\_\{\\mathrm\{rank\}\}θ←AdamWStep\(θ,∇θℒ\)\\theta\\leftarrow\\mathrm\{AdamWStep\}\(\\theta,\\nabla\_\{\\theta\}\\mathcal\{L\}\)θ¯←βθ¯\+\(1−β\)θ\\bar\{\\theta\}\\leftarrow\\beta\\bar\{\\theta\}\+\(1\-\\beta\)\\thetaend forInference: use only the main headzmainSz\_\{\\mathrm\{main\}\}^\{S\}\.
Figure 5: **Compact MELD training step.** Auxiliary heads, the EMA teacher, and ranking supervision are used only during training; inference uses only the main AI/Human head.
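For concreteness, the following is a minimal PyTorch-style sketch of the step in Box 1. It is not the released implementation: `student`/`teacher` are assumed to return the four head logits in order, and `augment`, the mask layout, the hinge form of $\mathrm{Pairwise}$, and the default hyperparameter values are illustrative placeholders.

```python
import math
import torch
import torch.nn.functional as F

TASKS = ("main", "gen", "atk", "dom")

def meld_training_step(student, teacher, batch, log_vars, opt, augment,
                       p=0.5, alpha=0.1, tau_tea=2.0, tau_stu=1.0, tau_r=1.0,
                       lam_ema=1.0, lam_rank=1.0, beta=0.999):
    # batch: inputs `x`, per-task labels batch[t] (main: 1 = AI, 0 = human),
    # and boolean masks batch["mu"][t] marking rows where task t is labeled.
    x, y = batch["x"], batch["main"]

    # Clean view for the EMA teacher; attack-augmented view for the student.
    x_clean = x
    x_aug = augment(x) if torch.rand(()).item() < p else x_clean

    # Student forward pass: main head plus three auxiliary heads.
    z = dict(zip(TASKS, student(x_aug)))
    with torch.no_grad():                       # stop-gradient teacher view
        z_main_t = teacher(x_clean)[0]

    # (i) Masked cross-entropy per task, balanced by learned log-variances s_t.
    loss_cls = 0.0
    for t in TASKS:
        m = batch["mu"][t]
        ce = F.cross_entropy(z[t][m], batch[t][m])
        loss_cls = loss_cls + torch.exp(-log_vars[t]) * ce + 0.5 * log_vars[t]

    # (ii) Distill the attacked student toward the clean teacher (main head).
    # F.kl_div(log q, p) computes KL(p || q), i.e. KL(teacher || student).
    p_t = F.softmax(z_main_t / tau_tea, dim=-1)
    log_p_s = F.log_softmax(z["main"] / tau_stu, dim=-1)
    loss_ema = F.kl_div(log_p_s, p_t, reduction="batchmean")

    # (iii) Hard-negative pairwise ranking on the AI-minus-Human margin:
    # every AI margin should beat the top-K human margins by at least tau_r.
    margin = z["main"][:, 1] - z["main"][:, 0]
    ai_m, hu_m = margin[y == 1], margin[y == 0]
    if ai_m.numel() and hu_m.numel():
        k = min(hu_m.numel(), max(1, math.ceil(alpha * hu_m.numel())))
        hard = hu_m.topk(k).values              # most AI-looking human texts
        diff = ai_m.unsqueeze(1) - hard.unsqueeze(0)
        loss_rank = F.relu(tau_r - diff).mean()
    else:
        loss_rank = margin.new_zeros(())

    loss = loss_cls + lam_ema * loss_ema + lam_rank * loss_rank
    opt.zero_grad()
    loss.backward()
    opt.step()

    # EMA teacher update: theta_bar <- beta * theta_bar + (1 - beta) * theta.
    with torch.no_grad():
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(beta).add_(sp, alpha=1.0 - beta)
    return loss.item()
```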
## Appendix B Kendall log-variance trajectories
Figure [6](https://arxiv.org/html/2605.06903#A2.F6) reports the learned log-variances $s_t$ for each task. We initialize all tasks with the same weight, and optimize $s_t$ jointly with the encoder. Lower $s_t$ corresponds to a larger multiplier $e^{-s_t}$ in the uncertainty-weighted loss. In our runs, the auxiliary heads move to lower $s_t$ later in training and therefore receive larger relative multipliers than the main head. This suggests that, within the joint objective, the auxiliary tasks continue to provide useful training signal later in optimization. We use this trajectory as a compact diagnostic of how training emphasis shifts across tasks over time.
Figure 6: **Per-task log-variances over training.** Lower $s_t$ corresponds to a larger learned loss multiplier. Later in training, the auxiliary heads move to lower $s_t$ and therefore receive larger relative multipliers than the main head.
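A minimal sketch of how the learned weights and this diagnostic could be implemented is shown below; the container and function names are illustrative, not taken from the release.

```python
import torch

TASKS = ("main", "gen", "atk", "dom")

# Learnable per-task log-variances s_t, initialized equally so every task
# starts with the same weight; they are optimized jointly with the encoder.
log_vars = torch.nn.ParameterDict(
    {t: torch.nn.Parameter(torch.zeros(())) for t in TASKS})

def uncertainty_weighted(per_task_losses):
    # exp(-s_t) * L_t + s_t / 2: lower s_t means a larger loss multiplier.
    return sum(torch.exp(-log_vars[t]) * l + 0.5 * log_vars[t]
               for t, l in per_task_losses.items())

# Recording s_t once per step yields the Figure 6 trajectories.
history = {t: [] for t in TASKS}
def log_trajectories():
    for t in TASKS:
        history[t].append(log_vars[t].item())
```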
## Appendix C MELD-eval construction details
#### Per-generator counts and per-domain configuration.
MELD-eval follows RAID’s English-domain protocol over eight domains. Each generator is queried under one of two prompting modes: *continuation* (books, news, reddit), in which the model is asked to extend a short human seed, and *instruction* (abstracts, recipes, reviews, wiki, poetry), in which the model is given only a topic line. Per-generator AI row counts and per-domain seed caps, target lengths, and seed inputs are summarized in Table [6](https://arxiv.org/html/2605.06903#A3.T6).
Table 6: **MELD-eval pool construction.** *Top:* per-generator AI row counts (clean and attacked) and the generator’s share of the AI side; the 7,862 paired human seeds are shared across generators. *Bottom:* per-domain prompting configuration. Mode A is continuation (the model extends a human seed); Mode B is instruction (the model is given a topic line). Reviews is capped at 862 due to its English-filtered pool size.
#### Prompting and decoding.
All generators receive the same plain-text system instruction: no preamble, no meta-commentary, no chain-of-thought, and no markdown formatting. User prompts follow one of the two modes in Table [6](https://arxiv.org/html/2605.06903#A3.T6): in continuation mode, the model extends a short human seed in the same voice and register; in instruction mode, the model is given only a topic line and asked to produce the target-domain text. We use temperature 0.7, top-$p$ 0.95, and a maximum output length of 1024 tokens; Qwen 3.6 Plus is queried with thinking disabled so that its outputs remain comparable in length and style. Exact prompt templates and model snapshot identifiers are included in the code release.
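The sketch below reconstructs this setup for illustration only. The exact prompt wording ships with the code release, so the system and user strings here are placeholders; only the mode-to-domain mapping and the decoding parameters come from the text above.

```python
# Illustrative MELD-eval prompting/decoding configuration. The string
# templates are placeholders, not the released prompts.
SYSTEM = ("Write plain text only: no preamble, no meta-commentary, "
          "no chain-of-thought, and no markdown formatting.")

CONTINUATION = {"books", "news", "reddit"}                            # Mode A
INSTRUCTION = {"abstracts", "recipes", "reviews", "wiki", "poetry"}   # Mode B

def user_prompt(domain: str, seed_or_topic: str) -> str:
    if domain in CONTINUATION:
        # Mode A: extend a short human seed in the same voice and register.
        return ("Continue the following text in the same voice and register:\n\n"
                + seed_or_topic)
    # Mode B: produce the target-domain text from a topic line alone.
    return f"Write a {domain} text about: {seed_or_topic}"

DECODING = {"temperature": 0.7, "top_p": 0.95, "max_tokens": 1024}
```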
#### Attacks and text normalization.
Each clean AI row is transformed with the six RAID-style attacks at RAID’s per-token rates, producing 188,688 attacked rows from 31,448 clean AI rows. The eval-time attack set overlaps the train-time augmentation set (Section [3.3](https://arxiv.org/html/2605.06903#S3.SS3)) on three families (homoglyph, whitespace, and synonym), while zero-width-space insertion, upper–lower flip, and ±2 digit perturbation are held out from train-time augmentation. On the more conversational domains (reviews, recipes, poetry), we strip residual markdown uniformly from both AI outputs and the paired human texts so detectors cannot exploit formatting artifacts.
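To make the attack families concrete, here are minimal sketches of three of them. The per-character rates and the homoglyph table are illustrative; the pool itself uses RAID’s exact per-token rates and mappings.

```python
import random

# Latin -> Cyrillic lookalike substitutions (a small illustrative table).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def homoglyph(text: str, rate: float = 0.1, rng=random) -> str:
    # Overlaps the train-time augmentation set.
    return "".join(HOMOGLYPHS[c] if c in HOMOGLYPHS and rng.random() < rate
                   else c for c in text)

def zero_width_space(text: str, rate: float = 0.05, rng=random) -> str:
    # Held out from train-time augmentation: insert U+200B between characters.
    return "".join(c + ("\u200b" if rng.random() < rate else "") for c in text)

def upper_lower_flip(text: str, rate: float = 0.1, rng=random) -> str:
    # Also held out at train time: randomly swap letter case per character.
    return "".join(c.swapcase() if c.isalpha() and rng.random() < rate else c
                   for c in text)
```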
## Appendix D Qualitative comparison
Table 7 compares MELD with Binoculars, ModernBERT-Detect, and RoBERTa-ChatGPT on random examples from HC3, DetectRL-test, and MELD-eval. These pools stress near-saturated clean performance, attacked AI text, and current-generator transfer, respectively. Each cell reports the standardized margin from the FPR = 1% threshold; positive values indicate an AI decision and negative values a human decision, with the full definition given in the caption.
Table 7: **Per-text disagreements on HC3, DetectRL, and MELD-eval.** Each cell reports the standardized margin of a detector score from the pool-specific FPR = 1% threshold, $\Phi^{-1}(\text{percentile-in-humans})-\Phi^{-1}(0.99)$, in standard-deviation units. The threshold is 0 by construction, so ✓ marks correct classification (AI: margin $>0$; human: margin $\leq 0$) and ✗ marks an operating-point error. Rows 1–4 are HC3 clean examples (2 AI, 2 human); rows 5–8 are DetectRL attacked examples (3 AI, 1 human); rows 9–12 are clean MELD-eval AI examples, one per generator; rows 13–16 are attack-augmented MELD-eval AI examples, one per generator. These disagreement examples show cases where MELD remains on the correct side of the low-FPR threshold while other open-source detectors fail.

| # | Source / attack | Excerpt (truncated) | MELD | Bino | MBERT-D | RoB-CGPT |
|---|---|---|---|---|---|---|
| | **HC3 (clean text)** | | | | | |
| 1 | HC3, clean | AI: John Adams became President in 1797. He was the second President of the United States, serving one term from 1797 to 1801. | +0.71 ✓ | −1.32 ✗ | −3.37 ✗ | +0.87 ✓ |
| 2 | HC3, clean | AI: Charcoal and regular coal are similar in that they are both made from carbon-rich materials and they can be used as fuels. However, there are some differences between the two that make burning charcoal less practical… | +0.58 ✓ | −1.52 ✗ | +0.42 ✓ | −6.60 ✗ |
| 3 | HC3, clean | Human: \* " I ’d imagine it has something to do with availability . " \* True . We ’d probably have problems locking up women in cages too , though . | −1.43 ✓ | −0.65 ✓ | +1.70 ✗ | −1.15 ✓ |
| 4 | HC3, clean | Human: Logical consequence (also entailment) is a fundamental concept in logic, which describes the relationship between statements that hold true when one statement logically follows from one or more statements. A valid logical argument… | −4.35 ✓ | −3.69 ✓ | +0.37 ✗ | −0.58 ✓ |
| | **DetectRL (attacked)** | | | | | |
| 5 | DetectRL, paraphrase | AI: Counting them up revealed steady progress one line at a time. The story was gradually taking shape sentence by sentence. Introducing details like checking my count kept the writing engaging. Finding creative ways to el… | +0.08 ✓ | −1.31 ✗ | −2.69 ✗ | −2.13 ✗ |
| 6 | DetectRL, perturbation | AI: I’ve also noticed that it has a peasant scent. It’ not too strong or overpowering, but it’s definitely noticeable. It’s a nica bonus, especially since I enjoy using products that have a nice fragrance. The packagign i… | +0.36 ✓ | −0.94 ✗ | −4.91 ✗ | −1.38 ✗ |
| 7 | DetectRL, prompt | AI: The only thing that was edible was the soup. The service was also terrible. The waiter was rude and dismissive. He took our order and then disappeared for over 30 minutes. When he finally came back with our food, it w… | +0.04 ✓ | −0.85 ✗ | −2.05 ✗ | −0.75 ✗ |
| 8 | DetectRL, paraphrase | Human: In the hushed living room, my family surrounds me. The sole sound is a news reporter’s voice droning from the TV. Dim light bathes the room in an eerie, pulsating blue glow that mimics the rhythm of my heartbeat. On t… | −0.88 ✓ | −1.19 ✓ | +0.46 ✗ | −1.03 ✓ |
| | **MELD-eval clean** | | | | | |
| 9 | MELD-eval, clean (GPT-5.4-Mini, abstracts) | AI: Whitney’s broken-circuit theorem gives a foundational expansion of the chromatic polynomial in terms of acyclic substructures of a graph and has played a central role in the combinatorial theory of graph colorings. In this paper… | +1.23 ✓ | −2.93 ✗ | −2.10 ✗ | −3.10 ✗ |
| 10 | MELD-eval, clean (Gemini-3-Flash, abstracts) | AI: The deployment of autonomous robotic systems in dynamic, unstructured environments necessitates perception pipelines capable of high-fidelity spatial reasoning under stringent temporal constraints. Reliable image segmentation… | +0.48 ✓ | −3.17 ✗ | −0.56 ✗ | −3.79 ✗ |
| 11 | MELD-eval, clean (Claude-Haiku-4.5, abstracts) | AI: The Riemann Hypothesis remains one of mathematics’ most profound unsolved problems, asserting that all non-trivial zeros of the Riemann zeta function lie on the critical line Re(s) = 1/2. This review synthesizes… | +0.60 ✓ | −2.10 ✗ | −1.50 ✗ | −2.59 ✗ |
| 12 | MELD-eval, clean (Qwen-3.6-Plus, abstracts) | AI: Boolean satisfiability solving has evolved from a theoretically intractable problem into a cornerstone of modern electronic design automation, formal verification, and artificial intelligence. While early algorithms struggled… | +0.87 ✓ | −3.19 ✗ | −1.75 ✗ | −2.71 ✗ |
| | **MELD-eval attacked** | | | | | |
| 13 | MELD-eval, upper-lower (GPT-5.4-Mini, reddit) | AI: their gc and thEn whEn i asked whY i coUldn’t join they Just Said “it’s complicatEd” like Okay?? apparently it’s fine for literallY everyone else except me. one of them wIll poSt pics of them all hanging out and i’m sit… | +1.51 ✓ | −2.65 ✗ | −1.68 ✗ | −3.28 ✗ |
| 14 | MELD-eval, upper-lower (Gemini-3-Flash, poetry) | AI: The clock face blurs in the heavy silence of Three a.m. The walls are thin, the shadows long, and I am the only pulse in tHis Room. I tiLt my chin toward the wiNdow, searching the blacK velvet of the void for a sign… | +1.23 ✓ | −1.67 ✗ | −1.51 ✗ | −2.59 ✗ |
| 15 | MELD-eval, homoglyph (Claude-Haiku-4.5, abstracts) | AI: Image-to-image translation, the task of converting images from one domain to another while preserving content structure, has become increasingly important for applications ranging from style transfer to medical imaging. Recent advances in generative adversarial networks… | +1.51 ✓ | −3.71 ✗ | −0.94 ✗ | −3.69 ✗ |
| 16 | MELD-eval, synonym (Qwen-3.6-Plus, reddit) | AI: I know moving is trying for everyone, but I didn’t realize it would hit her this hard. She has always been a pretty chill cat, the kind that sleeps through vacuuming and doesn’t even flinch when the doorbell rings. See… | +1.51 ✓ | −2.36 ✗ | −3.59 ✗ | −2.13 ✗ |

On HC3, high AUROC still coexists with operating-point errors for the baselines, while MELD is correct on all examples. On DetectRL, MELD remains above threshold on attacked AI rows and avoids the paraphrased-human false positive made by ModernBERT-Detect. On MELD-eval, MELD stays above threshold on clean and attacked text from all four generators, whereas the baselines fall below threshold on most rows. These examples illustrate the patterns behind the low-FPR and generator-shift results in Tables [3](https://arxiv.org/html/2605.06903#S5.T3) and [4](https://arxiv.org/html/2605.06903#S5.T4), and they match the ablation results in Section [5.3](https://arxiv.org/html/2605.06903#S5.SS3).
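The standardized margin in the caption is straightforward to compute from a detector’s scores on the human pool; a minimal sketch follows, assuming SciPy for the probit $\Phi^{-1}$ (function name and clipping constant are our choices).

```python
import numpy as np
from scipy.stats import norm

def standardized_margin(score: float, human_scores: np.ndarray,
                        fpr: float = 0.01) -> float:
    # Probit of the score's percentile within the human pool, minus the
    # probit of the (1 - FPR) quantile, so the threshold sits at 0.
    pct = float((np.asarray(human_scores) < score).mean())
    pct = min(max(pct, 1e-6), 1 - 1e-6)         # keep the probit finite
    return norm.ppf(pct) - norm.ppf(1.0 - fpr)  # > 0 => AI decision
```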
## Appendix E Supplementary statistics
We report two complementary analyses. Table [8](https://arxiv.org/html/2605.06903#A5.T8) gives paired-difference bootstrap CIs against published baselines. Table [9](https://arxiv.org/html/2605.06903#A5.T9) retrains the three strongest supervised baselines on MELD’s training data, isolating the data effect from the multi-task-objective effect.
#### Paired-difference 95% bootstrap CIs.
Section A compares MELD against the same-backbone single-head Dense ablation on six pools. Section B compares MELD against the three published baselines on MELD-eval overall and per-generator. **Bold** marks rows whose 95% CI excludes zero (paired-significant at $\alpha=0.05$).
Table 8: **Paired-difference 95% bootstrap CIs.** Each row reports $\Delta=\text{MELD}-\text{baseline}$, the point estimate and 95% CI, for AUROC, TPR@1% FPR, and TPR@5% FPR. In Section A, “Dense” is the single-head ablation from Table [5](https://arxiv.org/html/2605.06903#S5.T5), with the same backbone and training data as MELD but with auxiliary heads and Kendall uncertainty weighting removed; EMA distillation and pairwise ranking are kept. Positive values mean MELD is stronger. **Bold** marks rows whose CI excludes zero. ($B=5{,}000$ resamples of the per-row scores; RNG seed fixed to 2026.) Section B is uniformly significant, and every 95% CI against ModernBERT-Detect, RepreGuard, and Binoculars lies above zero. Section A shows the same pattern in a more diagnostic setting. The largest gains appear on TuringBench, where Dense is not saturated, and HC3 shows the same trend at TPR@1% FPR. On M4GT-test and Ghostbuster the point estimates are near zero and their CIs include zero. On MAGE-test and DetectRL-test, Dense already saturates AUROC, so the AUROC differences are very small (in some cells slightly negative with CIs excluding zero), and TPR@1% FPR also has CIs crossing zero.
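A sketch of the paired-difference bootstrap under the stated settings ($B=5{,}000$, seed 2026) is given below. It resamples rows jointly so both detectors are evaluated on the same texts in each replicate; the exact per-row score format and the TPR@FPR metric implementation are assumptions, with AUROC shown via scikit-learn as an example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_ci(scores_meld, scores_base, labels,
                        metric=roc_auc_score, n_boot=5000, seed=2026):
    # Delta = metric(MELD) - metric(baseline), resampling rows jointly so
    # each replicate scores both detectors on the identical text subset.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    a, b = np.asarray(scores_meld), np.asarray(scores_base)
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(labels), size=len(labels))
        if labels[idx].min() == labels[idx].max():
            idx = np.arange(len(labels))   # degenerate one-class resample
        deltas[i] = metric(labels[idx], a[idx]) - metric(labels[idx], b[idx])
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(deltas.mean()), (float(lo), float(hi))
```

A row is paired-significant at $\alpha=0.05$ exactly when the returned interval excludes zero.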
#### Same-data retraining.
We retrain the three supervised baselines (RoBERTa-ChatGPT, ModernBERT-Detect, and RepreGuard) from Section B on MELD’s training data, holding each baseline’s official model and recipe fixed; only the training corpus changes. Table [9](https://arxiv.org/html/2605.06903#A5.T9) reports AUROC and TPR@5% FPR for each (detector, pool) cell on the five held-out benchmarks of Table [3](https://arxiv.org/html/2605.06903#S5.T3) and on MELD-eval, under the same per-pool calibration protocol. Each baseline appears as a public-checkpoint row and a same-data retrain row.
Table 9: **Same-data retraining of supervised baselines.** “Public” is each baseline’s author-released checkpoint trained on the authors’ own corpus. “MELD-data” is the retrain on MELD’s mixture using that baseline’s own training code and hyperparameters. ↓ marks rows for which the MELD-data retrain is lower than the public checkpoint on both metrics on at least one evaluation pool. The results show that simply retraining prior baselines on MELD’s mixture does not reliably recover MELD’s gains, indicating that the training corpus alone is not sufficient to explain the improvement. By contrast, MELD remains strongest on the deployment metric TPR@5% FPR on most evaluation pools. Together with Section A, these results indicate that MELD’s advantage is driven primarily by the multi-task objective, not just by access to the training mixture.