SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection

arXiv cs.AI Papers

Summary

Introduces SAGE, the first end-to-end LLM-driven multi-agent framework for fraud detection, using a Data Diagnostic Tree and Markov decision process with natural-language gradients to optimize models under class imbalance. Experiments show significant F1 improvements over baselines across five datasets.

arXiv:2606.08146v1 Announce Type: new Abstract: Fraud detection in payment, e-commerce, and telecommunications systems requires accuracy at the individual level, robustness under severe class imbalance, and ease of understanding for risk managers. Existing methods fail at least one of these requirements: automated machine learning systems search a fixed numerical space without semantic awareness of the dataset; graph neural network-based methods require pre-defined relational graphs and remain opaque at the individual-decision level; and the design of general-purpose large language model (LLM) agents does not consider the recall and precision constraints specific to real-world fraud detection. In this paper, we propose SAGE, the first end-to-end LLM-driven multi-agent framework for fraud detection. SAGE coordinates three dedicated agents that make decisions based on a six-layer Data Diagnostic Tree (DDT) and a Markov decision process guided by natural-language gradients, automatically optimizing the model under a fraud-specific reward. On five fraud datasets and five LLM backbones, SAGE wins $96.00\%$ of method--dataset comparisons and improves F1 by an average of $40.86\%$ over baselines. The code is available at https://github.com/yichenC1c/SAGE.

Cached at: 06/09/26, 08:55 AM

# SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection
Source: [https://arxiv.org/html/2606.08146](https://arxiv.org/html/2606.08146)
Yichen Chen (National University of Singapore, School of Computing, Singapore 117417; chenyichen@u.nus.edu), Siying Li (University of Chinese Academy of Sciences, Institute of Information Engineering, Beijing 100085; lisiying@iie.ac.cn), Yuhang Liang (China Mobile Communications Group, Big Data Business Group, Beijing 102206; liangyuhang@chinamobile.com), Lijun Wang (China Mobile Communications Group, Big Data Business Group, Beijing 102206; wanglijunit@chinamobile.com), Renyang Liu (National University of Singapore, Institute of Data Science, Singapore 117602; ryliu@nus.edu.sg)

###### Abstract

Fraud detection in payment, e-commerce, and telecommunications systems requires accuracy at the individual level, robustness under severe class imbalance, and ease of understanding for risk managers. Existing methods fail at least one of these requirements: automated machine learning systems search a fixed numerical space without semantic awareness of the dataset; graph neural network-based methods require pre-defined relational graphs and remain opaque at the individual-decision level; and the design of general-purpose large language model (LLM) agents does not consider the recall and precision constraints specific to real-world fraud detection. In this paper, we propose SAGE, the first end-to-end LLM-driven multi-agent framework for fraud detection. SAGE coordinates three dedicated agents that make decisions based on a six-layer Data Diagnostic Tree (DDT) and a Markov decision process guided by natural-language gradients, automatically optimizing the model under a fraud-specific reward. On five fraud datasets and five LLM backbones, SAGE wins $96.00\%$ of method–dataset comparisons and improves F1 by an average of $40.86\%$ over baselines. The code is available at [https://github.com/yichenC1c/SAGE](https://github.com/yichenC1c/SAGE).

## 1 Introduction

Fraud has become one of the most prevalent economic threats in the digital age, severely eroding public trust in payment systems, e\-commerce platforms, and telecommunications networks\. The rapid development of large language models \(LLMs\) and increasingly automated fraud strategies have led to an exponential increase in the operational costs of fraud cases\. Therefore, fraud detection needs to shift from back\-end data mining tasks to automated engineering processes that are responsive and sustainable, while maintaining interpretability for risk managers and auditors\. However, in most real\-world environments, the available evidence for each suspicious entity is not a rich relational graph, but rather structured features such as device fingerprints, transaction history, call patterns, and aggregated behavioral statistics\. The core problem this paper aims to address is: how to automatically build accurate, robust, and interpretable fraud detectors based on such individual\-level data\.

Although numerous studies have addressed this problem, existing approaches still fall short. Traditional supervised pipelines and automated machine learning (AutoML) systems excel on structured data, but they rely on fixed numerical searches, and the resulting models lack semantic grounding. Anti-fraud methods based on graph neural networks (GNNs) require pre-defined relational graphs, which are costly to construct in practice, and remain opaque at the individual-decision level, so they struggle to meet the needs of real-world anti-fraud operations. Recently developed general-purpose LLM agents can plan, write code, and self-correct, but they are developed on general benchmarks rather than in the recall- and precision-constrained task environments that anti-fraud work requires. To our knowledge, there is currently no AI agent framework specifically designed for classifier construction in real-world anti-fraud scenarios.

This gap prompted us to propose an innovative design concept: building an LLM agent specifically for constructing fraud detection classifiers for real\-world business scenarios\. In this agent, the model has sufficient autonomy to reason about data features and the classifier itself, while being subject to structural and reward\-based constraints, ensuring its decisions are evidence\-based and consistent with anti\-fraud requirements\. Implementing this design concept faces two specific challenges:

- C1: How can the agent quickly construct a global, realistic picture of the fraud dataset without being overwhelmed by the massive amount of raw column-level data, and thus select an efficient and correct classifier model?
- C2: How can the agent's iterative process be principled rather than arbitrary, so that each optimization step is grounded in the task objective and the existing evidence?

To answer these questions, we propose SAGE, an end-to-end multi-agent framework for fraud detection on structured data. SAGE decomposes the workflow into three dedicated LLM-driven agents that operate strictly in sequence. First, the “Profiling Agent” condenses the dataset into a Data Diagnostic Tree (DDT) that captures the full feature set of the fraud dataset semantically, so that the agent does not lose context on excessively long datasets. Second, the “Planning Agent” uses the DDT produced by the “Profiling Agent” to understand the dataset as a whole, selects an appropriate algorithm, and synthesizes an initial classifier customized for the dataset. Finally, the “Optimization Agent” iteratively optimizes the model in the code space through a finite-horizon Markov decision process. In each iteration, the language model first emits a natural-language gradient that evaluates the current model, then translates this evaluation into a localized code edit, ultimately meeting the recall and precision constraints that the reward mechanism sets for the fraud business scenario. Users only need to provide the fraud dataset; the entire process, from data analysis to model optimization, can be completed without human intervention, saving anti-fraud experts a significant amount of time previously spent on data cleaning, feature engineering, and parameter tuning.

In summary, our main contributions are as follows:

- To the best of our knowledge, SAGE is the first end-to-end LLM-based multi-agent framework designed specifically for fraud detection on individual-level tabular datasets.
- We introduce the DDT, a six-layer structured prior that characterizes the feature information of extremely long raw datasets as a fraud-aware tree-structured representation, allowing the agent to make dataset-grounded decisions within a limited context window.
- We formalize agent code optimization as a finite-horizon MDP driven by natural-language gradients, in which natural language is translated into concrete code modifications under fraud-specific recall and precision constraints.
- We evaluate SAGE on five fraud datasets (four public benchmarks and one real-world industrial dataset) and across five LLM backbones. The results show that it significantly outperforms existing AutoML systems, LLM coding agents, and human experts, while remaining insensitive to changes in the underlying language model.

The remainder of this paper is organized as follows. Section [2](https://arxiv.org/html/2606.08146#S2) analyzes the current background and existing fraud detection paradigms, and compares related works. Section [3](https://arxiv.org/html/2606.08146#S3) formalizes the anti-fraud problem and introduces the modeling primitives that run through SAGE. Section [4](https://arxiv.org/html/2606.08146#S4) details the SAGE framework and its design principles, including the Data Diagnostic Tree and the natural-language-gradient-guided Markov decision process (MDP). Section [5](https://arxiv.org/html/2606.08146#S5) introduces the experimental setup, main results, robustness and sensitivity analysis, interpretability case study, and ablation study. Section [6](https://arxiv.org/html/2606.08146#S6) discusses the limitations and future research directions, and Section [7](https://arxiv.org/html/2606.08146#S7) summarizes the entire paper.

## 2 Related Work

### 2.1 Fraud detection background

Fraud has become one of the most destructive economic threats worldwide, permeating both financial and telecommunications sectors. A 2025 survey of 46,000 adults across 42 markets found that 57% of respondents had experienced fraud in the preceding year, with estimated global losses reaching \$442 billion \[[18](https://arxiv.org/html/2606.08146#bib.bib25)\]. Fraud takes many forms: credit-card and payment fraud, e-commerce fraud, account takeover, cryptocurrency transaction fraud, and, in telecommunications, voice phishing and SMS phishing (smishing). A substantial body of work has formalized these settings: \[[9](https://arxiv.org/html/2606.08146#bib.bib17)\] formalizes the credit-card fraud detection system; \[[42](https://arxiv.org/html/2606.08146#bib.bib27)\] addresses credit-card fraud through a semi-supervised gated attention network over transaction graphs; \[[43](https://arxiv.org/html/2606.08146#bib.bib28)\] targets camouflaged frauds in tabular data through cohort augmentation; \[[22](https://arxiv.org/html/2606.08146#bib.bib26)\] introduces an event-centric, commonsense-guided framework for fake detection; and \[[14](https://arxiv.org/html/2606.08146#bib.bib29)\] targets covert textual expressions of telecom fraud. Across these settings, the detection problem reduces to individual-level classification: judging whether a single transaction, account, or subscriber is fraudulent from its own behavioral record, typically encoded as device attributes, transaction records, calling behavior, and aggregated temporal/frequency statistics.
This task carries a difficult statistical profile: fraud is extremely rare (legitimate activity outnumbers fraud by tens to hundreds to one), a severe imbalance recognized as a fundamental learning obstacle \[[21](https://arxiv.org/html/2606.08146#bib.bib40)\] and repeatedly identified as the central challenge of fraud detection \[[27](https://arxiv.org/html/2606.08146#bib.bib19),[46](https://arxiv.org/html/2606.08146#bib.bib20)\]; the cost of errors is asymmetric under a strict false-alarm budget; and fraud is adversarial and non-stationary \[[10](https://arxiv.org/html/2606.08146#bib.bib18)\]. These properties are reflected in widely used benchmarks such as the European credit-card dataset \[[31](https://arxiv.org/html/2606.08146#bib.bib4)\], the PaySim simulator \[[26](https://arxiv.org/html/2606.08146#bib.bib5)\], and the Elliptic Bitcoin dataset \[[40](https://arxiv.org/html/2606.08146#bib.bib6)\].

### 2.2 Fraud detection methods

Classical research has demonstrated the power of data mining classifiers: \[[4](https://arxiv.org/html/2606.08146#bib.bib31)\] compared logistic regression, support vector machines, and random forests on real credit-card data, and \[[44](https://arxiv.org/html/2606.08146#bib.bib32)\] confirmed the effectiveness of random forests; sequence-aware variants such as the LSTM model of \[[24](https://arxiv.org/html/2606.08146#bib.bib33)\] and the hidden-Markov approach of \[[33](https://arxiv.org/html/2606.08146#bib.bib34)\] further exploit transaction order. Building on this, automated machine learning (AutoML) systems such as Auto-sklearn \[[17](https://arxiv.org/html/2606.08146#bib.bib35)\], FLAML \[[38](https://arxiv.org/html/2606.08146#bib.bib7)\], and AutoGluon \[[16](https://arxiv.org/html/2606.08146#bib.bib8)\] search over algorithms and hyperparameters to build gradient-boosted tree ensembles like XGBoost \[[6](https://arxiv.org/html/2606.08146#bib.bib24)\], which remains the de facto standard since tree-based models still match or outperform deep networks on tabular data \[[19](https://arxiv.org/html/2606.08146#bib.bib22)\]. However, AutoML explores a fixed numerical search space without reasoning about dataset semantics, performs no domain-grounded feature engineering, and returns a model rather than a human-readable rationale; its search is statistical, not diagnostic. Another paradigm models fraudulent behavior on a graph of users, transactions, and devices \[[8](https://arxiv.org/html/2606.08146#bib.bib23)\]. Representative methods include the semi-supervised graph attention network of \[[39](https://arxiv.org/html/2606.08146#bib.bib36)\], the camouflage-resistant CARE-GNN \[[15](https://arxiv.org/html/2606.08146#bib.bib37)\], and the imbalance-aware PC-GNN \[[25](https://arxiv.org/html/2606.08146#bib.bib38)\].
These methods excel at group-level and link-level analysis, but they require a pre-defined relational graph, which entails costly relational-data collection, substantial entity-resolution difficulty, and manual architecture design, and they remain opaque at the individual-decision level. Crucially, in most real-world fraud-screening scenarios the available evidence is plain individual-level tabular data rather than a rich relational graph \[[36](https://arxiv.org/html/2606.08146#bib.bib39),[20](https://arxiv.org/html/2606.08146#bib.bib45)\], so the multi-relational structure these methods rely on is often absent or prohibitively expensive to construct. For the individual-level classification that dominates real deployments, tree-based models tailored to tabular data therefore remain both sufficient and preferable.

### 2.3 Gap for Fraud Detection

However, the rise of large language models has intensified the threat of fraud: generative AI enables fraudsters to produce personalized phishing messages and deepfake content at scale, weakening the lexical and grammatical cues traditional detection methods rely on \[[34](https://arxiv.org/html/2606.08146#bib.bib30)\]. Meanwhile, high-quality classical pipelines require substantial expert effort in feature engineering and tuning, limiting their responsiveness as fraud techniques evolve. More recently, large language models have been used as autonomous agents that reason and act in external environments. AutoGen \[[41](https://arxiv.org/html/2606.08146#bib.bib41)\] orchestrates multiple conversable agents to solve tasks collaboratively; Reflexion \[[35](https://arxiv.org/html/2606.08146#bib.bib42)\] reinforces agents through verbal self-reflection rather than weight updates; and textual-gradient methods \[[32](https://arxiv.org/html/2606.08146#bib.bib1)\] optimize programs by treating natural-language critiques as gradients. Closer to data work, Data Interpreter \[[23](https://arxiv.org/html/2606.08146#bib.bib43)\] decomposes data-science tasks into hierarchical plans and invokes tools for modeling, while coding agents such as Claude Code \[[2](https://arxiv.org/html/2606.08146#bib.bib44)\] autonomously write, execute, and debug code from natural-language instructions. These frameworks show that an LLM agent can diagnose, plan, and self-correct in language, but they have been developed on generic reasoning, coding, and data-science benchmarks, with none tailored to fraud detection or to the strict recall and precision constraints of production anti-fraud systems. Table [1](https://arxiv.org/html/2606.08146#S2.T1) contrasts the main paradigms against the requirements of individual-level fraud detection.
As the table shows, AutoML suits tabular data but is neither semantic nor fraud-specific; graph methods are fraud-oriented but require relational structure and are opaque; and current LLM agents are agentic and interpretable yet not designed for fraud detection. No existing approach simultaneously satisfies all four requirements, which motivates an LLM-driven agent purpose-built for tabular fraud detection.

Table 1: Representative methods against the four requirements of individual-level fraud detection.

## 3 Preliminaries

This section establishes the formal foundations on which SAGE is built. We define the tabular fraud detection problem (Section [3.1](https://arxiv.org/html/2606.08146#S3.SS1)), characterize the class imbalance that shapes any fraud detector (Section [3.2](https://arxiv.org/html/2606.08146#S3.SS2)), and introduce the modeling and optimization primitives reused throughout Section [4](https://arxiv.org/html/2606.08146#S4) (Section [3.3](https://arxiv.org/html/2606.08146#S3.SS3)).

### 3.1 Problem Formalization

We study fraud detection in its tabular, individual-transaction form. Let each instance be a feature vector $\mathbf{x}\in\mathcal{X}\subseteq\mathbb{R}^{p}$ describing a single transaction or account, together with a binary label $y\in\{0,1\}$, where $y=1$ denotes a fraudulent (positive) instance and $y=0$ a legitimate (negative) one. A dataset $P=(X,y)$ collects $n$ such instances as a feature matrix $X\in\mathbb{R}^{n\times p}$ and a label vector $y\in\{0,1\}^{n}$. The goal of a fraud detector is to obtain a scoring function

$$f:\mathcal{X}\to[0,1],\qquad \hat{y}=\mathbf{1}\!\left[\,f(\mathbf{x})\geq\delta\,\right], \tag{1}$$

that maps each instance to a fraud probability, which a decision threshold $\delta\in[0,1]$ converts into a hard prediction $\hat{y}$. In the remainder of the paper, this scoring function is realized as the model $M\in\mathcal{M}$ produced by SAGE, and the threshold $\delta$ corresponds to the tunable `threshold_tune` handle of Section [4.2](https://arxiv.org/html/2606.08146#S4.SS2). Following \[[11](https://arxiv.org/html/2606.08146#bib.bib16)\] and \[[10](https://arxiv.org/html/2606.08146#bib.bib18)\], we treat fraud detection not as plain binary classification but as a constrained problem: maximizing detection quality subject to a tight false-alarm budget and a minimum-recall requirement, a view that directly motivates the composite reward of Section [4.3.3](https://arxiv.org/html/2606.08146#S4.SS3.SSS3). We use $\delta$ throughout for the decision threshold and reserve $\tau$ (introduced in Section [3.3.4](https://arxiv.org/html/2606.08146#S3.SS3.SSS4)) for the dataset-specific recall constraint in the reward function.
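As a minimal illustration of Eq. (1), the sketch below applies a decision threshold to a handful of scores. The function name `hard_predictions` and the example scores are hypothetical and are not taken from the SAGE code base.

```python
from typing import List, Sequence


def hard_predictions(scores: Sequence[float], delta: float) -> List[int]:
    """Apply y_hat = 1[f(x) >= delta] to fraud scores that lie in [0, 1]."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return [1 if s >= delta else 0 for s in scores]


# Hypothetical scores f(x) for four instances.
scores = [0.02, 0.50, 0.91, 0.49]
preds = hard_predictions(scores, delta=0.5)
print(preds)
```

Raising `delta` flags fewer instances, which is how a threshold trades recall against the false-alarm budget.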

### 3.2 Class Imbalance

A defining property of fraud data is extreme class imbalance, which we quantify by the *imbalance ratio*

$$\mathrm{IR}\;=\;\frac{n_{\mathrm{neg}}}{n_{\mathrm{pos}}}, \tag{2}$$

where $n_{\mathrm{pos}}$ and $n_{\mathrm{neg}}$ are the numbers of positive and negative instances; in real fraud datasets $\mathrm{IR}$ commonly spans two orders of magnitude. Under such skew, a trivial classifier predicting every instance as legitimate attains near-perfect accuracy while detecting no fraud, rendering accuracy uninformative \[[27](https://arxiv.org/html/2606.08146#bib.bib19),[46](https://arxiv.org/html/2606.08146#bib.bib20)\]. Common remedies span *data-level* resampling that rebalances the training distribution (e.g., SMOTE \[[5](https://arxiv.org/html/2606.08146#bib.bib21)\]) and *algorithm-level* loss reweighting such as a positive-class weight $w_{\mathrm{pos}}\approx\mathrm{IR}$.
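The effect of Eq. (2) on accuracy can be checked with a few lines of arithmetic. The sketch below uses a made-up label vector (990 legitimate, 10 fraudulent); the helper name `imbalance_ratio` is an assumption for illustration only.

```python
from typing import Sequence


def imbalance_ratio(labels: Sequence[int]) -> float:
    """Return IR = n_neg / n_pos for binary labels (1 = fraud, 0 = legitimate)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("IR is undefined without positive instances")
    return n_neg / n_pos


# Toy label vector: 990 legitimate and 10 fraudulent instances.
labels = [0] * 990 + [1] * 10
ir = imbalance_ratio(labels)

# A trivial classifier that predicts "legitimate" for every instance.
trivial_preds = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(trivial_preds, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(trivial_preds, labels)) / 10

print(ir, accuracy, recall)
```

On this toy data the trivial classifier is 99% accurate yet catches no fraud, and a positive-class weight near `ir` is one algorithm-level counterweight (for example, XGBoost exposes this as `scale_pos_weight`).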

### 3.3 Formal Definitions of SAGE

#### 3.3.1 SAGE as a Composite Mapping

Let $\mathcal{P}$ denote the *problem space*, the set of tabular fraud detection problems $P=(X,y)$ introduced in Section [3.1](https://arxiv.org/html/2606.08146#S3.SS1). Let $\mathcal{M}$ denote the *model space* of trained fraud classifiers, $\mathcal{C}$ the *code space* of executable training scripts, and $\mathbb{R}^{d}$ the $d$-dimensional metric vector space (in this work $d=4$, covering AUPRC, F1, MCC, and $\mathrm{R}@\mathrm{FPR}_{10^{-4}}$). SAGE is then formalized as the composite mapping

$$\begin{aligned}\mathrm{SAGE}&:\mathcal{P}\to\mathcal{C}\times\mathcal{M}\times\mathbb{R}^{d},\\ \mathrm{SAGE}(P)&=\mathcal{A}_{3}\!\big(P,\,\mathcal{A}_{2}(P,\,\mathcal{A}_{1}(P))\big),\end{aligned} \tag{3}$$

in which the three agents $\mathcal{A}_{1},\mathcal{A}_{2},\mathcal{A}_{3}$ are applied strictly in sequence. The final output is the triple $(c^{*},M^{*},\mathbf{m}^{*})$ recorded at the best iteration $t^{*}=\arg\max_{t}R(\mathbf{m}_{t})$, where $R(\cdot)$ is the composite reward defined in Eq. ([11](https://arxiv.org/html/2606.08146#S3.E11)) and $\mathbf{m}_{t}\in\mathbb{R}^{d}$ is the metric vector of the model at iteration $t$.

#### 3.3.2 The Structure of the Three Agents

Each agent in Eq. ([3](https://arxiv.org/html/2606.08146#S3.E3)) is a typed mapping between intermediate spaces:

$$\begin{aligned}\mathcal{A}_{1}&:\mathcal{P}\to\mathcal{R}\times\mathcal{T},\\ \mathcal{A}_{2}&:\mathcal{P}\times\mathcal{R}\times\mathcal{T}\to\mathcal{C}\times\mathcal{M},\\ \mathcal{A}_{3}&:\mathcal{P}\times\mathcal{C}\times\mathcal{M}\to\mathcal{C}\times\mathcal{M}\times\mathbb{R}^{d},\end{aligned} \tag{4}$$

where $\mathcal{R}$ is the profiling report space and $\mathcal{T}$ the DDT space (Section [3.3.3](https://arxiv.org/html/2606.08146#S3.SS3.SSS3)). Internally, each agent decomposes as follows. The Profiling Agent $\mathcal{A}_{1}$ runs a deterministic statistical computation $\mathrm{Stats}(\cdot)$, builds the Data Diagnostic Tree, and adds a semantic interpretation through an LLM call:

$$\begin{split}&\mathcal{A}_{1}(P)=(\rho,T),\\ &\rho=\mathrm{LLM}_{\mathrm{interpret}}(\mathrm{Stats}(P)),\\ &T=\mathrm{DDT}(P).\end{split} \tag{5}$$

The Planning Agent $\mathcal{A}_{2}$ then consumes $(\rho,T)$ and synthesizes an initial training program $c_{0}$, executed in a sandbox runner to obtain the initial model $M_{0}$ and its metric vector $\mathbf{m}_{0}$:

$$\begin{aligned}\mathcal{A}_{2}(P,\rho,T)&=(c_{0},M_{0}),\\ c_{0}&=\mathrm{LLM}_{\mathrm{codegen}}(\rho,T),\\ M_{0}&=\mathrm{Execute}(c_{0},P).\end{aligned} \tag{6}$$

Finally, the Optimization Agent $\mathcal{A}_{3}$ iteratively refines $c_{0}$ for at most $K$ rounds, returning the triple from the best iteration:

$$\mathcal{A}_{3}(P,c_{0},M_{0})=(c^{*},M^{*},\mathbf{m}^{*}),\quad t^{*}=\arg\max_{t\in[0,K]}R(\mathbf{m}_{t}). \tag{7}$$

Here $\mathrm{LLM}_{\mathrm{interpret}}$ and $\mathrm{LLM}_{\mathrm{codegen}}$ are independent LLM calls, and $\mathrm{Execute}(\cdot,P)$ returns both the trained model and its metric vector $\mathbf{m}\in\mathbb{R}^{d}$. The structural innovation of $\mathcal{A}_{1}$ is the DDT $T$ defined next; the optimization loop driving $\mathcal{A}_{3}$ is formalized in Section [3.3.4](https://arxiv.org/html/2606.08146#S3.SS3.SSS4); the engineering instantiation of each $\mathcal{A}_{i}$ is presented in Section [4](https://arxiv.org/html/2606.08146#S4).
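The typed composition in Eqs. (3) to (7) can be pictured as three functions chained in order. The sketch below is only a shape check under stated assumptions: the agents are stubs with no LLM calls or model training, and every name in it (`run_pipeline`, `stub_profile`, `stub_plan`, `stub_optimize`) is hypothetical rather than part of the SAGE implementation.

```python
from typing import Callable, Dict, List, Tuple

Problem = Tuple[List[List[float]], List[int]]  # P = (X, y)
Metrics = Dict[str, float]                     # metric vector m


def run_pipeline(
    problem: Problem,
    profile: Callable[[Problem], Tuple[str, dict]],                # A1: P -> (rho, T)
    plan: Callable[[Problem, str, dict], Tuple[str, object]],      # A2: (P, rho, T) -> (c0, M0)
    optimize: Callable[[Problem, str, object], Tuple[str, object, Metrics]],  # A3
) -> Tuple[str, object, Metrics]:
    """Apply the three agents strictly in sequence, as in Eq. (3)."""
    report, tree = profile(problem)
    code0, model0 = plan(problem, report, tree)
    return optimize(problem, code0, model0)


# Stub agents standing in for the LLM-driven ones (assumptions, not the real agents).
def stub_profile(problem: Problem) -> Tuple[str, dict]:
    X, y = problem
    return f"{len(X)} rows, {sum(y)} positives", {"scale": {"n_rows": len(X)}}


def stub_plan(problem: Problem, report: str, tree: dict) -> Tuple[str, object]:
    return "# training script placeholder", "model-0"


def stub_optimize(problem: Problem, code: str, model: object) -> Tuple[str, object, Metrics]:
    return code, model, {"F1": 0.0}


toy_problem: Problem = ([[0.0], [1.0]], [0, 1])
result = run_pipeline(toy_problem, stub_profile, stub_plan, stub_optimize)
print(result)
```

Passing the agents in as callables keeps the forward-only data flow explicit: each stage sees only the problem and the structured output of the stage before it.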

#### 3.3.3 Data Diagnostic Tree

The Data Diagnostic Tree $T$ produced by $\mathcal{A}_{1}$ in Eq. ([5](https://arxiv.org/html/2606.08146#S3.E5)) is a rooted attributed tree $T=(N,A)$, where $N$ is a set of nodes organized hierarchically with a unique root $n_{\mathrm{root}}$ representing the target dataset, and $A:N\to 2^{\mathcal{K}\times\mathcal{V}}$ assigns to each node a set of key–value pairs drawn from a predefined key space $\mathcal{K}$ and value space $\mathcal{V}$. Every non-root node has exactly one parent. The DDT is structured around six *semantic layers*, each forming one principal branch of the tree:

$$\begin{split}&T=\big\{\,L_{\mathrm{scale}},\;L_{\mathrm{label}},\;L_{\mathrm{feature}},\\ &\qquad\;L_{\mathrm{quality}},\;L_{\mathrm{structure}},\;L_{\mathrm{diagnosis}}\,\big\}.\end{split} \tag{8}$$

The encoded content of each layer and the procedure that constructs $T$ are detailed in Section [4.1](https://arxiv.org/html/2606.08146#S4.SS1).

#### 3.3.4 MDP, Natural-Language Gradient, and Reward

The iterative code refinement of $\mathcal{A}_{3}$ in Eq. ([7](https://arxiv.org/html/2606.08146#S3.E7)) is cast as a finite-horizon Markov Decision Process \[[37](https://arxiv.org/html/2606.08146#bib.bib2)\], a tuple $\langle\mathcal{S},\mathcal{A},\mathcal{T}_{\mathrm{trans}},R,K\rangle$ with state space $\mathcal{S}$, action space $\mathcal{A}$, transition $\mathcal{T}_{\mathrm{trans}}:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$, reward $R:\mathcal{S}\to\mathbb{R}$, and horizon $K$. The state $s_{t}$ at iteration $t$ is

$$s_{t}=(c_{t},\,\mathbf{m}_{t},\,H_{t}), \tag{9}$$

where $c_{t}\in\mathcal{C}$ is the current code, $\mathbf{m}_{t}=\mathrm{Eval}(M_{t})\in\mathbb{R}^{d}$ is the metric vector of $M_{t}=\mathrm{Execute}(c_{t},P)$ evaluated on the held-out validation split (Section [5.1.3](https://arxiv.org/html/2606.08146#S5.SS1.SSS3)), and $H_{t}=\{(a_{i},R_{i},D_{i})\}_{i=1}^{t-1}$ records past actions, rewards, and natural-language diagnoses (with $H_{1}=\emptyset$). The action space $\mathcal{A}$ consists of seven handle-targeted edits: `missing_strategy`, `feature_transform`, `regularization`, `depth_and_estimators`, `sampling_params`, `threshold_tune`, and `algo_switch`.

Following ReAct \[[45](https://arxiv.org/html/2606.08146#bib.bib3)\] and textual-gradient methods \[[32](https://arxiv.org/html/2606.08146#bib.bib1)\], each iteration decomposes into two independent, stateless LLM calls:

$$\begin{split}&D_{t}=\mathrm{LLM}_{\mathrm{critique}}(c_{t},\mathbf{m}_{t},H_{t}),\\ &c_{t+1}=\mathrm{LLM}_{\mathrm{codegen}}(c_{t},D_{t},a_{t}).\end{split} \tag{10}$$

The first call (*reasoning*) emits a natural-language diagnosis $D_{t}\in\mathcal{D}$; the second call (*acting*) translates $D_{t}$, together with the selected action $a_{t}\in\mathcal{A}$, into a concrete code edit. $D_{t}$ is the natural-language analog of a numerical gradient: while the latter specifies the descent direction in continuous parameter space, $D_{t}$ specifies the edit direction in discrete code space.
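The critique-then-edit iteration of Eq. (10), combined with the best-iteration selection of Eq. (7), can be sketched as a short loop. Everything here is a toy under stated assumptions: the two LLM calls are replaced by deterministic stubs, the "code" is a one-line string, the critique stub is assumed to return the action handle together with the diagnosis, and the history stores the reward obtained after each edit. None of these choices is confirmed by the paper.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
History = List[Tuple[str, float, str]]  # (action a_i, reward R_i, diagnosis D_i)


def refine(
    code0: str,
    execute: Callable[[str], Metrics],
    critique: Callable[[str, Metrics, History], Tuple[str, str]],
    codegen: Callable[[str, str, str], str],
    reward: Callable[[Metrics], float],
    horizon: int,
) -> Tuple[str, Metrics, float]:
    """Run at most `horizon` edits and return the best (code, metrics, reward)."""
    code, metrics = code0, execute(code0)
    best = (code, metrics, reward(metrics))
    history: History = []
    for _ in range(horizon):
        diagnosis, action = critique(code, metrics, history)  # reasoning call
        code = codegen(code, diagnosis, action)               # acting call
        metrics = execute(code)
        r = reward(metrics)
        history.append((action, r, diagnosis))
        if r > best[2]:
            best = (code, metrics, r)
    return best


# Deterministic stand-ins for the sandbox runner and the two LLM calls.
def toy_execute(code: str) -> Metrics:
    depth = int(code.split("=")[1])
    return {"F1": 1.0 - abs(depth - 3) / 10}  # toy metric that peaks at depth 3


def toy_critique(code: str, metrics: Metrics, history: History) -> Tuple[str, str]:
    return "model may be underfitting; try a deeper tree", "depth_and_estimators"


def toy_codegen(code: str, diagnosis: str, action: str) -> str:
    depth = int(code.split("=")[1])
    return f"max_depth={depth + 1}"


best_code, best_metrics, best_reward = refine(
    "max_depth=1", toy_execute, toy_critique, toy_codegen,
    reward=lambda m: m["F1"], horizon=5,
)
print(best_code, best_reward)
```

Because the loop keeps the best-scoring state rather than the last one, later edits that hurt the reward (depths 4 to 6 in the toy run) do not degrade the returned result.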

Finally, the scalar reward $R(\mathbf{m}_{t})$ that drives the selection of $t^{*}$ in Eq. ([7](https://arxiv.org/html/2606.08146#S3.E7)) consolidates the multi-objective requirements of fraud detection:

$$R(\mathbf{m}_{t})=\underbrace{\big(w_{1}\mathrm{F1}+w_{2}\mathrm{AUPRC}+w_{3}\mathrm{R}@\mathrm{FPR}_{10^{-4}}+w_{4}\mathrm{Recall}\big)}_{\text{primary score}}\cdot W_{\mathrm{recall}}\cdot W_{\mathrm{precision}}\,+\,B, \tag{11}$$

where $(w_{1},w_{2},w_{3},w_{4})=(0.40,0.30,0.20,0.10)$. The two multiplicative gates and additive bonus realize three fraud-specific design intents: $W_{\mathrm{recall}}$ penalizes recall below a dataset-specific threshold $\tau$ produced by $\mathcal{A}_{1}$; $W_{\mathrm{precision}}$ prevents degeneration into a trivial high-recall classifier; and $B$ adds a small bonus when F1 or recall exceeds the initial baseline $\mathbf{m}_{0}$ produced by $\mathcal{A}_{2}$, providing a non-zero gradient even when both gates saturate. The operational consequences of each term are discussed in Section [4.3.3](https://arxiv.org/html/2606.08146#S4.SS3.SSS3).
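To make the structure of Eq. (11) concrete, the sketch below computes the weighted primary score with the stated weights and then applies two gates and a bonus. The gate shapes (linear ramps below a threshold), the precision floor `min_precision`, the bonus size, and the dictionary keys are all assumptions chosen for illustration; this section does not give the actual functional forms, which the paper defers to Section 4.3.3.

```python
from typing import Dict

Metrics = Dict[str, float]

# Weights (w1, w2, w3, w4) as stated for Eq. (11).
WEIGHTS = {"F1": 0.40, "AUPRC": 0.30, "R@FPR": 0.20, "Recall": 0.10}


def primary_score(m: Metrics) -> float:
    """Weighted sum inside the underbrace of Eq. (11)."""
    return sum(w * m[key] for key, w in WEIGHTS.items())


def composite_reward(m: Metrics, m0: Metrics, tau: float,
                     min_precision: float = 0.1, bonus: float = 0.01) -> float:
    """Primary score times two gates plus a bonus (gate forms are assumed)."""
    # Assumed recall gate: 1 at or above tau, linear penalty below it (tau > 0).
    w_recall = 1.0 if m["Recall"] >= tau else m["Recall"] / tau
    # Assumed precision gate: guards against flagging everything as fraud.
    w_precision = 1.0 if m["Precision"] >= min_precision else m["Precision"] / min_precision
    # Bonus when F1 or recall improves on the initial baseline m0.
    b = bonus if (m["F1"] > m0["F1"] or m["Recall"] > m0["Recall"]) else 0.0
    return primary_score(m) * w_recall * w_precision + b


baseline = {"F1": 0.5, "AUPRC": 0.5, "R@FPR": 0.5, "Recall": 0.5, "Precision": 0.5}
improved = dict(baseline, F1=0.6)

r_same = composite_reward(baseline, baseline, tau=0.5)    # gates open, no bonus
r_gated = composite_reward(baseline, baseline, tau=1.0)   # recall gate halves the score
r_bonus = composite_reward(improved, baseline, tau=0.5)   # F1 gain earns the bonus
print(r_same, r_gated, r_bonus)
```

The multiplicative gates mean a candidate with a strong primary score still loses most of its reward if recall falls well below `tau`, which is the behavior the text describes.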

## 4 SAGE

![Refer to caption](https://arxiv.org/html/2606.08146v1/x1.png)

Figure 1: Overall architecture of the SAGE framework. ① SAGE takes a heterogeneous tabular fraud dataset as input, ② profiles it through statistical analysis and a six-layer Data Diagnostic Tree (DDT), ③ and uses the resulting data diagnosis to select an algorithm and synthesize the initial training code. ④ The Optimization Agent then iteratively refines the code through an NLG-guided MDP loop, where code edits are evaluated by a fraud-specific composite reward and stored in trajectory memory. The process stops after convergence and returns the optimized model, final script, and evaluation metrics.

SAGE balances LLM *autonomy* (each agent independently chooses algorithms, designs features, diagnoses bottlenecks, and edits code) with explicit *mathematical constraints* (the DDT grounds early decisions in data, the finite-horizon MDP \[[37](https://arxiv.org/html/2606.08146#bib.bib2)\] with natural-language gradient (NLG) \[[32](https://arxiv.org/html/2606.08146#bib.bib1)\] signals constrains the iterative loop, and a composite reward enforces fraud-specific objectives), enabling SAGE to run without human intervention. SAGE decomposes the end-to-end process into three specialized, collaborative LLM-driven agents (Eq. ([3](https://arxiv.org/html/2606.08146#S3.E3))), each responsible for a classic operation in the fraud detection workflow. The Profiling Agent ($\mathcal{A}_{1}$, Section [4.1](https://arxiv.org/html/2606.08146#S4.SS1)) examines the input dataset and generates a data-aware profiling result containing statistical reports and a DDT. The Planning Agent ($\mathcal{A}_{2}$, Section [4.2](https://arxiv.org/html/2606.08146#S4.SS2)) uses this profiling result to select an appropriate algorithm and synthesize an initial executable training program tailored to the dataset.
The Optimization Agent ($\mathcal{A}_{3}$, Section [4.3](https://arxiv.org/html/2606.08146#S4.SS3)) then iteratively optimizes the program in the code space through an NLG-guided MDP under a fraud-specific reward, until convergence. These three agents constitute a strictly sequential process (Figure [1](https://arxiv.org/html/2606.08146#S4.F1)). Communication flows forward, with each downstream agent relying on the structured output of the upstream agent. The user supplies only the dataset $P$ and the target label column; the entire pipeline proceeds without human intervention and returns the triple $(c^{*},M^{*},\mathbf{m}^{*})$ at the best iteration.

### 4.1 Profiling Agent with DDT

The Profiling Agent $\mathcal{A}_1$ is the entry point of SAGE. Following the typed signature in Eq. ([5](https://arxiv.org/html/2606.08146#S3.E5)), it transforms a raw fraud detection problem $P=(X,y)$ into a structured, fraud-aware diagnosis $(\rho,T)$ that grounds every subsequent decision in the data itself. We focus here on the engineering realization of the DDT, which is the central structural innovation of $\mathcal{A}_1$.

#### 4.1.1 Why a tree

Feeding LLMs with raw fraud datasets exposes three coupled bottlenecks: (i) *information overload*, where hundreds of raw feature descriptors dilute key signals; (ii) *semantic gap*, where raw statistics (e.g., `imbalance_ratio` $=577.3$) require an additional reasoning step to become actionable; and (iii) *narrow attention*, where LLMs overemphasize the most salient features while neglecting global structure. The Data Diagnostic Tree (DDT), formally defined in Eq. ([8](https://arxiv.org/html/2606.08146#S3.E8)), addresses all three by compressing the dataset into a compact, semantically annotated tree of six layers. Table [2](https://arxiv.org/html/2606.08146#S4.T2) summarizes the encoded content and downstream usage of each layer.

Table 2: The six semantic layers of the Data Diagnostic Tree (DDT).

Downstream agents receive the DDT as indented hierarchical text, in which each of the six branches contains raw statistical fields generated by $\mathrm{Stats}(P)$ together with semantic annotations, with the origin of each annotation (*rule-based* or *LLM-derived*) marked inline. This inline origin tag is critical: it allows downstream agents to weigh deterministic facts and LLM inferences differently when their conclusions disagree, and lets human auditors trace every diagnostic claim back to either a numerical rule or a specific LLM call. Separating these two origins also enables targeted debugging: an incorrect statistic reflects a flaw in $\mathrm{Stats}(P)$, whereas an unreasonable diagnosis points to the $\mathrm{LLM}_{\mathrm{interpret}}$ prompt or backbone.

The indented format itself is deliberately chosen to align with the token-level patterns LLMs have absorbed from code and configuration files during pre-training, allowing the agent to traverse the six branches with minimal parsing overhead and avoiding the verbosity of JSON or XML serialization that would otherwise inflate the prompt budget. The same structure remains stable across datasets of vastly different scale, so prompt templates downstream of $\mathcal{A}_1$ do not need to be rewritten for each new fraud domain.

This representation compresses the statistical fingerprint of IEEE-CIS, a flattened report of over $50{,}000$ tokens covering all $433$ features, into a six-layer tree of approximately $2{,}000$ tokens, a $25\times$ reduction while preserving the decision-critical signals required by downstream agents. The same compression trend holds on the other four datasets, with reductions ranging from $4\times$ on the lower-dimensional Credit Card data to over $20\times$ on the high-dimensional industrial telecom dataset. A partial excerpt for the IEEE-CIS dataset is shown below; note that values such as $\tau=0.75$ are produced by $\mathrm{LLM}_{\mathrm{interpret}}$ from dataset-specific semantics rather than hard-coded defaults:

    DDT("ieee_cis")
    |- Scale
    |  |- n_rows: 590,540
    |  |- n_features: 433
    |  +- note: large-scale dataset
    |- Label
    |  |- imbalance_ratio: 27.58
    |  +- difficulty [rule]: HARD
    |- Feature
    |  |- n_numeric: 400, n_cat: 33
    |  +- top_correlated
    |     |- V258 (corr=0.42)
    |     +- V201 (corr=0.38)
    |- Quality
    |  +- note: heavy missingness, native-NA preferred
    |- Structure
    |  +- note: identifier-only, no temporal partition
    +- Diagnosis [LLM]
       |- type: transactional fraud
       |- key_risk: [V258, V201, C14]
       +- recall_threshold: tau=0.75
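To make the serialization concrete, the following minimal sketch renders a nested dictionary as indented tree text in the style of the excerpt above. The function name `render_ddt` and the dictionary layout are illustrative assumptions, not SAGE's actual implementation, and only a subset of the excerpt's fields is reproduced.

```python
def render_ddt(name, tree):
    """Render a nested dict as an indented text tree (hypothetical helper)."""
    lines = [f'DDT("{name}")']

    def walk(node, prefix):
        items = list(node.items())
        for i, (key, value) in enumerate(items):
            last = i == len(items) - 1
            branch = "+- " if last else "|- "
            if isinstance(value, dict):
                # Inner node: print the branch name, then recurse one level deeper.
                lines.append(f"{prefix}{branch}{key}")
                walk(value, prefix + ("   " if last else "|  "))
            else:
                # Leaf: print "field: value" on a single line.
                lines.append(f"{prefix}{branch}{key}: {value}")

    walk(tree, "")
    return "\n".join(lines)


# Field values copied from the IEEE-CIS excerpt; the dict structure is assumed.
ddt = {
    "Scale": {"n_rows": "590,540", "n_features": 433},
    "Label": {"imbalance_ratio": 27.58, "difficulty [rule]": "HARD"},
    "Diagnosis [LLM]": {"type": "transactional fraud", "recall_threshold": "tau=0.75"},
}
text = render_ddt("ieee_cis", ddt)
print(text)
```

Keeping the origin tag (`[rule]` or `[LLM]`) inside the key, as here, is one simple way to preserve it through serialization.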

#### 4.1.2 DDT Construction Procedure

The DDT is constructed in two phases, separating statistical numerical analysis from LLM-based semantic interpretation.

##### Phase 1: Statistical Computation $\mathrm{Stats}(P)$ ($L_{\mathrm{scale}}$–$L_{\mathrm{structure}}$)

This phase is entirely rule-based. Given $P$, $\mathrm{Stats}(P)$ computes the basic dimensions, the imbalance ratio $\mathrm{IR}=n_{\mathrm{neg}}/n_{\mathrm{pos}}$, feature types, the missing rate per column (with a $50\%$ high-missingness threshold, following common practice in tabular preprocessing), distribution moments (skewness, kurtosis), the top-$K$ feature–label correlations, and special structures (ID columns with uniqueness $>95\%$, time columns, subset-fraud columns). Each statistic is mapped to semantic labels through threshold rules and written into the corresponding branch (e.g., $\mathrm{IR}>100$ becomes “extremely imbalanced, requiring sample weighting” under $L_{\mathrm{label}}$), populating the first five layers.
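A few of these rules can be sketched in plain Python. The sketch below is an illustration under stated assumptions, not the SAGE code: the function name `stats`, the list-of-dicts input, and the output keys are hypothetical, and uniqueness is measured over non-missing values. Only the thresholds ($50\%$ missingness, $95\%$ uniqueness, $\mathrm{IR}>100$) come from the text.

```python
def stats(rows, label, high_missing=0.50, id_uniqueness=0.95):
    """Rule-based profile of a table given as a list of dicts (hypothetical layout)."""
    n = len(rows)
    n_pos = sum(1 for r in rows if r[label] == 1)
    n_neg = n - n_pos
    ir = n_neg / n_pos if n_pos else float("inf")  # IR = n_neg / n_pos

    columns = [c for c in rows[0] if c != label]
    missing_rate = {}
    id_columns = []
    for c in columns:
        present = [r[c] for r in rows if r.get(c) is not None]
        missing_rate[c] = 1 - len(present) / n
        # ID-like column: more than 95% of the observed values are distinct.
        if present and len(set(present)) / len(present) > id_uniqueness:
            id_columns.append(c)

    return {
        "n_rows": n,
        "n_features": len(columns),
        "imbalance_ratio": ir,
        # Threshold rule quoted in the text; other label rules are not specified there.
        "label_note": "extremely imbalanced, requiring sample weighting" if ir > 100 else None,
        "high_missing": [c for c in columns if missing_rate[c] > high_missing],
        "id_columns": id_columns,
    }


# Toy data: 2 fraud rows out of 304, a unique ID column, and a mostly missing column.
rows = [
    {"txn_id": i, "amount": i % 5, "device": "d" if i % 3 == 0 else None, "is_fraud": int(i < 2)}
    for i in range(304)
]
profile = stats(rows, "is_fraud")
print(profile)
```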

##### Phase 2: Diagnostic Inference ($L_{\mathrm{diagnosis}}$)

The sixth layer is populated by a single $\mathrm{LLM}_{\mathrm{interpret}}$ call that reasons over the first five layers and outputs: (i) Dataset type (transactional, behavioral, or profile-based fraud), inferred from feature names, top-correlated features, and the presence of time/ID columns; (ii) Key risk features, a short list of fraud-distinguishing columns drawn from the top-$K$ correlated columns combined with semantic priors of feature names (e.g., transaction amount, device fingerprint, time field); and (iii) Recall threshold $\tau$, subsequently used by the recall-constraint gate $W_{\mathrm{recall}}$ in Eq. ([11](https://arxiv.org/html/2606.08146#S3.E11)). By committing to a complete dataset profile at this analysis phase, rather than letting downstream agents drift on arbitrary cues, SAGE remains grounded in dataset evidence throughout the optimization loop.

### 4.2 Planning Agent

The Planning Agent $\mathcal{A}_2$ converts the structured diagnosis $(\rho,T)$ produced by $\mathcal{A}_1$ into an executable training program tailored to the target dataset, following Eq. ([6](https://arxiv.org/html/2606.08146#S3.E6)). $\mathcal{A}_2$ delegates almost the entire generation task to a single $\mathrm{LLM}_{\mathrm{codegen}}$ call, with algorithm selection, feature engineering, hyperparameter values, and thresholding logic all autonomously written by the LLM. This autonomy is safe rather than chaotic, because the structured DDT $T$ ensures it is grounded in dataset evidence.

#### 4.2.1 Algorithm Selection Logic

The core responsibility of $\mathcal{A}_2$ is to select an algorithm family for $c_0$. Here, we deliberately avoid using hard-coded selection rules. Instead, leveraging the inference capabilities of the LLM-driven agent, $\mathrm{LLM}_{\mathrm{codegen}}$ receives the complete DDT $T$ in its prompt. Our prompt instructs the LLM to weigh the $L_{\mathrm{scale}}$, $L_{\mathrm{label}}$, $L_{\mathrm{feature}}$, and $L_{\mathrm{quality}}$ layers of $T$, evaluate a set of common candidate models against these layers, and describe the empirical strengths of each candidate. After committing to a choice, the LLM is also required to articulate why the selected algorithm is best suited for the diagnosed scenario. An example of the SAGE built-in prompt used at this stage is shown below:

Prompt for $\mathrm{LLM}_{\mathrm{codegen}}$

You are an expert ML engineer for fraud detection. You will receive a Data Diagnostic Tree (DDT) describing the target dataset and must (i) choose a suitable algorithm, (ii) generate a complete Python training code following the structure (Loading, Preprocessing, Feature Engineering, Training, Metric Computation), and (iii) provide a brief justification for your algorithm choice.

DDT (input): \<DDT omitted for brevity\>

Empirical cues:

- XGBoost: robust missing-value handling; strong on heterogeneous tabular features.
- LightGBM: high training efficiency on large-scale datasets ($>$1M rows).
- CatBoost: native categorical encoding; strong with high-cardinality categoricals.
- $\cdots$ (additional candidates may be appended via prompt extension)

Selection guidelines: Read $L_{\mathrm{scale}}$ for size, $L_{\mathrm{label}}$ for imbalance, $L_{\mathrm{feature}}$ for feature typing, and $L_{\mathrm{quality}}$ for missingness. Weigh these jointly; a single layer should not dominate.

Output format: JSON with fields `algorithm`, `code`, `justification`.
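Because the prompt requests JSON with three named fields, a caller would typically validate the reply before executing anything. The following is a minimal sketch of such a check; the function name `parse_codegen_reply` and the error handling are assumptions for illustration, and the sample reply is invented rather than taken from a real run.

```python
import json

REQUIRED_FIELDS = ("algorithm", "code", "justification")  # field names from the prompt


def parse_codegen_reply(reply):
    """Parse the LLM reply and check the three required string fields (hypothetical helper)."""
    payload = json.loads(reply)
    if not isinstance(payload, dict):
        raise ValueError("reply must be a JSON object")
    bad = [f for f in REQUIRED_FIELDS
           if not isinstance(payload.get(f), str) or not payload[f].strip()]
    if bad:
        raise ValueError(f"missing or empty fields: {bad}")
    return payload


# Invented example reply, for illustration only.
reply = json.dumps({
    "algorithm": "XGBoost",
    "code": "print('train')",
    "justification": "Heavy missingness favors native NA handling.",
})
plan = parse_codegen_reply(reply)
print(plan["algorithm"])
```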

#### 4.2.2 Code Structure Design

Code $c_0$ must have a fixed five-block structure: data loading, preprocessing, feature engineering, model training, and metric calculation. The specific content of each block is filled in by the LLM, but the block boundaries themselves are immutable. This design provides a consistent working environment for the LLM, reduces quality variation in the generated code, and exposes fixed locations at which the Optimization Agent can intervene without rewriting the entire script; these locations are referred to as *optimization handles*. Each annotation marked `# handle: <name>` corresponds to an action type in the action space $\mathcal{A}$ (defined in Section [4.3](https://arxiv.org/html/2606.08146#S4.SS3)). These handle conventions enable the Optimization Agent to translate a natural-language gradient such as “the threshold for recalling the target is too high” into a precise local edit at the corresponding handle.
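The handle convention can be illustrated with a small sketch that locates `# handle: <name>` annotations and rewrites only the annotated line. The helper names, the one-statement-per-handle layout, and the handle name `hyperparams` are assumptions made for this example; `threshold_tune` is the action name used in the paper.

```python
import re

HANDLE = re.compile(r"#\s*handle:\s*(\w+)")


def find_handles(code):
    """Map each handle name to the index of the line carrying its annotation."""
    handles = {}
    for i, line in enumerate(code.splitlines()):
        match = HANDLE.search(line)
        if match:
            handles[match.group(1)] = i
    return handles


def edit_handle(code, name, new_statement):
    """Replace only the statement at the named handle, keeping indentation and tag."""
    lines = code.splitlines()
    idx = find_handles(code)[name]
    indent = lines[idx][: len(lines[idx]) - len(lines[idx].lstrip())]
    lines[idx] = f"{indent}{new_statement}  # handle: {name}"
    return "\n".join(lines)


# Two-line stand-in for a generated script (illustrative, not SAGE output).
c0 = "\n".join([
    "params = {'max_depth': 6}  # handle: hyperparams",
    "delta = 0.5  # handle: threshold_tune",
])
c1 = edit_handle(c0, "threshold_tune", "delta = 0.35")
print(c1)
```

A local edit of this kind leaves every other line of the script byte-identical, which is what makes consecutive iterations easy to diff and audit.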

A typical $c_0$ generated by $\mathcal{A}_2$ takes the following form:

    Algorithm 1: Initial Training Script c_0 by A_2
    Input:  P = (X, y), ρ, DDT T
    Output: M_0, m_0
    B1: Loading
        df ← Load(P)
    B2: Preprocessing
        df ← Impute(df)
    B3: Feature Eng.
        df ← Transform(df, T)
    B4: Training
        θ ← {n_est, depth, reg, …}
        M_0 ← Fit(θ, X_tr, y_tr)
    B5: Evaluation
        δ ← 0.5
        m_0 ← Eval(M_0, X_val, y_val, δ)
    return (M_0, m_0)

### 4.3 NLG-Guided MDP Optimization Agent

The Optimization Agent $\mathcal{A}_3$ iteratively refines the initial code $c_0$ produced by $\mathcal{A}_2$, following Eq. ([7](https://arxiv.org/html/2606.08146#S3.E7)), until convergence. $\mathcal{A}_3$ also depends on the initial model $M_0$, whose baseline metrics $\mathbf{m}_0$ are used by the baseline-bonus term $B$ of the reward function (Eq. ([11](https://arxiv.org/html/2606.08146#S3.E11))). The loop is implemented as a ReAct-style reasoning–acting cycle [[45](https://arxiv.org/html/2606.08146#bib.bib3)]: in each iteration, the LLM first reasons by producing a natural-language-gradient (NLG) diagnosis, then converts the diagnosis into a code edit.

#### 4.3.1 MDP Loop and Convergence

The optimization is a finite-horizon MDP whose state $s_t=(c_t,\mathbf{m}_t,H_t)$ and seven handle-targeted actions $\mathcal{A}$ are defined in Eq. ([9](https://arxiv.org/html/2606.08146#S3.E9)) and Section [3.3.4](https://arxiv.org/html/2606.08146#S3.SS3.SSS4). We focus here on the engineering of the loop. Each tuple $(a_i,R_i,D_i)$ recorded in $H_t$ captures the action selected at iteration $i$, the natural-language diagnosis $D_i$ that motivated it, and the scalar reward $R_i=R(\mathbf{m}_{i+1})$ observed after executing the resulting code $c_{i+1}$. Although the LLM calls within an iteration are stateless, $H_t$ is fed into the next iteration’s prompt, granting the agent access to its own trajectory. Action selection defaults to LLM-driven decision-making, but when F1 stagnates for $\geq 2$ consecutive rounds an action switch is triggered, guiding the LLM to explore new actions in $\mathcal{A}$. The loop terminates either when $t=K$, or when $R(\mathbf{m}_t)$ improves by less than a tolerance $\varepsilon$ for $k$ consecutive iterations. We use $K=20$, $k=4$, and $\varepsilon=0.001$ in this work.
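The two stopping conditions and the stagnation trigger can be sketched as follows. The text does not specify whether "improvement" is measured against the previous reward or the best reward so far, nor exactly how F1 stagnation is detected; this sketch assumes best-so-far for the reward and "no increase over the previous round" for F1. The function names are hypothetical.

```python
def should_stop(rewards, K=20, k=4, eps=0.001):
    """True once K iterations have run, or the reward gained < eps for k straight iterations."""
    if len(rewards) >= K:
        return True
    stalled = 0
    best = float("-inf")
    for r in rewards:
        # Assumption: improvement is measured against the best reward seen so far.
        stalled = stalled + 1 if r - best < eps else 0
        best = max(best, r)
    return stalled >= k


def force_explore(f1_history, rounds=2):
    """True when F1 has not increased for `rounds` consecutive rounds (assumed definition)."""
    if len(f1_history) <= rounds:
        return False
    recent = f1_history[-(rounds + 1):]
    return all(later <= earlier for earlier, later in zip(recent, recent[1:]))


rewards = [0.50, 0.60, 0.6004, 0.6001, 0.6003, 0.6002]
print(should_stop(rewards[:5]), should_stop(rewards))
print(force_explore([0.70, 0.70, 0.70]), force_explore([0.70, 0.70, 0.71]))
```

With $k=4$, the first call still continues (three stalled iterations) and the second stops (four stalled iterations).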

#### 4.3.2 NLG as Textual Gradient

A distinguishing feature of $\mathcal{A}_3$ is that its optimization does not rely solely on metric values to drive code refinement. Each iteration decomposes into two stateless LLM calls (Eq. ([10](https://arxiv.org/html/2606.08146#S3.E10))): the first call acts as a *critic* that produces a natural-language gradient $D_t$ in the code space, identifying bottlenecks and proposing improvement directions; the second call acts as an *executor* that translates $D_t$, together with the selected action $a_t$, into a concrete code edit at the corresponding handle. The two calls share no memory; only $D_t$ and $a_t$ flow between them, making every reasoning–acting cycle auditable through $H_t$. SAGE’s *self-reflection* thus materializes through $H_t$: the LLM reviews its own previous code and outcomes before deciding the next action.

#### 4.3.3 Composite Reward Design

The composite reward $R(\mathbf{m}_t)$ defined in Eq. ([11](https://arxiv.org/html/2606.08146#S3.E11)) consolidates the multi-objective requirements of fraud detection into a single scalar that drives the MDP. The weights $(w_1,w_2,w_3,w_4)=(0.40,0.30,0.20,0.10)$ reflect the relative priorities of classification balance, ranking quality, low-false-positive-rate detection, and overall recall, while the two multiplicative gates and the additive bonus realize three fraud-specific design intents: $W_{\mathrm{recall}}$ penalizes recall below the dataset-specific threshold $\tau$ set by $\mathcal{A}_1$, $W_{\mathrm{precision}}$ prevents degeneration into a trivially high-recall classifier, and $B$ provides a non-zero gradient signal when both gates saturate. Recall and Precision are computed internally for the reward and are not part of the reported metric vector $\mathbf{m}$; conversely, MCC belongs to $\mathbf{m}$ for final reporting but is excluded from the reward, since F1 and MCC are strongly correlated under extreme imbalance and including both would be redundant for driving optimization. Together, $R$, $D_t$, and $\mathcal{A}$ form a closed optimization loop: $\mathcal{A}_3$ measures progress through $R$, articulates the cause of any shortfall through $D_t$, and acts on a specific handle through $a_t$. Crucially, every quantity inside $R$ (F1, AUPRC, R@FPR$_{10^{-4}}$, Recall, Precision, the baseline-bonus comparison, and the threshold value used by `threshold_tune`) is computed on the validation split alone, so the test split remains untouched throughout the optimization loop.
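The structure described here (a weighted sum, two multiplicative gates, and an additive bonus) can be sketched as below. This is a hedged illustration, not Eq. (11) itself: the linear-ramp gate, the precision floor `p_min`, the bonus size, and the metric key names are all assumptions introduced for the example; only the four weights, the role of $\tau$, and the bonus condition come from the text.

```python
WEIGHTS = (0.40, 0.30, 0.20, 0.10)  # paper's weights: F1, AUPRC, R@FPR, Recall


def gate(value, floor):
    """Hypothetical soft gate: 1.0 at or above the floor, linear ramp below it."""
    return 1.0 if value >= floor else value / floor


def composite_reward(m, baseline, tau, p_min=0.5, bonus=0.01):
    """Gated weighted sum plus baseline bonus (assumed functional form)."""
    w1, w2, w3, w4 = WEIGHTS
    base = w1 * m["f1"] + w2 * m["auprc"] + w3 * m["r_at_fpr"] + w4 * m["recall"]
    w_recall = gate(m["recall"], tau)           # penalizes recall below tau
    w_precision = gate(m["precision"], p_min)   # penalizes trivially high-recall models
    # Bonus when F1 or recall exceeds the initial baseline m_0.
    b = bonus if (m["f1"] > baseline["f1"] or m["recall"] > baseline["recall"]) else 0.0
    return w_recall * w_precision * base + b


# Illustrative validation metrics, not results from the paper.
m0 = {"f1": 0.60, "recall": 0.85}
good = {"f1": 0.80, "auprc": 0.70, "r_at_fpr": 0.50, "recall": 0.90, "precision": 0.72}
low_recall = dict(good, recall=0.375)
print(composite_reward(good, m0, tau=0.75))
print(composite_reward(low_recall, m0, tau=0.75))
```

In this sketch, halving recall relative to $\tau$ halves the gated term, which is the kind of pressure the recall gate is meant to exert.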

### 4.4 Algorithm and Complexity

We now consolidate the three agents into a single end\-to\-end procedure \(Algorithm[2](https://arxiv.org/html/2606.08146#alg2)\) and analyze its computational complexity\.

    Algorithm 2: SAGE: End-to-End Pipeline
    Input:  P; K; ε; k
    Output: (c*, M*, m*)
    Stage 1: A_1 Profiling
        ρ ← Interpret(Stats(P))
        T ← BuildDDT(P)                                ▷ 6-layer
    Stage 2: A_2 Planning
        c_0 ← Codegen(ρ, T)
        (M_0, m_0) ← Execute(c_0, P)
    Stage 3: A_3 Optimization
        H_1 ← ∅; t ← 1; (c_t, M_t, m_t) ← (c_0, M_0, m_0)
        while t < K and not converged do
            D_t ← Critique(c_t, m_t, H_t)              ▷ reasoning
            a_t ← π(s_t)                               ▷ LLM-driven; force-explore on stagnation
            c_{t+1} ← Codegen(c_t, D_t, a_t)           ▷ acting
            (M_{t+1}, m_{t+1}) ← Execute(c_{t+1}, P)
            R_t ← R(m_{t+1})                           ▷ Eq. (11)
            H_{t+1} ← H_t ∪ {(a_t, R_t, D_t)}; t ← t + 1
        end while
        t* ← argmax_t R(m_t)
        (c*, M*, m*) ← (c_{t*}, M_{t*}, m_{t*})
    return (c*, M*, m*)

##### Complexity analysis

We characterize the cost of each stage in terms of two atomic operations: one LLM call $\mathcal{O}(L)$, whose latency depends primarily on the model’s response time to the prompt, and one sandboxed code execution $\mathcal{O}(E)$, which trains the model on $P$.

$\mathcal{A}_1$ costs $\mathcal{O}(np+L)$ from a statistical traversal of $P$ plus one LLM call; $\mathcal{A}_2$ costs $\mathcal{O}(L+E)$ from one LLM call and one sandbox execution; and $\mathcal{A}_3$ costs $\mathcal{O}(K(L+E))$ from at most $K$ iterations each consuming two LLM calls and one execution. The total cost of SAGE is therefore

$$\mathcal{O}\!\big(\underbrace{np}_{\text{Stats}}+\underbrace{(K+1)\,L}_{\text{LLM calls}}+\underbrace{(K+1)\,E}_{\text{executions}}\big)\;=\;\mathcal{O}\!\big(K\,(L+E)\big), \tag{12}$$

given $K\gg 1$ and $np\ll K(L+E)$. Empirically, each LLM call ($L$) takes 5–30 seconds on commercial APIs, while each sandboxed model training ($E$) takes seconds to minutes depending on dataset size. In our full-pipeline runs, the average end-to-end wall-clock time is dominated by the Stage 3 Optimization Agent, consistent with the linear-in-$K$ scaling predicted by Eq. ([12](https://arxiv.org/html/2606.08146#S4.E12)).
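As a back-of-the-envelope check on this scaling, the sketch below counts operations per stage exactly as the text describes them (one LLM call each for $\mathcal{A}_1$ and $\mathcal{A}_2$, two LLM calls and one execution per iteration of $\mathcal{A}_3$). The exact call count therefore differs from the $(K+1)\,L$ term by a constant factor, which the $\mathcal{O}$-notation absorbs. The function name, the per-cell statistics cost, and the chosen values of $L$ and $E$ are assumptions for illustration; the result is an upper bound, since early convergence can end the loop before $K$ iterations.

```python
def sage_cost(n, p, K, L, E, stats_seconds_per_cell=1e-8):
    """Upper-bound wall-clock estimate in seconds (hypothetical cost model)."""
    llm_calls = 1 + 1 + 2 * K   # A1 interpret + A2 codegen + (critic, executor) per iteration
    executions = 1 + K          # initial script + one sandbox run per iteration
    seconds = n * p * stats_seconds_per_cell + llm_calls * L + executions * E
    return {"llm_calls": llm_calls, "executions": executions, "seconds": seconds}


# IEEE-CIS dimensions from the DDT excerpt; L and E are assumed mid-range values.
cost = sage_cost(n=590_540, p=433, K=20, L=15.0, E=60.0)
print(cost)
```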

## 5 Experiments

### 5.1 Experimental Setup

#### 5.1.1 Datasets

We evaluate SAGE on five tabular fraud detection datasets. Four are publicly available, widely used fraud benchmarks from the Kaggle platform; the fifth, named TeleGuard, is a non-public dataset provided by a telecommunications operator we are collaborating with. Table [3](https://arxiv.org/html/2606.08146#S5.T3) summarizes their key statistics.

Table 3: Statistics of the five fraud detection datasets. IR = imbalance ratio ($n_{\mathrm{neg}}/n_{\mathrm{pos}}$).

The five datasets together span a wide spectrum of fraud detection scenarios. Credit Card ([https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)) contains 28 PCA-anonymized principal components plus the raw `Amount` and `Time` columns, collected over two days of European credit-card transactions in September 2013 [[31](https://arxiv.org/html/2606.08146#bib.bib4)]. PaySim ([https://www.kaggle.com/datasets/ealaxi/paysim1](https://www.kaggle.com/datasets/ealaxi/paysim1)) is a synthetic mobile-payment log simulated from real African operator data, covering five transaction types (`CASH_IN`, `CASH_OUT`, `TRANSFER`, `DEBIT`, `PAYMENT`) [[26](https://arxiv.org/html/2606.08146#bib.bib5)]. IEEE-CIS ([https://www.kaggle.com/c/ieee-fraud-detection](https://www.kaggle.com/c/ieee-fraud-detection)) is the e-commerce fraud benchmark released by Vesta Corporation, containing 433 anonymized features (including the V1–V339 block) with missing rates exceeding $75\%$ in many columns. Elliptic ([https://www.kaggle.com/datasets/ellipticco](https://www.kaggle.com/datasets/ellipticco)) is a Bitcoin transaction table annotated with licit/illicit labels; each transaction carries 166 features combining transaction attributes and graph-aggregated statistics [[40](https://arxiv.org/html/2606.08146#bib.bib6)]. Finally, TeleGuard is a non-public dataset provided by our industrial collaborator, a telecom operator. It comprises 47 features covering subscriber profiles, calling behavior, account-opening patterns, and network metadata, with each record labeled by the operator’s anti-fraud investigation team. TeleGuard is included not to expand the public benchmark suite, but to verify that SAGE remains effective on a real industrial dataset whose distribution and feature semantics differ substantially from those of public benchmarks.

#### 5.1.2 Baselines

We compared SAGE to the following five baselines:

- RS (Random Search): For each dataset, RS adopts the same algorithm, data preprocessing, and feature engineering that SAGE selects, and then performs random sampling over the hyperparameter space of that algorithm. The number of sampling rounds is set equal to the number of optimization iterations SAGE uses on the same task, and the best-performing round is recorded as the result. RS thus replaces SAGE’s directed, gradient-like refinement with undirected random search under an identical search budget.
- Manual (3 human data scientists): We invited three anti-fraud data scientists to manually perform data analysis, feature engineering, model selection, and hyperparameter tuning for each dataset, establishing a robust human benchmark with average-level results. This reflects the workload and output of a typical industry anti-fraud expert team without the use of automation.
- FLAML [[38](https://arxiv.org/html/2606.08146#bib.bib7)]: Microsoft’s open-source AutoML framework that uses cost-aware Bayesian optimization to automatically select algorithms and tune hyperparameters within a given time budget. Rather than treating all configurations as equally expensive, FLAML explicitly models the training cost of each candidate and prioritizes low-cost, high-potential configurations, making it a strong and efficiency-oriented representative of classical AutoML. We grant FLAML a time budget comparable to SAGE’s end-to-end runtime to ensure a fair comparison.
- AutoGluon [[16](https://arxiv.org/html/2606.08146#bib.bib8)]: Amazon’s open-source AutoML framework that builds an ensemble of multiple models via bagging and multi-layer stacking, returning the strongest combined predictor. Unlike methods that search for a single best model, AutoGluon trains a diverse pool of base learners (including gradient-boosted trees, neural networks, and others) and combines them through a layered stacking strategy, which typically yields highly competitive accuracy at the cost of greater training and inference overhead. It represents the ensemble-based, accuracy-oriented state of the art in AutoML, and serves as the strongest non-LLM automated baseline in our comparison.
- LLM as Coder: A single-call LLM baseline. We pass the entire dataset description directly to the LLM and ask it to generate the complete model code, which is then executed once to obtain the result. We use Claude Opus 4.7 [[3](https://arxiv.org/html/2606.08146#bib.bib9)], a powerful LLM, as both the data analyst and the coder. This baseline uses no multi-agent design, no DDT structure, and no iterative optimization; it represents the simplest form of LLM-driven coding.

All baselines operate under the same three-way data split as SAGE (Section [5.1.3](https://arxiv.org/html/2606.08146#S5.SS1.SSS3)). Whether RS selects its best-performing round, FLAML and AutoGluon tune internally, Manual experts choose their final hyperparameters and threshold, or LLM-as-Coder runs its single-shot evaluation, model selection and threshold tuning are always performed on the validation split, and the reported test-split metrics are not used at any selection stage.

#### 5.1.3 Evaluation Metrics

Fraud detection is a severely imbalanced classification task with multi-objective operational constraints, so single-number metrics such as accuracy are uninformative. We report the following four metrics as our main indicators:

- F1: the harmonic mean of precision and recall, measuring the overall balance between catching more fraud and reducing false alarms.
- MCC (Matthews Correlation Coefficient) [[28](https://arxiv.org/html/2606.08146#bib.bib15)]: a correlation-based classification metric that takes values in $[-1,1]$ and remains informative under extreme class imbalance; $\text{MCC}=1$ denotes perfect classification and $\text{MCC}=0$ denotes random prediction.
- R@FPR$_{10^{-4}}$ (Recall at FPR $=10^{-4}$): the recall achieved when the false-positive rate is fixed at one in ten thousand, capturing the model’s detection ability under the stringent false-alarm budget typical of production fraud systems. On datasets whose test split contains fewer than $10{,}000$ negative samples, this operating point falls below the minimum measurable FPR.
- AUPRC (Area Under the Precision–Recall Curve) [[11](https://arxiv.org/html/2606.08146#bib.bib16)]: the threshold-independent ranking quality, widely regarded as the most informative scalar metric for imbalanced binary classification.
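Two of these metrics are less commonly hand-computed, so a minimal pure-Python sketch may help. It is an illustration, not the evaluation code used in the paper: the function names are hypothetical, and the recall-at-FPR routine assumes "recall at the best threshold whose FPR does not exceed the budget". With fewer than $10{,}000$ negatives, a single false positive already exceeds $10^{-4}$, so the routine returns recall at the zero-false-positive operating point.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def recall_at_fpr(scores, labels, max_fpr=1e-4):
    """Best recall over score thresholds whose false-positive rate stays <= max_fpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        if n_neg and fp / n_neg > max_fpr:
            break
        best = tp / n_pos
        i = j
    return best


# Toy scores and labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]
print(recall_at_fpr(scores, labels))
print(mcc(tp=2, fp=0, tn=2, fn=1))
```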

##### Data splits

For every dataset we use a strict three-way split: under each random seed, the data is partitioned into a training set (70%), a validation set (10%), and a test set (20%). The validation set is used exclusively inside SAGE’s optimization loop: the metric vector $\mathbf{m}_t$ that drives the reward $R(\mathbf{m}_t)$ and the best-iteration selection $t^{*}=\arg\max_t R(\mathbf{m}_t)$ are computed on the validation set, never on the test set. The action `threshold_tune` likewise sweeps the decision threshold $\delta$ on the validation set only. All reported metrics in Section [5.2](https://arxiv.org/html/2606.08146#S5.SS2) (F1, MCC, R@FPR$_{10^{-4}}$, AUPRC) are computed on the held-out test set using the model $M^{*}$ and threshold $\delta^{*}$ selected on the validation set, ensuring that no test data participates in any model selection or hyperparameter tuning. The same three-way split is applied identically to all baselines (RS, Manual, FLAML, AutoGluon, LLM-as-Coder); each baseline tunes on the validation set and reports on the test set under the matching seed.
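A seeded 70/10/20 partition of row indices can be sketched as follows. The function name is hypothetical, and the sketch uses a plain shuffle; whether the paper's split is stratified by label is not stated in this section, so that choice is an assumption.

```python
import random


def three_way_split(n_rows, seed, train=0.70, val=0.10):
    """Return (train, val, test) index lists for a seeded 70/10/20 split."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # the seed fully determines the partition
    n_train = round(n_rows * train)
    n_val = round(n_rows * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


tr, va, te = three_way_split(1000, seed=42)
print(len(tr), len(va), len(te))
```

Threshold tuning and best-iteration selection would then read only the validation indices, with the test indices consulted once for final reporting.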

##### Seed protocol

Each dataset is evaluated under five random seeds (42, 123, 456, 789, 2024). The seed simultaneously controls four sources of randomness: (i) the train/val/test partition, (ii) model initialization, (iii) sampling-based imbalance handlers (SMOTE, class weights), and (iv) LLM token sampling via the API-level `seed` parameter. All reported results are the mean and standard deviation over these five seeds. To prevent target leakage, all data-dependent preprocessing statistics, including frequency encodings, target encodings, mean imputation values, and feature standardization parameters, are fit on the training split alone and applied as a frozen transformation to the validation and test splits.
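The "fit on train, apply frozen" rule can be illustrated with mean imputation, one of the statistics named above. This is a minimal sketch with hypothetical helper names and toy values; the same pattern would apply to encodings and standardization parameters.

```python
def fit_mean_imputer(train_rows, columns):
    """Compute per-column means from the training split only."""
    means = {}
    for c in columns:
        present = [r[c] for r in train_rows if r[c] is not None]
        means[c] = sum(present) / len(present) if present else 0.0
    return means


def apply_imputer(rows, means):
    """Fill missing values with the frozen training means; never refit here."""
    return [
        {**r, **{c: (means[c] if r[c] is None else r[c]) for c in means}}
        for r in rows
    ]


# Toy splits, for illustration only.
train = [{"amount": 10.0}, {"amount": None}, {"amount": 30.0}]
val = [{"amount": None}, {"amount": 1000.0}]

means = fit_mean_imputer(train, ["amount"])
val_filled = apply_imputer(val, means)
print(means, val_filled)
```

The large validation value (1000.0) has no influence on the fill value, which is the point of freezing the transformation.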

#### 5.1.4 Implementation Details

##### LLM backbones

We instantiate SAGE with five different LLM backbones to demonstrate that the framework is not tied to any particular vendor: Claude Opus 4.7 [[3](https://arxiv.org/html/2606.08146#bib.bib9)], GPT-5.4 [[30](https://arxiv.org/html/2606.08146#bib.bib10)], DeepSeek-V3.2 [[12](https://arxiv.org/html/2606.08146#bib.bib11)], Qwen3-Max [[1](https://arxiv.org/html/2606.08146#bib.bib12)], and LLaMA-3.3-70B [[29](https://arxiv.org/html/2606.08146#bib.bib13)]. The first four are accessed via their respective vendor APIs, while LLaMA-3.3-70B is deployed locally on our cluster. The local deployment is deliberate: it verifies that SAGE can run entirely within a private network, for confidential settings in which data-handling constraints forbid the use of public LLM endpoints.

##### Temperature and randomness

All LLM calls use temperature $0.1$. The low temperature, combined with the API-level `seed` parameter described in Section [5.1.3](https://arxiv.org/html/2606.08146#S5.SS1.SSS3), ensures that each LLM call is reproducible under a fixed seed to the greatest extent possible. The only residual variance across the five seeds comes from the distinct train/val/test partitions and the resulting model variations.

##### Convergence parameters

We use $K=20$ (maximum iterations of $\mathcal{A}_3$), $\varepsilon=0.001$ (reward-improvement tolerance), $k=4$ (patience window before triggering early stopping), and $\tau_0=0.85$ (the default recall-constraint threshold, which $\mathcal{A}_1$ adapts per dataset; e.g., $\tau=0.75$ on IEEE-CIS). The reward weights $(w_1,w_2,w_3,w_4)=(0.40,0.30,0.20,0.10)$ are fixed across all experiments.

##### Sandbox environment and hardware

Generated code is executed in an isolated sandbox running Python 3.13.3 with XGBoost 3.1.3, LightGBM 4.6.0, scikit-learn 1.7.2, pandas 2.3.3, NumPy 2.1.3, and other necessary dependencies. All experiments are run on a server with dual AMD EPYC 7543 CPUs (128 cores), 1 TB RAM, and 8$\times$ NVIDIA A40 GPUs (46 GB each) running Ubuntu 24.04.3 LTS. The LLaMA-3.3-70B backbone is served locally via vLLM on a subset of these GPUs.

Building on this setup, our evaluation is organized around four research questions \(RQs\):

- RQ1: How does SAGE perform on different fraud detection datasets in terms of overall detection performance?
- RQ2: How sensitive is SAGE’s performance to the choice of the underlying LLM and datasets?
- RQ3: How interpretable is SAGE’s NLG-guided optimization and reward process?
- RQ4: How much does each of the three agents contribute to SAGE’s performance?

### 5.2 Main Results

To answer RQ1, we compare SAGE \(with Claude Opus 4\.7 as the backbone\) against five baselines across five datasets and four main metrics\. The full results are reported in Table[4](https://arxiv.org/html/2606.08146#S5.T4), and we further validate each metric with the Friedman test, complemented by a Friedman–Nemenyi critical\-difference analysis \(Figure[2](https://arxiv.org/html/2606.08146#S5.F2)\)\.

Table 4: SAGE versus five baselines on five datasets and four metrics (mean $\pm$ std over five seeds). For each (dataset, metric) the best value is in bold and the second-best is underlined. For Elliptic and TeleGuard, the test split contains only $\sim$8.4K and $\sim$6.2K negative samples respectively, so the minimum measurable FPR exceeds $10^{-4}$; on these two datasets the reported R@FPR$_{10^{-4}}$ is therefore equivalent to recall at the zero-false-positive operating point.

##### Overall performance

Across the $25$ method–dataset comparisons (five baselines $\times$ five datasets), SAGE wins $24$ (a $96.00\%$ win rate), where a method–dataset pair is counted as a SAGE win if SAGE outperforms the baseline on the majority of the four metrics. SAGE ranks first on $15$ of the $20$ individual (dataset, metric) cells. Its gains are largest on the hardest datasets: on PaySim, whose extreme imbalance ($\mathrm{IR}=773.7$) drives most baselines to collapse (RS reaches an F1 of only $0.1593$), SAGE attains $0.9973$, a $135.43\%$ average improvement; on IEEE-CIS, with $433$ heavily missing features, it improves on the baseline average by $55.54\%$. On the easier Elliptic and TeleGuard datasets it still secures top or near-top results, showing that its gains on hard datasets do not come at the expense of easy ones.
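The pair-level counting rule described above can be sketched as follows. This is an illustration under stated assumptions: every metric is treated as higher-is-better, a 2–2 split is not counted as a win (the text only says "majority"), and the metric names and values used in the usage lines are placeholders, not entries from Table 4. The function names `sage_wins_pair` and `win_rate` are hypothetical.

```python
from typing import Dict, Iterable


def sage_wins_pair(sage: Dict[str, float], baseline: Dict[str, float]) -> bool:
    """Return True if SAGE beats the baseline on a strict majority of metrics.

    Assumes both dicts share the same metric keys and that higher is better.
    """
    wins = sum(sage[metric] > baseline[metric] for metric in sage)
    return wins > len(sage) / 2


def win_rate(pair_outcomes: Iterable[bool]) -> float:
    """Percentage of method-dataset pairs that count as SAGE wins."""
    outcomes = list(pair_outcomes)
    return 100 * sum(outcomes) / len(outcomes)


# Placeholder scores for one (baseline, dataset) pair; not real results.
sage_scores = {"metric_a": 0.90, "metric_b": 0.80, "metric_c": 0.70, "metric_d": 0.60}
base_scores = {"metric_a": 0.80, "metric_b": 0.90, "metric_c": 0.60, "metric_d": 0.50}
print(sage_wins_pair(sage_scores, base_scores))  # SAGE leads on 3 of 4 metrics

# 24 wins out of 25 pairs reproduces the reported 96.00% win rate.
print(win_rate([True] * 24 + [False]))
```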

##### Honest analysis

The few cases where a baseline edges ahead are isolated single-metric wins on otherwise SAGE-dominated datasets, not signs of instability. AutoGluon leads on Credit Card R@FPR$_{10^{-4}}$ ($0.8041$ vs. $0.7918$) and on Elliptic/TeleGuard AUPRC, and Manual narrowly leads on IEEE-CIS R@FPR$_{10^{-4}}$ and AUPRC; in every case SAGE wins the remaining metrics on the same dataset. Crucially, SAGE never collapses: FLAML’s F1 std reaches $0.3425$ on Credit Card and LLM-as-Coder’s R@FPR$_{10^{-4}}$ drops to $0.1677$ on TeleGuard, whereas SAGE’s std stays consistently small.
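For readers unfamiliar with R@FPR$_{10^{-4}}$, the metric on which AutoGluon and Manual edge ahead, the sketch below shows one way to compute recall under a false-positive-rate budget, and why the budget degenerates to zero false positives when the test split has fewer than $10^{4}$ negatives (as the Table 4 caption notes for Elliptic and TeleGuard). The function name `recall_at_fpr` and the tie-handling choice (flag only scores strictly above the threshold) are assumptions, not the authors' evaluation code.

```python
import math
from typing import Sequence


def recall_at_fpr(y_true: Sequence[int], scores: Sequence[float],
                  max_fpr: float = 1e-4) -> float:
    """Recall at the lowest threshold whose FPR does not exceed max_fpr."""
    neg = sorted((s for y, s in zip(y_true, scores) if y == 0), reverse=True)
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative sample")
    # Number of false positives the budget allows on this test split.
    allowed_fp = math.floor(max_fpr * len(neg))
    if allowed_fp >= len(neg):
        return 1.0  # every sample may be flagged
    # Flag only scores strictly above the (allowed_fp + 1)-th highest negative,
    # so at most allowed_fp negatives are flagged even when scores tie.
    threshold = neg[allowed_fp]
    return sum(s > threshold for s in pos) / len(pos)


# With ~8.4K or ~6.2K negatives, a 1e-4 budget allows zero false positives.
print(math.floor(1e-4 * 8400), math.floor(1e-4 * 6200))
```

Because the allowed count is the floor of `max_fpr * n_neg`, any split with fewer than 10,000 negatives is evaluated at the zero-false-positive operating point.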

##### Critical\-difference analysis

We validate significance with the Friedman test \[[13](https://arxiv.org/html/2606.08146#bib.bib14)\] applied per metric (5 datasets, 6 methods), which rejects equal performance for all four metrics ($p=0.0012$, $0.0015$, $0.0032$, $0.0050$). Pooling all $20$ blocks and adding a Nemenyi post-hoc test yields the CD diagram in Figure [2](https://arxiv.org/html/2606.08146#S5.F2): SAGE attains the best average rank ($1.30$), ahead of Manual ($2.25$), AutoGluon ($3.05$), FLAML ($4.00$), RS ($5.05$), and LLM-as-Coder ($5.35$), with the pooled test highly significant ($\chi^{2}=72.46$, $p<10^{-13}$). At $\mathrm{CD}=1.69$, SAGE is significantly better than all baselines except Manual, whose gap ($0.95$) falls below the CD.
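The pooled statistics above can be checked from the reported average ranks alone. The sketch below assumes the standard formulas from Demšar (2006): the Friedman statistic without tie correction and the Nemenyi critical difference $\mathrm{CD}=q_{\alpha}\sqrt{k(k+1)/(6N)}$. The critical value $q_{0.05}=2.850$ for six methods is taken from that paper's table and is not stated in this section, so treat it as an assumption.

```python
import math

# Average ranks over the 20 pooled (dataset, metric) blocks, as reported above.
AVG_RANKS = {
    "SAGE": 1.30, "Manual": 2.25, "AutoGluon": 3.05,
    "FLAML": 4.00, "RS": 5.05, "LLM-as-Coder": 5.35,
}
N_BLOCKS = 20
# Assumption: Nemenyi critical value for 6 methods at alpha = 0.05.
Q_ALPHA_005_K6 = 2.850


def friedman_chi2(avg_ranks, n_blocks):
    """Friedman chi-square statistic from average ranks (no tie correction)."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_blocks / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k, n_blocks, q_alpha):
    """Critical difference between average ranks for the Nemenyi test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_blocks))


chi2 = friedman_chi2(list(AVG_RANKS.values()), N_BLOCKS)
cd = nemenyi_cd(len(AVG_RANKS), N_BLOCKS, Q_ALPHA_005_K6)
significantly_worse = [
    name for name, rank in AVG_RANKS.items()
    if name != "SAGE" and rank - AVG_RANKS["SAGE"] > cd
]
print(f"chi2 = {chi2:.2f}, CD = {cd:.2f}")
print("Significantly worse than SAGE:", significantly_worse)
```

Under these assumptions the sketch gives $\chi^{2}\approx 72.46$ and $\mathrm{CD}\approx 1.69$, in line with the reported values, and leaves Manual as the only baseline whose rank gap to SAGE stays inside the CD.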

![Refer to caption](https://arxiv.org/html/2606.08146v1/x2.png)

Figure 2: Critical-difference diagram (Friedman + Nemenyi, $\alpha=0.05$) over all $20$ (dataset, metric) blocks.

Takeaway 1: SAGE wins $96.00\%$ of the $25$ method–dataset comparisons and attains the best average rank ($1.30$), with statistically significant superiority over all automated baselines. It reaches, and on most metrics surpasses, expert human data scientists without any manual intervention.

### 5.3 Robustness and Sensitivity

To answer RQ2, we examine whether the effectiveness of SAGE depends on the specific LLM driving its agents. We re-run the entire pipeline on all five datasets using five backbones (Claude Opus 4.7, GPT-5.4, DeepSeek-V3.2, Qwen3-Max, and LLaMA-3.3-70B) and analyze both the win rate against the baselines and the performance spread across backbones.

##### Cross\-backbone Robustness

Figure [3](https://arxiv.org/html/2606.08146#S5.F3) reports, for each backbone, the win rate of SAGE over the $25$ method–dataset comparisons (five baselines $\times$ five datasets). All five backbones place SAGE clearly ahead of the baseline field: the win rate ranges from $94.00\%$ (Claude Opus 4.7) down to $85.00\%$ (Qwen3-Max), a spread of only $9.00$ percentage points, with an average of $88.80\%$. Even the weakest backbone wins more than four out of five comparisons, and the locally deployed LLaMA-3.3-70B reaches $90.00\%$, confirming that SAGE delivers strong results even under the on-premise, privacy-preserving deployment required for confidential data. Figure [3](https://arxiv.org/html/2606.08146#S5.F3) uses a finer granularity than the $96.00\%$ figure in Section [5.2](https://arxiv.org/html/2606.08146#S5.SS2): the latter counts a win at the method–dataset pair level, whereas Figure [3](https://arxiv.org/html/2606.08146#S5.F3) counts wins independently on each metric, so Claude Opus 4.7’s per-metric average ($94.00\%$) is consistent with, and slightly below, the aggregate $96.00\%$. A complementary perspective is the *spread*: a smaller spread on the main metrics means the framework is less sensitive to the choice of backbone. As summarized in Table [5](https://arxiv.org/html/2606.08146#S5.T5), SAGE’s average AUPRC spread is only $1.13\%$ and its average F1 spread is only $3.60\%$.

![Refer to caption](https://arxiv.org/html/2606.08146v1/x3.png)

Figure 3: Per-metric win rate of SAGE over the $25$ method–dataset comparisons (five baselines $\times$ five datasets) under five LLM backbones, with the four-metric average shown on the right.

Table 5: Cross-backbone performance spread per dataset, reported as the relative range across the five LLM backbones. The last row gives the average over datasets.

##### Sensitivity across data splits

Besides its robustness to the backbone, SAGE is also the least sensitive to the five random data splits. Returning to the main results (Table [4](https://arxiv.org/html/2606.08146#S5.T4)), SAGE achieves the smallest standard deviation on the vast majority of (dataset, metric) cells, whereas the baseline methods are highly sensitive in the worst cases: FLAML’s F1 standard deviation on Credit Card reaches $0.3425$, as a single failed split causes its mean to plummet, while the LLM-as-Coder baseline fluctuates by $0.3162$ in F1 on PaySim and drops to $0.1677$ in R@FPR$_{10^{-4}}$ on TeleGuard. In contrast, SAGE’s standard deviation remains consistently small: for example, $0.0004$ in F1 on PaySim and only $0.0166$ in F1 on Credit Card. Figure [4](https://arxiv.org/html/2606.08146#S5.F4) visualizes this contrast as a per-seed standard-deviation heatmap on the F1 metric, where SAGE is consistently among the most stable methods, attaining the lowest average F1 standard deviation on the five datasets. Crucially, unlike FLAML and LLM-as-Coder, SAGE never exhibits a catastrophic worst case.

![Refer to caption](https://arxiv.org/html/2606.08146v1/x4.png)

Figure 4: Per-seed standard deviation across the five datasets (lighter = more stable). SAGE attains the lowest or near-lowest F1 standard deviation on every dataset.

Takeaway 2: SAGE’s performance is largely insensitive to the underlying LLM: all five backbones keep its win rate at or above $85\%$ (average $88.80\%$), with an average spread of only $1.13\%$ in AUPRC and $3.60\%$ in F1 across datasets. It is also the least sensitive method to the random data splits, showing that the framework, rather than any specific model or partition, drives its effectiveness.

### 5.4 Case Study

To answer RQ3, we examine the interpretability of SAGE’s NLG-guided optimization and reward process. We trace a single run on IEEE-CIS, using Claude Opus 4.7 as the backbone, and examine the critiques generated by $\mathcal{A}_{3}$ and their reward trajectory. Figure [5](https://arxiv.org/html/2606.08146#S5.F5) shows the reward and related metrics as a function of the number of optimization iterations.

##### Interpretable Diagnosis and Action

At each iteration, the critic LLM emits a structured natural-language gradient organized as Observation, Diagnosis, and Action, which is both human-readable and directly auditable. The box below reproduces the complete critique for the first iteration. This diagnosis is not generic but is generated by the LLM from the characteristics of each iteration: it correctly attributes the initial model’s precision collapse ($0.3447$) to the interaction between an aggressive scale\_pos\_weight $=27.58$ and the default decision threshold of $0.50$, and further points out that the high-signal IEEE-fraud columns (TransactionAmt, the C/D groups, card1, addr1) are left unexploited. The corresponding actions are concrete and traceable: raising the threshold, reducing the positive weight, deepening the model, and adding domain-specific engineered features, each mapped to a specific optimization handle in the generated code, turning the LLM reasoning into a verifiable, reproducible edit history.

##### Reward\-Based Multi\-Objective Trade\-offs

Figure [5](https://arxiv.org/html/2606.08146#S5.F5) shows that the optimization is not simply about increasing the F1 score, but jointly weighs detection quality and operational constraints at every iteration. Iteration $t=1$ increases the F1 score from $0.4866$ to $0.6907$ while maintaining recall above the threshold $\tau$, and restores precision from $0.3447$ to $0.6319$, thus achieving a peak reward of $0.6954$. At $t=2$, the agent further advances feature engineering, obtaining a higher F1 score ($0.7658$) and AUPRC ($0.8177$), but its recall drops to $0.7176$, below $\tau$, triggering the recall-constraint gate $W_{\mathrm{recall}}$ in the composite reward and causing the reward to drop to $0.5001$. Because SAGE selects the iteration that maximizes the reward, i.e., $t^{*}=\arg\max_{t}R(\mathbf{m}_{t})$, it correctly preserves the recall-compliant $t=1$ model rather than the $t=2$ model that superficially has a higher F1 score. Therefore, within the SAGE framework, a higher F1 score obtained by sacrificing the recall constraint is rejected, indicating that the reward performs a true multi-objective trade-off rather than pursuing a single metric, which is exactly the behavior production fraud systems require when scoring candidate models against business-defined risk budgets.
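The selection rule $t^{*}=\arg\max_{t}R(\mathbf{m}_{t})$ can be illustrated with the numbers quoted in this paragraph. The sketch below does not reimplement the composite reward; it only takes the reported rewards as given and contrasts reward-based selection with naive F1-based selection. The recall at $t=1$ is not reported in the text, so the helper `recall_from_f1` (a hypothetical name) derives it from the identity $\mathrm{F1}=2PR/(P+R)$ using the rounded F1 and precision; the result is therefore approximate.

```python
TAU_IEEE_CIS = 0.75  # recall threshold on IEEE-CIS, from the convergence parameters

# Values quoted in the case study; fields not reported in the text are omitted.
iterations = [
    {"t": 1, "f1": 0.6907, "precision": 0.6319, "reward": 0.6954},
    {"t": 2, "f1": 0.7658, "auprc": 0.8177, "recall": 0.7176, "reward": 0.5001},
]


def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and precision P."""
    return f1 * precision / (2 * precision - f1)


best_by_reward = max(iterations, key=lambda m: m["reward"])
best_by_f1 = max(iterations, key=lambda m: m["f1"])
derived_recall_t1 = recall_from_f1(iterations[0]["f1"], iterations[0]["precision"])

print("selected by reward:", best_by_reward["t"])
print("selected by F1 alone:", best_by_f1["t"])
print(f"derived recall at t=1: {derived_recall_t1:.4f}")
print("t=2 violates recall constraint:", iterations[1]["recall"] < TAU_IEEE_CIS)
```

Reward-based selection returns $t=1$, whereas picking by F1 alone would return the constraint-violating $t=2$ model. The derived recall at $t=1$ is roughly $0.76$, consistent with the statement that recall stays above $\tau$ at that iteration.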

NLG critique $D_{1}$ at iteration $t=1$ on IEEE-CIS

- Observation: AUPRC=0.7373 ($<$ tau=0.75), F1=0.4866 low due to Precision=0.3447, Recall=0.8275 strong.
- Diagnosis: High scale\_pos\_weight=27.58 combined with threshold=0.50 inflates positive predictions causing precision collapse; also missing engineered features on rich IEEE-fraud columns (TransactionAmt, C1/C13/C14, D1/D2/D15, card1/addr1/emaildomain) which are known high-signal.
- Action: (1) Raise threshold 0.50 $\rightarrow$ 0.55 to restore precision. (2) Reduce scale\_pos\_weight 27.58 $\rightarrow$ 15.0 to reduce false positives while keeping recall. (3) Strengthen model: n\_estimators 500 $\rightarrow$ 800, max\_depth 7 $\rightarrow$ 8, learning\_rate 0.05 $\rightarrow$ 0.03. (4) Add features: log1p(TransactionAmt), C1/C13 and C14/C1 ratios, D1 $-$ D2, D15/D1, frequency encoding of card1/addr1/P\_emaildomain, card1 $\times$ card2.
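One way to see why such a critique yields an auditable edit history is to represent it as structured data and apply its hyperparameter edits mechanically. The sketch below is an illustration only: the `Critique` class and `apply_critique` function are hypothetical names, the (old, new) pairs are copied from actions (1) to (3) of the critique above, and the feature-engineering action (4) is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class Critique:
    """Hypothetical container for one Observation-Diagnosis-Action critique."""
    observation: str
    diagnosis: str
    # Hyperparameter edits proposed by the action: name -> (old value, new value).
    edits: Dict[str, Tuple[Any, Any]] = field(default_factory=dict)


def apply_critique(config: Dict[str, Any], critique: Critique) -> Dict[str, Any]:
    """Return a new config with the edits applied, checking each old value."""
    updated = dict(config)
    for name, (old, new) in critique.edits.items():
        if updated.get(name) != old:
            raise ValueError(f"{name}: expected {old!r}, found {updated.get(name)!r}")
        updated[name] = new
    return updated


BASE_CONFIG = {"threshold": 0.50, "scale_pos_weight": 27.58,
               "n_estimators": 500, "max_depth": 7, "learning_rate": 0.05}

CRITIQUE_T1 = Critique(
    observation="F1=0.4866 low due to Precision=0.3447, Recall=0.8275 strong.",
    diagnosis="High scale_pos_weight with threshold=0.50 inflates positives.",
    edits={"threshold": (0.50, 0.55), "scale_pos_weight": (27.58, 15.0),
           "n_estimators": (500, 800), "max_depth": (7, 8),
           "learning_rate": (0.05, 0.03)},
)

new_config = apply_critique(BASE_CONFIG, CRITIQUE_T1)
print(new_config)
```

Checking the old value before each edit means a critique that refers to a stale configuration fails loudly instead of silently overwriting it, which is one plausible way to keep the edit history verifiable.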

![Refer to caption](https://arxiv.org/html/2606.08146v1/x5.png)

Figure 5: Reward and underlying metrics over optimization iterations on IEEE-CIS.

Takeaway 3: SAGE’s optimization is fully interpretable: each iteration emits a human-readable Observation–Diagnosis–Action critique that pinpoints concrete failure causes and maps them to specific code edits. Its reward process is equally transparent, performing a genuine multi-objective trade-off that rejects a higher-F1 model violating the recall constraint in favor of the constraint-satisfying one.

### 5.5 Ablation Study

To answer RQ4, we conduct an ablation study on the IEEE-CIS benchmark to quantify the contribution of the three agents, using Claude Opus 4.7 as the backbone. Starting from the complete SAGE pipeline, we disable one agent at a time and measure the resulting performance change, reported as the mean $\pm$ std of F1 over five seeds. Removing $\mathcal{A}_{1}$ withholds the DDT and the other data-aware structures of the Profiling Agent, so the downstream agents run without data-aware guidance. Removing $\mathcal{A}_{2}$ replaces its DDT-driven algorithm selection with a fixed Random Forest model as the initial code template, disabling adaptive initial code selection and generation. Removing $\mathcal{A}_{3}$ skips the NLG-guided optimization loop, so the evaluation reflects only the initial code produced by $\mathcal{A}_{2}$. Table [6](https://arxiv.org/html/2606.08146#S5.T6) reports the results.

##### Each agent contributes

Removing any one agent lowers the F1 score, confirming that all three agents are necessary. $\mathcal{A}_{3}$ is the most critical: disabling it causes a $36.19\%$ drop in F1, demonstrating that data analysis and algorithm selection alone cannot produce a high-quality detector. $\mathcal{A}_{2}$ comes next: replacing its LLM-based algorithm selection with a fixed Random Forest code template causes a $25.74\%$ drop in F1. Even a fully executed optimization loop by $\mathcal{A}_{3}$ does not fully recover from a poor initial algorithm choice on IEEE-CIS, suggesting that the model-selection reasoning of $\mathcal{A}_{2}$ is a key contributor. $\mathcal{A}_{1}$ provides essential guidance: removing its DDT and semantic interpretation causes a $13.72\%$ drop in F1, leaving the downstream agents with only generic strategies and without a holistic understanding of the dataset.
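Read as a relative drop with respect to the full pipeline, which is how the Table 6 caption describes $\Delta$F1, the percentages above correspond to the following expression. The subscript notation ($\mathrm{F1}_{\mathrm{full}}$ for the complete pipeline and $\mathrm{F1}_{-\mathcal{A}_{i}}$ for the variant without agent $\mathcal{A}_{i}$) is introduced here for illustration and is not taken from the paper.

```latex
\Delta\mathrm{F1}_{-\mathcal{A}_{i}}
  = \frac{\mathrm{F1}_{\mathrm{full}} - \mathrm{F1}_{-\mathcal{A}_{i}}}
         {\mathrm{F1}_{\mathrm{full}}} \times 100\%
```

Under this reading, the $36.19\%$ drop for $\mathcal{A}_{3}$ means that the variant without the optimization loop retains about $63.81\%$ of the full pipeline's F1.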

Table 6: Ablation on IEEE-CIS. A ✓ indicates the agent is enabled and a ✗ indicates it is removed. $\Delta$F1 is the relative F1 drop with respect to the full SAGE pipeline.

Takeaway 4: The three agents form a progressive value chain: $\mathcal{A}_{1}$ supplies data insight, $\mathcal{A}_{2}$ makes the optimal algorithmic decision, and $\mathcal{A}_{3}$ pushes performance to its ceiling through iterative optimization. Their contributions are complementary rather than redundant, and removing any one degrades performance.

## 6 Discussion

The results above expose some limitations of SAGE, which also point to specific directions for future research. First, as observed in the honest analysis (Section [5.2](https://arxiv.org/html/2606.08146#S5.SS2)), SAGE currently generates a single classification model rather than a multi-model ensemble. On low-dimensional and well-behaved datasets such as Credit Card, AutoGluon’s re-stacking method achieves a slightly higher R@FPR$_{10^{-4}}$; on Elliptic and TeleGuard, its ensemble also maintains a slight advantage in AUPRC. One optimization direction is therefore to let the agent coordinate stacking layers in the code space, combining multiple model variants under the guided reward mechanism, so that SAGE can inherit the advantages of ensemble methods without sacrificing interpretability. Recent research on hybrid architectures \[[7](https://arxiv.org/html/2606.08146#bib.bib46)\] has demonstrated that a Transformer-LSTM-KELM scheme exploiting the complementarity of three heterogeneous modules outperforms nine mainstream baseline classifiers, including XGBoost and CatBoost. Second, SAGE’s current evaluation is limited to individual-level tabular fraud and does not leverage the relational signals available in some real settings. Integrating graph-derived features as an additional view in the DDT is therefore a natural next step toward a more relationally aware agentic anti-fraud framework.

## 7 Conclusion

This paper proposes SAGE, a novel multi-agent framework for detecting fraudulent behavior in tabular data at the individual level. Its main contribution lies in addressing the under-explored intersection between LLM-driven agent reasoning and the practical constraints of production-grade anti-fraud systems. Traditional AutoML, graph-based methods, and general LLM agents all fall short on one or more of four aspects: agent-centricity, tabular data processing capabilities, interpretability, and fraud detection specificity. To our knowledge, SAGE is the first end-to-end LLM-driven agent framework specifically built for tabular fraud detection. It coordinates data understanding, modeling, and reflective optimization through three dedicated agents. Our results demonstrate that grounding the process in structured data-diagnostic priors and casting optimization as a reward-driven search guided by natural-language gradients can transform a previously open-ended agent loop into one that meets the constraints of anti-fraud business requirements. We assess SAGE on five fraud datasets and across five LLM backbones in order to separate the contributions of the framework itself from those of the underlying language models. Empirical results confirm that SAGE outperforms strong AutoML, human-expert, and LLM-as-coder baselines.

## References

- [1] Alibaba Qwen Team (2025) Qwen3-Max. Note: [https://qwen.ai/research](https://qwen.ai/research)
- [2] Anthropic (2025) Claude code. Note: [https://www.anthropic.com/claude-code](https://www.anthropic.com/claude-code)
- [3] Anthropic (2026) Claude opus 4.7 system card. Note: [https://www.anthropic.com/claude/opus](https://www.anthropic.com/claude/opus)
- [4] S. Bhattacharyya, S. Jha, K. Tharakunnel, and J. C. Westland (2011) Data mining for credit card fraud: a comparative study. Decision support systems, pp. 602–613.
- [5] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, pp. 321–357.
- [6] T. Chen and C. Guestrin (2016) Xgboost: a scalable tree boosting system. In KDD, pp. 785–794.
- [7] Y. Chen, L. Wang, and X. Xie (2025) TLK: an industrial internet of things attack detection method based on transformer-lstm-kelm. In CCNS, pp. 61–65.
- [8] D. Cheng, Y. Zou, S. Xiang, and C. Jiang (2025) Graph neural networks for financial fraud detection: a review. Frontiers of Computer Science, pp. 199609.
- [9] A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi (2017) Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE transactions on neural networks and learning systems, pp. 3784–3797.
- [10] A. Dal Pozzolo, O. Caelen, Y. Le Borgne, S. Waterschoot, and G. Bontempi (2014) Learned lessons in credit card fraud detection from a practitioner perspective. Expert systems with applications, pp. 4915–4928.
- [11] J. Davis and M. Goadrich (2006) The relationship between precision-recall and roc curves. In ICML, pp. 233–240.
- [12] DeepSeek-AI (2025) DeepSeek-V3.2. Note: [https://www.deepseek.com](https://www.deepseek.com/)
- [13] J. Demšar (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine learning research, pp. 1–30.
- [14] J. Ding and H. Zhou (2026) Telecom fraud detection based on large language models: a multi-role, multi-layer prompting strategy. Applied Sciences, pp. 544.
- [15] Y. Dou, Z. Liu, L. Sun, Y. Deng, H. Peng, and P. S. Yu (2020) Enhancing graph neural network-based fraud detectors against camouflaged fraudsters. In CIKM, pp. 315–324.
- [16] N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. J. Smola (2020) AutoGluon-tabular: robust and accurate automl for structured data. arXiv preprint arXiv:2003.06505.
- [17] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter (2015) Efficient and robust automated machine learning. In NeurIPS.
- [18] Global Anti-Scam Alliance and Feedzai (2025) The global state of scams 2025 report. Note: [https://www.gasa.org/](https://www.gasa.org/)
- [19] L. Grinsztajn, E. Oyallon, and G. Varoquaux (2022) Why do tree-based models still outperform deep learning on typical tabular data? In NeurIPS, pp. 507–520.
- [20] P. Grover, J. Xu, J. Tittelfitz, A. Cheng, Z. Li, J. Zablocki, J. Liu, and H. Zhou (2022) Fraud dataset benchmark and applications. arXiv preprint arXiv:2208.14417.
- [21] H. He and E. A. Garcia (2009) Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, pp. 1263–1284.
- [22] J. He, H. Zhang, Y. Xiao, W. Guo, S. Yao, and R. Liu (2026) FACTGUARD: event-centric and commonsense-guided fake news detection. In AAAI, pp. 363–371.
- [23] S. Hong, Y. Lin, B. Liu, B. Liu, B. Wu, C. Zhang, D. Li, J. Chen, J. Zhang, J. Wang, et al. (2025) Data interpreter: an llm agent for data science. In ACL, pp. 19796–19821.
- [24] J. Jurgovsky, M. Granitzer, K. Ziegler, S. Calabretto, P. Portier, L. He-Guelton, and O. Caelen (2018) Sequence classification for credit-card fraud detection. Expert systems with applications, pp. 234–245.
- [25] Y. Liu, X. Ao, Z. Qin, J. Chi, J. Feng, H. Yang, and Q. He (2021) Pick and choose: A gnn-based imbalanced learning approach for fraud detection. In WWW, pp. 3168–3177.
- [26] E. Lopez-Rojas, A. Elmir, and S. Axelsson (2016) PaySim: a financial mobile money simulator for fraud detection. In EMSS, pp. 249–255.
- [27] Y. Lucas and J. Jurgovsky (2020) Credit card fraud detection using machine learning: a survey. arXiv preprint arXiv:2010.06479.
- [28] B. W. Matthews (1975) Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, pp. 442–451.
- [29] Meta AI (2024) LLaMA-3.3-70B. Note: [https://www.llama.com](https://www.llama.com/)
- [30] OpenAI (2026) GPT-5.4. Note: [https://openai.com](https://openai.com/)
- [31] A. D. Pozzolo, O. Caelen, R. A. Johnson, and G. Bontempi (2015) Calibrating probability with undersampling for unbalanced classification. In SSCI, pp. 159–166.
- [32] R. Pryzant, D. Iter, J. Li, Y. T. Lee, C. Zhu, and M. Zeng (2023) Automatic prompt optimization with "gradient descent" and beam search. In EMNLP, pp. 7957–7968.
- [33] W. N. Robinson and A. Aria (2018) Sequential fraud detection for prepaid cards using hidden markov model divergence. Expert Systems with Applications, pp. 235–251.
- [34] M. Schmitt and I. Flechais (2024) Digital deception: generative artificial intelligence in social engineering and phishing. Artificial Intelligence Review, pp. 324.
- [35] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. In NeurIPS.
- [36] R. Shwartz-Ziv and A. Armon (2022) Tabular data: deep learning is not all you need. Information fusion, pp. 84–90.
- [37] R. S. Sutton, A. G. Barto, et al. (1998) Reinforcement learning: an introduction. MIT press Cambridge.
- [38] C. Wang, Q. Wu, M. Weimer, and E. Zhu (2021) FLAML: A fast and lightweight automl library. In MLSys.
- [39] D. Wang, J. Lin, P. Cui, Q. Jia, Z. Wang, Y. Fang, Q. Yu, J. Zhou, S. Yang, and Y. Qi (2019) A semi-supervised graph attentive network for financial fraud detection. In ICDM, pp. 598–607.
- [40] M. Weber, G. Domeniconi, J. Chen, D. K. I. Weidele, C. Bellei, T. Robinson, and C. E. Leiserson (2019) Anti-money laundering in bitcoin: experimenting with graph convolutional networks for financial forensics. arXiv preprint arXiv:1908.02591.
- [41] Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang (2023) AutoGen: enabling next-gen LLM applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155.
- [42] S. Xiang, M. Zhu, D. Cheng, E. Li, R. Zhao, Y. Ouyang, L. Chen, and Y. Zheng (2023) Semi-supervised credit card fraud detection via attribute-driven graph representation. In AAAI, pp. 14557–14565.
- [43] F. Xiao, S. Cai, G. Chen, H. V. Jagadish, B. C. Ooi, and M. Zhang (2024) VecAug: unveiling camouflaged frauds with cohort augmentation for enhanced detection. In KDD, pp. 6025–6036.
- [44] S. Xuan, G. Liu, Z. Li, L. Zheng, S. Wang, and C. Jiang (2018) Random forest for credit card fraud detection. In ICNSC, pp. 1–6.
- [45] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. In ICLR.
- [46] N. Yousefi, M. Alaghband, and I. Garibay (2019) A comprehensive survey on machine learning techniques and user authentication approaches for credit card fraud detection. arXiv preprint arXiv:1912.02629.
