MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning

arXiv cs.CL Papers

Summary

This paper presents MIPIAD, a multilingual defense framework against indirect prompt injection attacks using a hybrid of Qwen2.5-based classifiers and TF-IDF features with meta-ensemble learning. It demonstrates strong performance on English and Bangla benchmarks, achieving high F1 and AUROC scores while reducing cross-lingual gaps.

arXiv:2605.07269v1 Announce Type: new Abstract: Indirect prompt injection remains a persistent weakness in retrieval-augmented and tool-using LLM systems, and the problem becomes harder to characterise in multilingual settings. We present MIPIAD, a defense framework evaluated on English and Bangla that combines a sequence classifier fine-tuned from Qwen2.5-1.5B via LoRA (XLPID), TF-IDF lexical features, and validation-tuned ensembling through late fusion, stacking, and gradient boosting. The framework is evaluated on a synthetic benchmark built from BIPIA(Yi et al., 2023) templates spanning five task families -- email, table, QA, abstract, and code-comprising over 1.43 million generated samples, with train and test splits using mutually exclusive attack categories. Across the experiments, lexical signals prove strong (TF-IDF+SVM F1=0.77), and the hybrid XLPID+TF-IDF ensemble achieves the best overall F1 (0.9205) while the Boosting Ensemble achieves the best AUROC (0.9378). Ensemble methods consistently reduce the English-Bangla cross-lingual gap relative to standalone neural models. The pipeline is designed for extensibility: NLLB-200 supports over 200 languages and XLPID's multilingual backbone can be retargeted to additional languages without architectural changes; empirical validation is currently limited to English and Bangla
Original Article Export to Word Export to PDF
View Cached Full Text

Cached at: 05/11/26, 06:58 AM

# Multilingual Indirect Prompt Injection Attack Defense with Qwen–TF-IDF Hybrid and Meta-Ensemble Learning
Source: [https://arxiv.org/html/2605.07269](https://arxiv.org/html/2605.07269)
###### Abstract

Indirect prompt injection remains a persistent weakness in retrieval\-augmented and tool\-using LLM systems, and the problem becomes harder to characterise in multilingual settings\. We presentMIPIAD, a defense framework evaluated on English and Bangla that combines a sequence classifier fine\-tuned fromQwen2\.5\-1\.5Bvia LoRA \(XLPID\), TF\-IDF lexical features, and validation\-tuned ensembling through late fusion, stacking, and gradient boosting\. The framework is evaluated on a synthetic benchmark built from BIPIA\(Yiet al\.,[2023](https://arxiv.org/html/2605.07269#bib.bib20)\)templates spanning five task families—email, table, QA, abstract, and code—comprising over 1\.43 million generated samples, with train and test splits using mutually exclusive attack categories\. Across the experiments, lexical signals prove unexpectedly strong \(TF\-IDF\+SVM F1=0\.77\), and the hybrid XLPID\+TF\-IDF ensemble achieves the best overall F1 \(0\.9205\) while the Boosting Ensemble achieves the best AUROC \(0\.9378\)\. Ensemble methods consistently reduce the English–Bangla cross\-lingual gap relative to standalone neural models\. The pipeline is designed for extensibility: NLLB\-200 supports over 200 languages and XLPID’s multilingual backbone can be retargeted to additional languages without architectural changes; empirical validation is currently limited to English and Bangla\.

## 1Introduction

Prompt injection can make an LLM ignore the user’s intent, expose sensitive information, or carry out unsafe tool actions\. The harder case is*indirect*prompt injection, where malicious instructions are hidden inside retrieved documents, emails, tables, web pages, or code rather than in the user’s query\(Perez and Ribeiro,[2022](https://arxiv.org/html/2605.07269#bib.bib12)\)\.

That setting is still underexplored once the input stream is multilingual\. In practice, systems see a mix of languages, domains, and formatting styles, so a defense that works only in one language or one surface is not enough\. This paper studies that more realistic setting and presentsMIPIAD\(Multilingual Indirect Prompt Injection Attack Defense\), a full pipeline for cross\-lingual detection and evaluation\. MIPIAD is empirically validated on English and Bangla; the underlying components—NLLB\-200 translation and a multilingual LLM backbone—are designed to extend to additional languages without architectural modification\.

We make four contributions:

- •A unified multilingual data pipeline that generates English/Bangla indirect prompt\-injection samples across five tasks, forming a massive dataset of over 1\.43M templates\.
- •XLPID, a cross\-lingual prompt injection detector utilizing efficient Low\-Rank Adaptation \(LoRA\) over robust LLM backbones\.
- •Meta\-ensemble strategies \(stacking and boosting\) integrating LLM\-based probability metrics with explicit TF\-IDF lexical priors\.
- •Strong baseline comparisons showing that hybridization outperforms isolated neural models \(F1: 0\.9205 vs\. 0\.8939 for the best standalone, a 2\.7\-point gain\)\.

## 2Related Work

Indirect Prompt Injection\.Early work established prompt injection as a major security threat\(Perez and Ribeiro,[2022](https://arxiv.org/html/2605.07269#bib.bib12)\), which quickly proved highly applicable to real\-world Retrieval\-Augmented Generation \(RAG\) and LLM\-integrated agent applications via indirect injections\(Greshakeet al\.,[2023](https://arxiv.org/html/2605.07269#bib.bib15)\)\. To standardize evaluation, researchers introduced tailored benchmarks such as LLM\-PIRATE\(Ramakrishnaet al\.,[2024](https://arxiv.org/html/2605.07269#bib.bib16)\)for comprehensive evaluations and LLMail\-Inject\(Abdelnabiet al\.,[2025](https://arxiv.org/html/2605.07269#bib.bib19)\)for realistic, adaptive threat scenarios\.

Defensive Mechanisms\.Defenses fall into three families\.*Input\-filtering*: PromptGuard\(Meta AI,[2024](https://arxiv.org/html/2605.07269#bib.bib22)\)screens inputs with a fine\-tuned BERT classifier; SpotLight\(Hineset al\.,[2024](https://arxiv.org/html/2605.07269#bib.bib23)\)injects formatting markers into trusted context to help the LLM distinguish it from retrieved content\.*Instruction\-isolation*: InstructDetector\(Wenet al\.,[2025](https://arxiv.org/html/2605.07269#bib.bib21)\)identifies instruction states to limit injection blast radius; CachePrune\(Wanget al\.,[2025](https://arxiv.org/html/2605.07269#bib.bib17)\)prunes malicious context via neuron\-level attribution\.*Output\-smoothing*: SmoothLLM\(Robeyet al\.,[2023](https://arxiv.org/html/2605.07269#bib.bib24)\)applies input perturbation with majority voting, though at high inference cost and without evaluation on indirect retrieval injection; MELON\(Zhuet al\.,[2025](https://arxiv.org/html/2605.07269#bib.bib18)\)protects agentic tool\-use trajectories\. MIPIAD belongs to the input\-filtering family, but uniquely combines lexical and neural signals, spans five task families, and operates on multilingual inputs\. None of the above defenses has been evaluated on Bangla or on large\-scale multilingual indirect injection benchmarks; Section[5\.4](https://arxiv.org/html/2605.07269#S5.SS4)provides the first cross\-lingual end\-to\-end victim evaluation for this class of defense\.

Multilingual Safety Gap\.Most existing literature still assumes an English\-heavy setting\. Cross\-lingual transfer is less understood for prompt\-injection defense, where translations and domain shifts can dilute injection signatures and increase the false negative rate\. To our knowledge, prior work has not united the*multilingual*and*indirect*axes into a generalized detection framework\.

## 3MIPIAD Benchmark

### 3\.1Threat Model and Task Definition

We study binary detection over text inputs where the labely∈\{0,1\}y\\in\\\{0,1\\\}indicates whether an indirect injection is present\. Each sample is language\-specific \(EN or BN\) and includes metadata for task, attack type, and insertion position\. The defender’s goal is to accurately detect embedded instructions before downstream victim LLMs consume them\.

English Attack Templates\(BIPIA\(Yiet al\.,[2023](https://arxiv.org/html/2605.07269#bib.bib20)\)\)NLLB\-200 Translation\(EN to BN\)Bilingual Attack TextsTask Contexts \(EN\)\(Email, Tables, QA, Abstract, Code\)NLLB\-200 Translation\(EN to BN\)Bilingual ContextsSample Composition\(Inject at Start/Mid/End\)MIPIAD Dataset\(1\.43M Raw Samples\)

Figure 1:MIPIAD data generation pipeline\. English strings and contexts are translated to Bangla using NLLB\-200, then contextually composed to create a balanced bilingual dataset\.
### 3\.2Multilingual Sample Construction

Figure[1](https://arxiv.org/html/2605.07269#S3.F1)illustrates the data engineering pipeline\. MIPIAD generation follows three main steps:

1. 1\.Translate attack templates from English to Bangla with Meta’s NLLB\-200\(NLLB Teamet al\.,[2022](https://arxiv.org/html/2605.07269#bib.bib13)\)\.
2. 2\.Translate task contexts across five families \(email, table, QA, abstract, code\) into Bangla\.
3. 3\.Compose poisoned samples by injecting attack texts at the start, middle, or end of the contexts; generate benign samples from clean contexts\.

To ensure robustness, our data generator yields an extensive matrix: 15 unique text attack categories and 10 code\-specific attack categories, each with 5 variants\. Combining these with 3 insertion positions and 2 languages \(EN/BN\) results in exactly 1,431,400 raw samples\.

Crucially, to prevent cross\-lingual data leakage, train and test splits are explicitly partitioned at the context level\. All language translations for a single context are securely bounded within identical splits\. Furthermore, to evaluate genuine generalization capabilities rather than rote memorization, the training and testing sets utilize completely mutually exclusive attack categories and variants\.

## 4Methodology

### 4\.1XLPID Architecture

The core neural detector of our framework isXLPID\(Cross\-Lingual Prompt Injection Detector\)\. Figure[2](https://arxiv.org/html/2605.07269#S4.F2)provides a structural overview of the classification scheme\.

Input Document \(EN/BN\)LLM Backbone \(e\.g\., Qwen\) Weights frozen in bf16LoRAfp32Context PoolingSequence Classification HeadInjection Logits \(ptp\_\{t\}\)

Figure 2:XLPID architecture utilizing parameter\-efficient LoRA adapters alongside a sequential classification head\.XLPID is a direct sequence\-classification wrapper over a frozen LLM backbone, using the backbone’s built\-in classification head \(context pooling layer followed by a two\-label linear projection\)\. XLPID supports multiple backbone families including Qwen2\.5\(Qwen Team,[2025](https://arxiv.org/html/2605.07269#bib.bib14)\)and DeBERTa\(Heet al\.,[2021](https://arxiv.org/html/2605.07269#bib.bib11)\); all results in this paper useQwen/Qwen2\.5\-1\.5Bas the backbone\. Base weights are kept inbfloat16to reduce VRAM while LoRA adapters \(rank=16,α\\alpha=32\) targetingq\_projandv\_projare trained infloat32\.

### 4\.2Meta\-Ensembles and Lexical Baselines

We compare XLPID against isolated contextual backbones \(XLM\-RoBERTa\(Liuet al\.,[2019](https://arxiv.org/html/2605.07269#bib.bib10)\), mBERT\(Devlinet al\.,[2019](https://arxiv.org/html/2605.07269#bib.bib9)\)\) and powerful lexical baselines \(TF\-IDF \+ LR, TF\-IDF \+ SVM\) comprising 10,000 top n\-gram features \(sizes 1–3\)\.

Furthermore, we evaluate two meta\-ensembles synthesizing transformer and lexical streams into a unified prediction:

- •Hybrid late fusion: Combining XLPID transformer probabilities \(ptp\_\{t\}\) and TF\-IDF probabilities \(plp\_\{l\}\) viap=α​pt\+\(1−α\)​plp=\\alpha p\_\{t\}\+\(1\-\\alpha\)p\_\{l\}\. The mixing weightα\\alphais selected by grid search over 21 evenly spaced values in\[0,1\]\[0,1\]\(i\.e\.α∈\{0\.00,0\.05,…,1\.00\}\\alpha\\in\\\{0\.00,0\.05,\\ldots,1\.00\\\}\), evaluated on the held\-out validation set \(10% of training data\) and maximising the composite criterion\(F1,AUROC\)\(\\text\{F1\},\\text\{AUROC\}\)lexicographically\. The bestα\\alphais then locked in before any test\-set evaluation, ensuring no test\-set leakage into fusion weight selection\.
- •Meta\-ensembles: Logistic regression stacking and gradient\-boosted trees processing the isolated base\-model probabilities\.

### 4\.3Evaluation Pipeline

#### Overview\.

The end\-to\-end evaluation spans four stages, shown in Figure[3](https://arxiv.org/html/2605.07269#S4.F3)\. Stage 0 runs the defense classifier over all samples before any victim is loaded\. Stage 1 feeds \(potentially guarded\) prompts to victim LLMs\. Stage 2 scores responses with an ensemble of judge LLMs\. Stage 3 aggregates per\-sample scores into ASR, BU, UA, and CLP\.

Stage 2 — Judge ensembleMIPIAD samples\(attack \+ benign\)Stage 0Defense Classifier\(XLPID ensemble\)flaggedmalicious?prependsecurity noticeStage 1Victim LLMℳv\\mathcal\{M\}\_\{v\}responsesrir\_\{i\}Judge𝒥1\\mathcal\{J\}\_\{1\}Judge𝒥2\\mathcal\{J\}\_\{2\}Judge𝒥K\\mathcal\{J\}\_\{K\}majority vote\(Eq\.[3](https://arxiv.org/html/2605.07269#S4.E3)\)ASR, BU, UA, CLP\(Eqs\.[4](https://arxiv.org/html/2605.07269#S4.E4)–[7](https://arxiv.org/html/2605.07269#S4.E7)\)yesnoPhase 0Phase 1Phase 2Phase 3

Figure 3:BIPIA\(Yiet al\.,[2023](https://arxiv.org/html/2605.07269#bib.bib20)\)end\-to\-end evaluation pipeline\. The defense classifier \(Stage 0\) optionally guards prompts before the victim LLM \(Stage 1\)\. Multiple judge LLMs score responses independently \(Stage 2\); their verdicts are combined by majority vote\. Stage 3 aggregates final metrics\.
#### Prompt construction\.

Each sample\(c,ℓ,τ\)\(c,\\ell,\\tau\)— contextcc, languageℓ∈\{EN,BN\}\\ell\\in\\\{\\text\{EN\},\\text\{BN\}\\\}, taskτ\\tau— is wrapped in a task\-specific system prompt withccinjected as external content:

pi=sysτ​\(ci,ℓ\)∥usrτ​\(ℓ\)p\_\{i\}=\\texttt\{sys\}\_\{\\tau\}\(c\_\{i\},\\ell\)\\;\\\|\\;\\texttt\{usr\}\_\{\\tau\}\(\\ell\)\(1\)where∥\\\|denotes concatenation\. When Stage 0 flagscic\_\{i\}as malicious, a bilingual security notice is prepended tosysτ\\texttt\{sys\}\_\{\\tau\}\.

#### ASR judging\.

Each judge𝒥k\\mathcal\{J\}\_\{k\}receives a category\-specific prompt and the victim responserir\_\{i\}, returning a ternary verdict:

vk\(i\)=parse​\(𝒥k​\(judge\_asr​\(ri,cati,ℓ\)\)\)∈\{1,0,−1\}\\begin\{split\}v\_\{k\}^\{\(i\)\}&=\\texttt\{parse\}\\\!\\left\(\\mathcal\{J\}\_\{k\}\\\!\\left\(\\texttt\{judge\\\_asr\}\(r\_\{i\},\\text\{cat\}\_\{i\},\\ell\)\\right\)\\right\)\\\\ &\\in\\\{1,0,\-1\\\}\\end\{split\}\(2\)where1=YES1=\\text\{YES\},0=NO0=\\text\{NO\},−1=UNKNOWN\-1=\\text\{UNKNOWN\}\. For utility, judge𝒥k\\mathcal\{J\}\_\{k\}comparesrir\_\{i\}to a reference answer \(reference\-based\) or directly evaluates helpfulness \(reference\-free\), yieldinguk\(i\)∈\{1,0,−1\}u\_\{k\}^\{\(i\)\}\\in\\\{1,0,\-1\\\}\.

#### Majority vote\.

LetV\(i\)=\{vk\(i\):vk\(i\)≠−1\}V^\{\(i\)\}=\\\{v\_\{k\}^\{\(i\)\}:v\_\{k\}^\{\(i\)\}\\neq\-1\\\}be the valid verdicts for sampleii:

v^\(i\)=\{1if​∑V\(i\)\>12​\|V\(i\)\|0if​V\(i\)≠∅​and​∑V\(i\)≤12​\|V\(i\)\|−1if​V\(i\)=∅\\hat\{v\}^\{\(i\)\}=\\begin\{cases\}1&\\text\{if \}\\sum V^\{\(i\)\}\>\\tfrac\{1\}\{2\}\|V^\{\(i\)\}\|\\\\ 0&\\text\{if \}V^\{\(i\)\}\\neq\\emptyset\\text\{ and \}\\sum V^\{\(i\)\}\\leq\\tfrac\{1\}\{2\}\|V^\{\(i\)\}\|\\\\ \-1&\\text\{if \}V^\{\(i\)\}=\\emptyset\\end\{cases\}\(3\)Ties and minority\-YES outcomes both resolve to0, favouring precision over recall\.

#### Victim metrics\.

Let𝒜\\mathcal\{A\}be the set of attack samples andℬ\\mathcal\{B\}the benign set; let⋅¯\\bar\{\\cdot\}denote the mean over valid \(≠−1\{\\neq\}\{\-1\}\) scores\.

ASR=v¯\(𝒜\)\\displaystyle=\\bar\{v\}^\{\(\\mathcal\{A\}\)\}\(4\)BU=u¯\(ℬ\)\\displaystyle=\\bar\{u\}^\{\(\\mathcal\{B\}\)\}\(5\)UA=w¯\(𝒜\)\\displaystyle=\\overline\{w\}^\{\(\\mathcal\{A\}\)\}\(6\)wherew\(i\)=1w^\{\(i\)\}=1iff the victim both resists the attack \(v^\(i\)=0\\hat\{v\}^\{\(i\)\}=0\) and completes the task \(u^\(i\)=1\\hat\{u\}^\{\(i\)\}=1\);w\(i\)=0w^\{\(i\)\}=0otherwise;w\(i\)=−1w^\{\(i\)\}=\-1if the utility verdict is unresolvable \(excluded from the mean\)\. For example, if a victim ignores an injected “reply to attacker@evil\.com” command and correctly summarises the email,v^\(i\)=0\\hat\{v\}^\{\(i\)\}=0andu^\(i\)=1\\hat\{u\}^\{\(i\)\}=1, yieldingw\(i\)=1w^\{\(i\)\}=1\.

#### Cross\-lingual parity\.

For each metricm∈\{ASR,BU,UA\}m\\in\\\{\\text\{ASR\},\\text\{BU\},\\text\{UA\}\\\}and taskτ\\tauwe define theCross\-Lingual Parity\(CLP\) score:

CLPm,τ=1−\|mτEN−mτBN\|\\text\{CLP\}\_\{m,\\tau\}=1\-\\bigl\|m\_\{\\tau\}^\{\\text\{EN\}\}\-m\_\{\\tau\}^\{\\text\{BN\}\}\\bigr\|\(7\)A value of11indicates perfect between\-language parity; lower values signal language\-asymmetric behaviour\.Interpretation caveat:CLP measures parity, not absolute performance\. A model that fails equally in both languages \(e\.g\.,mτEN=mτBN=0m\_\{\\tau\}^\{\\text\{EN\}\}=m\_\{\\tau\}^\{\\text\{BN\}\}=0\) scores CLP=1\.0=1\.0\. CLP should therefore be read alongside absolute per\-language scores rather than in isolation\.

The benchmark measures defense fidelity via Accuracy, F1, AUROC, and AUPRC on the detection task; downstream robustness is captured by ASR, BU, UA, and CLP \(Cross\-Lingual Parity\)\.

## 5Experiments and Results

### 5\.1Implementation Details

XLPID uses AdamW \(lr=2×10−52\\times 10^\{\-5\}, batch size 8, weight decay0\.010\.01\), sequence length 256, dropout 0\.3, with early stopping \(patience 10\)\. The TF\-IDF vectorizer uses 10,000 character n\-grams \(sizes 1–3\) without language\-specific tokenization for Bengali\.

Initial experiments revealed standard models exploited the 225:1 attack\-to\-benign class imbalance \(1\.43M total samples\)\.

#### Data handling\.

We downsample attacks to 2:1 \(benign:attack\) for training with a 10% validation split; weighted cross\-entropy stabilizes gradients\. The test set \(all benign \+ 2,000 attacks/task\) uses a 10:1 ratio for rigorous attack characterization \(false\-positive rates under natural distribution discussed in Limitations\)\.

### 5\.2Main Results

Table[1](https://arxiv.org/html/2605.07269#S5.T1)establishes test\-set classification results\.

Table 1:Classification results on the MIPIAD test set \(aggregate over English and Bangla\)\. CLP = Cross\-Lingual Parity=1−\|F​1EN−F​1BN\|=1\-\|F1\_\{\\text\{EN\}\}\-F1\_\{\\text\{BN\}\}\|, computed from per\-language F1 scores reported in Figure[8](https://arxiv.org/html/2605.07269#A1.F8); higher is better for all metrics\.Note:CLP measures between\-language parity, not absolute performance; a model failing equally in both languages would score CLP=1\.0=1\.0\. The test set contains all benign samples and up to 2,000 attack samples per task \(attack\-to\-benign ratio≈\\approx10:1\)\. † Stacking Ensemble’s low Recall \(0\.70\) is a known limitation: missed attacks are a higher risk for a security tool than false alarms; see analysis\.
### 5\.3Analysis

Two core findings emerge from Table[1](https://arxiv.org/html/2605.07269#S5.T1)\(per\-language breakdowns are in Figure[8](https://arxiv.org/html/2605.07269#A1.F8)\):

#### Lexical signals are competitive\.

TF\-IDF\+LR \(F1=0\.73\) and TF\-IDF\+SVM \(F1=0\.77\) suggest synthetic templates carry detectable lexical artifacts\.

#### Hybridization outperforms neural models alone\.

Hybrid \(XLPID\+TF\-IDF, F1=0\.9205\) and Boosting Ensemble \(AUROC=0\.9378\) both exceed XLPID solo \(F1=0\.8939\), repairing neural blind spots in low\-frequency patterns\.

#### Stacking Ensemble recall trade\-off\.

Highest CLP \(0\.9947\) but lowest Recall \(0\.7025\)—a 30% miss rate unacceptable for security\. Practitioners should prefer Hybrid or Boosting\.

#### Cross\-lingual robustness\.

XLPID shows widest EN–BN gap \(CLP=0\.9322\)\. Hybrid/Boosting reduce this to 0\.9479; multilingual encoders achieve near\-parity \(0\.9960\+\)\. Ensemble methods consistently narrow cross\-lingual gaps\.

### 5\.4End\-to\-End Victim Evaluation

We evaluate the downstream impact of the MIPIAD defense on seven victim LLMs spanning diverse architecture families and language capabilities\. For each victim we compare two conditions:*no defense*\(Stage 0 disabled\) and*with defense*\(MIPIAD Hybrid ensemble, threshold=0\.5\)\. All victim responses are scored by a majority\-vote judge ensemble \(Section[4\.3](https://arxiv.org/html/2605.07269#S4.SS3.SSS0.Px1)\)\. We report Attack Success Rate \(ASR\), Benign Utility \(BU\), Under\-Attack Utility \(UA\), and Cross\-Lingual Parity of ASR \(CLPASR\{\}\_\{\\text\{ASR\}\}\), all averaged over five tasks and both languages unless stated otherwise\.

#### Defense reduces ASR\.

Figure[4](https://arxiv.org/html/2605.07269#S5.F4)and Table[2](https://arxiv.org/html/2605.07269#S5.T2)show MIPIAD lowers ASR in both languages for all tested victims\. Largest reductions: Qwen3\.5\-9B \(−0\.30\-0\.30EN,−0\.12\-0\.12BN\) and BanglaLLaMA\-3\-8B \(−0\.16\-0\.16EN\)\.

![Refer to caption](https://arxiv.org/html/2605.07269v1/plots/paper_fig3_defense_asr.png)Figure 4:Attack Success Rate \(ASR\) per victim LLM under four conditions: no defense on English inputs \(red\), no defense on Bangla inputs \(orange\), MIPIAD defense on English inputs \(blue\), and MIPIAD defense on Bangla inputs \(green\)\.Δ\\Deltavalues above defense bars show the absolute ASR reduction relative to the corresponding no\-defense baseline\. Lower ASR is better\. Results use the majority\-vote judge ensemble, averaged over five tasks\.Table 2:End\-to\-end victim evaluation: ASR↓\\downarrow, BU↑\\uparrow, UA↑\\uparrow, and CLPASR\{\}\_\{\\text\{ASR\}\}↑\\uparrowfor each victim LLM, without defense \(ND\) and with MIPIAD defense \(D\)\. BU and UA are averaged over EN and BN and all tasks\.Δ\\DeltaASREN/BN\{\}\_\{\\text\{EN/BN\}\}= ASRND\{\}\_\{\\text\{ND\}\}−\-ASRD\{\}\_\{\\text\{D\}\}\(positive = defense benefit\)\. Bold marks the largest\|Δ​ASR\|\|\\Delta\\text\{ASR\}\|in each language column\. All scores from the majority\-vote judge ensemble\.
#### Utility preserved\.

Figure[5](https://arxiv.org/html/2605.07269#S5.F5)shows ASR reductions with minimal utility cost\. BU remains within±0\.05\\pm 0\.05for most victims; UA improves for five of seven\.

![Refer to caption](https://arxiv.org/html/2605.07269v1/plots/paper_fig4_defense_tradeoff.png)Figure 5:Defense trade\-off:Δ​ASR\\Delta\\text\{ASR\}\(ASR reduction; rightward = more effective\) vs\.Δ​BU\\Delta\\text\{BU\}\(utility change; upward = utility preserved or improved\)\. Points in the shaded upper\-right region represent ideal outcomes\. Most victims achieve meaningful ASR reductions with near\-zero utility cost\.
#### Cross\-lingual asymmetry remains\.

CLPASR\{\}\_\{\\text\{ASR\}\}stable/improves for five victims, except Qwen3\.5\-9B where EN effectiveness \(−0\.30\-0\.30\) exceeds BN \(−0\.12\-0\.12\), suggesting certain Bangla patterns evade detection—future work should address language\-balanced training\.

#### Per\-category attack breakdown\.

Table[3](https://arxiv.org/html/2605.07269#S5.T3)reports ASR averaged across all seven victims and both languages, stratified by attack category\. The highest baseline ASR belongs toEmoji Substitution\(0\.613\),Instruction\(0\.548\), andCryptocurrency Mining/Substitution Ciphers\(both 0\.524\), indicating that encoding\-obfuscation and direct\-instruction vectors are the most potent attack surfaces\. The defense achieves its largest reductions againstMalware Distribution\(Δ=0\.226\\Delta\{=\}0\.226\),Substitution Ciphers\(Δ=0\.214\\Delta\{=\}0\.214\), andInformation Dissemination\(Δ=0\.211\\Delta\{=\}0\.211\), but leavesEmoji Substitutionat 0\.458 andCryptocurrency Miningat 0\.405 — the two most stubborn residual vulnerabilities\. Among code\-task categories,Keylogging\(defense ASR 0\.357\) andExploiting System Vulnerabilities\(0\.333\) remain elevated, whileDumpster DivingandData Eavesdroppingshow near\-zero delta, suggesting the security\-notice prepend provides little deterrence against low\-level system\-enumeration commands\. Figure[6](https://arxiv.org/html/2605.07269#S5.F6)shows UA broken down by task and victim, before and after applying the defense\.

![Refer to caption](https://arxiv.org/html/2605.07269v1/plots/paper_fig5_ua_heatmap.png)Figure 6:Under\-Attack Utility \(UA\) by task and victim LLM, without defense \(left\) and with MIPIAD defense \(right\)\. Each cell is UA averaged over English and Bangla\. Higher values indicate the victim both resisted the injection and completed the task\.Table 3:Per\-category ASR before and after MIPIAD defense, averaged over all 7 victim LLMs and both EN/BN\. General categories span email, QA, abstract, and table tasks; code categories span code tasks only\.Δ=ASRND−ASRD\\Delta=\\text\{ASR\}\_\{\\text\{ND\}\}\-\\text\{ASR\}\_\{\\text\{D\}\}\(positive = defense benefit\)\. Top\-10 general categories shown; code categories listed separately\.

## 6Conclusion

We presented MIPIAD, a multilingual defense framework combining XLPID \(LoRA\-fine\-tuned Qwen2\.5\-1\.5B\) with TF\-IDF lexical features and meta\-ensemble methods, evaluated on a 1\.43M\-sample benchmark spanning English and Bangla\. Hybridization is the central takeaway: lexical signals prove unexpectedly competitive in isolation, and combining them with neural probabilities \(Hybrid F1=0\.9205=0\.9205, Boosting AUROC=0\.9378=0\.9378\) consistently outperforms either stream alone while narrowing the EN–BN cross\-lingual gap\. End\-to\-end victim evaluation confirms ASR reductions for all seven tested LLMs in both languages with near\-zero utility cost\. The main limitation is closed\-loop evaluation on BIPIA\-derived templates; future work should address adversarial rephrasing robustness, and realistic threshold calibration\.

## Limitations

#### Synthetic distribution and deployment mismatch\.

All samples are generated from BIPIA templates, so the detector is evaluated on the same distribution it was trained on\. High F1 on this benchmark may not transfer to natural injection attempts from adversaries with access to adaptive rephrasing or encoding tricks\.

The test set in this work uses an attack\-to\-benign ratio of≈\\approx10:1, which is far higher than real deployment traffic \(estimated 225:1 or higher\)\. Under the natural distribution, high Precision is critical: even a 1% false\-positive rate would flag the overwhelming majority of legitimate queries\.

#### Translation quality\.

Bangla attack samples are produced by machine\-translating English templates with NLLB\-200\(NLLB Teamet al\.,[2022](https://arxiv.org/html/2605.07269#bib.bib13)\)\. MT output may preserve injection directives less faithfully than human\-authored Bangla attacks, and machine\-translated texts can carry lexical artifacts that inflate classifier performance\.

#### Language coverage\.

The empirical validation covers English and Bangla only\. The “multilingual” framing reflects the architectural extensibility of the pipeline \(NLLB\-200 supports over 200 languages; XLPID’s backbone handles Unicode text natively\) rather than a claim of broad cross\-lingual generalization\. Extension to lower\-resource languages requires separate evaluation\.

## Ethics Statement

This work studies indirect prompt injection strictly from a defensive perspective\. All attack templates are derived from the publicly available BIPIA benchmark\(Yiet al\.,[2023](https://arxiv.org/html/2605.07269#bib.bib20)\)and are used solely to evaluate detection methods; no attack tooling is released\. The generated dataset contains no personal or sensitive user data\. We follow standard responsible\-disclosure norms and do not publish operational injection payloads beyond what is necessary to reproduce our evaluation\.

## References

- S\. Abdelnabi, A\. Fay, A\. Salem, E\. Zverev, K\. Liao, C\. Liu, C\. Kuo,et al\.\(2025\)LLMail\-inject: a dataset from a realistic adaptive prompt injection challenge\.arXiv preprint arXiv:2506\.09956\.External Links:[Link](https://arxiv.org/abs/2506.09956)Cited by:[§2](https://arxiv.org/html/2605.07269#S2.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.arXiv preprint arXiv:1810\.04805\.External Links:[Link](https://arxiv.org/abs/1810.04805)Cited by:[§4\.2](https://arxiv.org/html/2605.07269#S4.SS2.p1.1)\.
- K\. Greshake, S\. Abdelnabi, S\. Mishra, C\. Endres, T\. Holz, and M\. Fritz \(2023\)Not what you’ve signed up for: compromising real\-world llm\-integrated applications with indirect prompt injection\.arXiv preprint arXiv:2302\.12173\.External Links:[Link](https://arxiv.org/abs/2302.12173)Cited by:[§2](https://arxiv.org/html/2605.07269#S2.p1.1)\.
- P\. He, X\. Liu, J\. Gao, and W\. Chen \(2021\)DeBERTa: decoding\-enhanced bert with disentangled attention\.arXiv preprint arXiv:2006\.03654\.External Links:[Link](https://arxiv.org/abs/2006.03654)Cited by:[§4\.1](https://arxiv.org/html/2605.07269#S4.SS1.p2.1)\.
- K\. Hines, G\. Lopez, M\. Hall, F\. Zarfati, Y\. Zunger, and E\. Kiciman \(2024\)Defending against indirect prompt injection attacks with spotlighting\.InProceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security,External Links:[Link](https://arxiv.org/abs/2403.14720)Cited by:[§2](https://arxiv.org/html/2605.07269#S2.p2.1)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019\)RoBERTa: a robustly optimized bert pretraining approach\.arXiv preprint arXiv:1907\.11692\.External Links:[Link](https://arxiv.org/abs/1907.11692)Cited by:[§4\.2](https://arxiv.org/html/2605.07269#S4.SS2.p1.1)\.
- Meta AI \(2024\)PromptGuard: safe structured generation\.Note:Model release, Hugging FaceExternal Links:[Link](https://huggingface.co/meta-llama/Prompt-Guard-86M)Cited by:[§2](https://arxiv.org/html/2605.07269#S2.p2.1)\.
- NLLB Team, M\. R\. Costa\-jussà, J\. Cross, M\. Äl\-Shedivat,et al\.\(2022\)No language left behind: scaling human\-centered machine translation\.Note:arXiv preprint arXiv:2207\.04672External Links:[Link](https://arxiv.org/abs/2207.04672)Cited by:[item 1](https://arxiv.org/html/2605.07269#S3.I1.i1.p1.1),[Translation quality\.](https://arxiv.org/html/2605.07269#Sx1.SS0.SSS0.Px2.p1.1)\.
- F\. Perez and I\. Ribeiro \(2022\)Ignore previous prompt: attack techniques for language models\.arXiv preprint arXiv:2211\.09527\.External Links:[Link](https://arxiv.org/abs/2211.09527)Cited by:[§1](https://arxiv.org/html/2605.07269#S1.p1.1),[§2](https://arxiv.org/html/2605.07269#S2.p1.1)\.
- Qwen Team \(2025\)Qwen technical reports and model cards\.Note:Technical report and model documentationExternal Links:[Link](https://huggingface.co/Qwen)Cited by:[§4\.1](https://arxiv.org/html/2605.07269#S4.SS1.p2.1)\.
- A\. Ramakrishna, J\. Majmudar, R\. Gupta, and D\. Hazarika \(2024\)LLM\-pirate: a benchmark for indirect prompt injection attacks in large language models\.AdvML\-Frontiers 2024 Workshop\.External Links:[Link](https://openreview.net/forum?id=qzEzXnw4ng)Cited by:[§2](https://arxiv.org/html/2605.07269#S2.p1.1)\.
- A\. Robey, E\. Wong, H\. Hassani, and G\. J\. Pappas \(2023\)SmoothLLM: defending large language models against jailbreaking attacks\.arXiv preprint arXiv:2310\.03684\.External Links:[Link](https://arxiv.org/abs/2310.03684)Cited by:[§2](https://arxiv.org/html/2605.07269#S2.p2.1)\.
- R\. Wang, J\. Wu, Y\. Xia, T\. Yu, R\. Zhang, R\. Rossi, S\. Mitra, L\. Yao, and J\. McAuley \(2025\)CachePrune: neural\-based attribution defense against indirect prompt injection attacks\.arXiv preprint arXiv:2504\.21228\.External Links:[Link](https://arxiv.org/abs/2504.21228)Cited by:[§2](https://arxiv.org/html/2605.07269#S2.p2.1)\.
- T\. Wen, C\. Wang, X\. Yang, H\. Tang, Y\. Xie, L\. Lyu, Z\. Dou, and F\. Wu \(2025\)Defending against indirect prompt injection by instruction detection\.arXiv preprint arXiv:2505\.06311\.External Links:[Link](https://arxiv.org/abs/2505.06311)Cited by:[§2](https://arxiv.org/html/2605.07269#S2.p2.1)\.
- J\. Yi, Y\. Xie, B\. Zhu, K\. Hines, E\. Kiciman, G\. Sun, X\. Xie, and F\. Wu \(2023\)Benchmarking and defending against indirect prompt injection attacks on large language models\.arXiv preprint arXiv:2312\.14197\.Cited by:[Figure 1](https://arxiv.org/html/2605.07269#S3.F1.1.pic1.1.1.1.1.1.1.3),[Figure 3](https://arxiv.org/html/2605.07269#S4.F3),[Ethics Statement](https://arxiv.org/html/2605.07269#Sx2.p1.1)\.
- K\. Zhu, X\. Yang, J\. Wang, W\. Guo, and W\. Y\. Wang \(2025\)MELON: provable defense against indirect prompt injection attacks in ai agents\.InInternational Conference on Machine Learning \(ICML\),External Links:[Link](https://openreview.net/forum?id=gt1MmGaKdZ)Cited by:[§2](https://arxiv.org/html/2605.07269#S2.p2.1)\.

## Appendix APlots

![Refer to caption](https://arxiv.org/html/2605.07269v1/plots/metric_comparison_bars.png)Figure 7:Grouped bar chart comparing Accuracy, Precision, Recall, F1, AUROC, AUPRC, and CLP across all evaluated models\. The Hybrid \(XLPID\+TF\-IDF\) achieves the highest F1 \(0\.9205\); the Boosting Ensemble achieves the highest AUROC \(0\.9378\)\.![Refer to caption](https://arxiv.org/html/2605.07269v1/plots/language_comparison.png)Figure 8:Per\-language \(English vs\. Bangla\) performance breakdown for each model\. Models closer to parity have higher CLP scores; ensemble methods show smaller cross\-lingual gaps than standalone neural baselines\.![Refer to caption](https://arxiv.org/html/2605.07269v1/plots/training_curves.png)Figure 9:Training and validation loss curves for all trained models, showing convergence behavior across epochs\. Early stopping prevents overfitting; loss stabilizes within the first few epochs for most configurations\.![Refer to caption](https://arxiv.org/html/2605.07269v1/plots/training_XLPID_Ours.png)Figure 10:Step\-level training trajectory for XLPID, showing loss and gradient norm evolution within each epoch\. The curve confirms stable LoRA fine\-tuning without divergence despite mixed\-precision cross\-lingual inputs\.

Similar Articles

Understanding prompt injections: a frontier security challenge

OpenAI Blog

OpenAI publishes guidance on prompt injection attacks, a social engineering vulnerability where malicious instructions hidden in web content or documents can trick AI models into unintended actions. The company outlines its multi-layered defense strategy including instruction hierarchy research, automated red-teaming, and AI-powered monitoring systems.

MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text

arXiv cs.CL

This paper introduces MELD, a detector for AI-generated text that uses multi-task learning with auxiliary heads for generator family, attack type, and source domain to improve robustness. MELD achieves strong performance on the RAID benchmark and maintains low false-positive rates under adversarial attacks.