STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
Summary
STAR-Teaming introduces a multiplex-network-driven multi-agent framework that automates LLM red-teaming, achieving higher attack success rates with lower compute by organizing attack strategies into interpretable semantic communities.
# STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
Source: [https://arxiv.org/html/2604.18976](https://arxiv.org/html/2604.18976)
MinJae Jung, YongTaek Lim, Chaeyun Kim, Junghwan Kim, Kihyun Kim, Minwoo Kim (DATUMO INC)
###### Abstract
While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR-Teaming, a novel black-box framework for automated red teaming that effectively generates such prompts. STAR-Teaming integrates a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and employs network-driven optimization to sample effective attack strategies. This network-based approach recasts the intractable high-dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM's strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR-Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at [https://github.com/selectstar-ai/STAR-Teaming-paper](https://github.com/selectstar-ai/STAR-Teaming-paper).

WARNING: This paper contains model outputs that can be offensive in nature.
Chaeyun Kim, Junghwan Kim, and Kihyun Kim: work done while at DATUMO INC. Corresponding author: Minwoo Kim (mwkim@selectstar.ai).
## 1 Introduction
Large Language Models have demonstrated impressive performance across a wide range of tasks Achiam et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib1)); Team et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib31)); Touvron et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib33)) and in addressing real-world challenges Costa-Jussà et al. ([2022](https://arxiv.org/html/2604.18976#bib.bib8)); Schick et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib27)). As their deployment extends into safety-critical domains Singhal et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib29)); Li et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib17)), ensuring robust and responsible behavior has become a critical concern. In particular, evaluating how LLMs respond to harmful, illegal, or violent prompts is now essential Wei et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib35)); Zou et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib38)); Lin et al. ([2025](https://arxiv.org/html/2604.18976#bib.bib19)). This has led to growing interest in red-teaming methods that assess LLM robustness against jailbreak-style attacks.
To address growing safety concerns, recent work has shifted from manual to automated red teaming for scalable and systematic evaluation. These approaches fall into two broad categories: optimization-based attacks (Zou et al., [2023](https://arxiv.org/html/2604.18976#bib.bib38); Chao et al., [2023](https://arxiv.org/html/2604.18976#bib.bib7); Liu et al., [2023](https://arxiv.org/html/2604.18976#bib.bib21); Guo et al., [2024](https://arxiv.org/html/2604.18976#bib.bib10); Mehrotra et al., [2024](https://arxiv.org/html/2604.18976#bib.bib24); Liao and Sun, [2024](https://arxiv.org/html/2604.18976#bib.bib18)) and strategy-based attacks (Zeng et al., [2024](https://arxiv.org/html/2604.18976#bib.bib37); Shen et al., [2024](https://arxiv.org/html/2604.18976#bib.bib28); Samvelyan et al., [2024](https://arxiv.org/html/2604.18976#bib.bib26); Jin et al., [2024](https://arxiv.org/html/2604.18976#bib.bib15); Anil et al., [2024a](https://arxiv.org/html/2604.18976#bib.bib3); Liu et al., [2024](https://arxiv.org/html/2604.18976#bib.bib20)). By iteratively querying models, these approaches enable large-scale discovery of model vulnerabilities with minimal human intervention.
Figure 1: Overview of STAR-Teaming. STAR-Teaming samples and presents attack strategies. These strategies are passed to the attacker LLM, which generates harmful prompts accordingly.

Despite their effectiveness, these methods face two key limitations. First, most require extensive computational resources due to repeated querying or reinforcement-based optimization, limiting scalability. Second, while strategy-based methods incorporate human-developed jailbreak patterns, they lack transparency into why specific strategies work. They typically sample based on embedding similarity without analyzing causal patterns of success, making it hard to refine attacks or understand model vulnerabilities. As a result, both the efficiency and interpretability of automated attack generation remain limited.
In response to these challenges, we develop STAR-Teaming, which stands for Strategy-Response multiplex network Kivelä et al. ([2014](https://arxiv.org/html/2604.18976#bib.bib16)) Approach to automated Red-Teaming. At its core, STAR-Teaming builds a multiplex network that explicitly captures the statistical relationships between attack strategies and LLM responses. This structure supports two key objectives. First, it enables efficient sampling of promising strategies via structured community exploration, avoiding the over-sampling issues common in similarity-based methods. The network structure also makes it easy to adjust the size of the search space and to edit strategies directly. Second, our method provides interpretability by identifying which types of strategies consistently induce harmful behavior in specific model contexts.
As shown in Figure [1](https://arxiv.org/html/2604.18976#S1.F1), STAR-Teaming constructs a strategy-response network from past attack logs. The framework identifies communities of attack strategies $H(s)$ and response patterns $G(r)$, and estimates a mapping matrix $Z$ linking them. This matrix guides future strategy sampling and is continuously updated as attacks proceed, enabling adaptive improvement over time. The mapping is represented as a 2D matrix in which each element quantifies the interaction strength between a strategy community and a response community, so the representation is highly interpretable.
In contrast to prior work that retrieves context from a fixed, embedding\-based database, STAR\-Teaming formulates strategy selection as an optimization problem\. It explores strategies through probabilistic community\-level sampling, conditioned on observed successes and failures, thereby enabling the generation of more diverse and interpretable attacks\. A key feature of STAR\-Teaming is that instead of learning about specific individual strategies, it learns an effective probability distribution over a diverse set of them\. This allows our framework to effectively search the space of strategies\.
The main contributions of our work are summarized as follows:
- Novel Red Teaming Framework: We propose STAR-Teaming, a novel automated red-teaming framework that models the statistical links between strategies and responses through a multiplex network, and further supports modularity-guided dynamic expansion to assimilate emerging attack patterns at runtime.
- Optimization of Strategy Selection: STAR-Teaming unifies optimization and strategy approaches by treating strategy selection as an optimization task, enabling efficient sampling and adaptive learning.
- Effective and Efficient Performance: Our method significantly outperforms SOTA baselines in terms of Attack Success Rate (ASR) and time efficiency.
Figure 2: Overview of the STAR-Teaming architecture, consisting of (A) an Automated Red-Teaming Multi-Agent System and (B) a Multiplex Network for Strategy Sampling. In (A), the attacker crafts a prompt using a seed and strategy, and queries the target model. The scorer evaluates the response and assigns a success score. If the score is low, a new strategy is sampled from (B) to refine the next prompt.
## 2 Related Work
Recent work on adversarial prompt generation falls into two main categories: optimization\-based and strategy\-based approaches\.
Optimization-based approaches treat the LLM as a white-box system, generating jailbreak prompts through procedures like iterative querying, gradient-based token updates, or loss-guided feedback. Notable examples include GCG Zou et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib38)), AmpleGCG Liao and Sun ([2024](https://arxiv.org/html/2604.18976#bib.bib18)), and COLD-Attack Guo et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib10)). AutoDAN Liu et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib21)) uses a genetic algorithm to refine DAN-style prompts. PAIR Chao et al. ([2023](https://arxiv.org/html/2604.18976#bib.bib7)) leverages LLM feedback for iterative prompt refinement, while TAP Mehrotra et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib24)) extends this by adding pruning and branching to accelerate search and improve success rates.
Strategy-based approaches generate prompts using predefined or learned attack patterns, targeting higher-level semantic variation. Methods like PAP Zeng et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib37)), DAN Shen et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib28)), and Many-shot Jailbreaking Anil et al. ([2024a](https://arxiv.org/html/2604.18976#bib.bib3)) use fixed templates, while Rainbow Teaming Samvelyan et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib26)) defines eight types of strategies including emotional or indirect cues. AutoDAN-Turbo Liu et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib20)) adopts a multi-agent loop to iteratively discover and refine new strategies through attack–response–evaluation cycles. These methods offer greater semantic diversity and improve interpretability. However, they often involve multiple modules, leading to high computational overhead, and their fixed structure limits adaptability to new scenarios Lin et al. ([2025](https://arxiv.org/html/2604.18976#bib.bib19)). Additionally, sequential or embedding-based strategy selection struggles with inefficient search and poor alignment between semantic similarity and attack success, often resulting in low ASR.
We propose STAR-Teaming, a novel automated red-teaming framework that combines a strategy-response multiplex network Magnani et al. ([2021](https://arxiv.org/html/2604.18976#bib.bib22)); Traag et al. ([2019](https://arxiv.org/html/2604.18976#bib.bib34)) with a statistically grounded sampling module Nguyen et al. ([2017](https://arxiv.org/html/2604.18976#bib.bib25)), achieving high ASR and time efficiency.
## 3 STAR-Teaming
### 3.1 Overall Framework
As shown in Figure [2](https://arxiv.org/html/2604.18976#S1.F2), our approach integrates two core components to enable efficient and interpretable automated red-teaming: (A) a Multi-Agent System (MAS) and (B) a strategy-response Multiplex Network.
The MAS consists of three LLM-based agents (an attacker, a target, and a scorer) that interact in an iterative loop. The attacker generates a modified jailbreak prompt based on a given seed and selected strategy. The target responds to this prompt, and the scorer evaluates whether the attack was successful based on both the prompt and the response. If the score does not meet a predefined threshold, the attacker updates its strategy and retries, up to a maximum number of attempts. This loop automates adversarial prompt generation and evaluation with minimal human intervention.
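The control flow of this loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the four callables (`attacker`, `target`, `scorer`, `sample_strategy`) are hypothetical stand-ins for the LLM agents and the strategy sampler, the default threshold and attempt limit are placeholders, and treating a score equal to the threshold as a success is an assumption.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[Optional[str], float]]


def red_team_loop(
    seed: str,
    attacker: Callable[[str, Optional[str]], str],  # (seed, strategy) -> prompt
    target: Callable[[str], str],                   # prompt -> response
    scorer: Callable[[str, str], float],            # (prompt, response) -> score
    sample_strategy: Callable[[str], str],          # response -> next strategy
    threshold: float = 8.5,                         # placeholder value
    max_attempts: int = 5,                          # placeholder value
) -> Tuple[bool, History]:
    """Run attacker -> target -> scorer until the score meets the threshold
    or the attempt budget is exhausted. Returns (success, history)."""
    history: History = []
    strategy: Optional[str] = None  # the first attempt uses no strategy
    for _ in range(max_attempts):
        prompt = attacker(seed, strategy)
        response = target(prompt)
        score = scorer(prompt, response)
        history.append((strategy, score))
        if score >= threshold:
            return True, history
        # Score too low: pick a new strategy conditioned on the last response.
        strategy = sample_strategy(response)
    return False, history
```

The history of `(strategy, score)` pairs is what a log-based component, such as the multiplex network described below, would consume.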
To guide the attacker’s strategy selection, we introduce a novel retrieval mechanism based on a probabilistic multiplex network\. This network is constructed using past attack logs, modeling the relationship between clusters of strategy types and clusters of LLM responses\. By optimizing the mapping matrix between these clusters, our method enables adaptive sampling of promising strategies based on statistical patterns\. This improves both the attack success rate and the diversity of strategy exploration\.
In the following sections, we detail each part of the framework: Section [3.2](https://arxiv.org/html/2604.18976#S3.SS2) describes the MAS pipeline, Section [3.3](https://arxiv.org/html/2604.18976#S3.SS3) explains the construction of the multiplex network, and Section [3.4](https://arxiv.org/html/2604.18976#S3.SS4) outlines the probabilistic strategy sampling procedure.
### 3.2 Multi-Agent System
Recently, many MAS approaches have been proposed for automated red-teaming. These approaches usually have an attacker, a target, and a judge agent Mehrotra et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib24)); Samvelyan et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib26)); Liu et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib20)).
In this study, STAR-Teaming is built on a MAS that orchestrates an iterative process. An Attacker LLM generates jailbreak prompts for a Target LLM, whose responses are then scored by a Scorer LLM to assess malicious intent and evaluate jailbreak success. Additionally, we adopt an LLM-based strategy extractor inspired by AutoDAN-Turbo Liu et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib20)). The extractor systematically identifies effective jailbreak strategies from logs and populates a database, indexed by response embeddings, with the prompts that improved the score. This database then guides future attacks by dynamically injecting historically effective strategies into the Attacker LLM's prompt. The detailed mechanism of the strategy extractor, including the prompts used, is described in Appendix [F](https://arxiv.org/html/2604.18976#A6).
### 3.3 Multiplex Network Construction
We begin by constructing the network from an initial set of attack logs. At this stage, no predefined strategies exist, so these logs are generated by leveraging the inherent stochasticity of the attacker LLM (e.g., by setting temperature > 0). These raw logs are then processed using our LLM-based strategy extractor to create a structured dataset of '(strategy, response)' pairs. This dataset serves as the foundation for two networks: a Response Network and a Strategy Network. These are treated as layers within a single multiplex network, and their construction follows an identical procedure.
The process begins with the construction of the Response network, which models the relationships between different target responses. Text embeddings ($\mathbf{e}_r$) are extracted for each target response (e.g., "I cannot fulfill that request …", "Sure, I will help you …"). These embeddings are then used to compute a similarity map, $\mathbb{S}_r$, which is an $N_l \times N_l$ matrix where each element $(\mathbb{S}_r)_{l,l'}$ quantifies the pairwise similarity (e.g., using cosine similarity) between the embeddings of response $r_l$ and response $r_{l'}$. $N_l$ is the total number of unique responses.
From this similarity map, an adjacency matrix, $A^r$, for the Response network is derived by applying a predefined threshold, $\alpha_r$. The elements of $A^r$ are defined as:

$$A^{r}_{l,l'} = \begin{cases} 1, & \text{if } (\mathbb{S}_r)_{l,l'} \geq \alpha_r \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
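The thresholding in Eq. 1 can be illustrated with a small pure-Python sketch. The function names are hypothetical, cosine similarity is used because the text gives it as an example, and dropping self-loops (the diagonal) is an assumption made here, since Eq. 1 as written would connect every node to itself.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def threshold_adjacency(embeddings: Sequence[Sequence[float]], alpha: float) -> List[List[int]]:
    """Binary adjacency matrix: A[l][l'] = 1 when similarity >= alpha (Eq. 1).

    Assumption: the diagonal is left at 0 so that the graph has no self-loops.
    """
    n = len(embeddings)
    adjacency = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b and cosine(embeddings[a], embeddings[b]) >= alpha:
                adjacency[a][b] = 1
    return adjacency
```

Raising `alpha` yields a sparser graph and therefore tends to produce more, smaller communities in the subsequent community-detection step.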
The Leiden algorithm Traag et al. ([2019](https://arxiv.org/html/2604.18976#bib.bib34)) is then applied to the adjacency matrix $A^r$ to identify distinct communities of responses. Let $C(r)_j$ denote the $j$-th community of responses identified by the algorithm, where $j \in \{1, \dots, N_J\}$ and $N_J$ is the total number of detected communities in the Response network. The community membership of each response $r_l$ is represented by a vector $h_l$, and these vectors collectively form the matrix $\mathbf{G}(r)$. The vector $h_l$ is defined as $h_l = [h_{l1}, h_{l2}, \dots, h_{lN_J}]^T$, where

$$h_{lj} = \begin{cases} 1, & \text{if response } r_l \in C(r)_j \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

for each response $l \in \{1, \dots, N_l\}$ and each community $j \in \{1, \dots, N_J\}$.
An analogous procedure is followed for the Strategy network. Text embeddings, $\mathbf{e}_{\text{stg}}$, are generated for the names and definitions of various strategies (e.g., 'Framing,' 'Social Proof,' and 'Role-playing'). These embeddings are used to compute a strategy similarity map, $\mathbb{S}_{\text{stg}}$, an $N_k \times N_k$ matrix where $(\mathbb{S}_{\text{stg}})_{k,k'}$ is the similarity between strategy $S_k$ and strategy $S_{k'}$. $N_k$ is the total number of unique strategies. The adjacency matrix for the Strategy network, $A^S$, is then derived from $\mathbb{S}_{\text{stg}}$ using a predefined threshold $\alpha_{\text{stg}}$:

$$A^{S}_{k,k'} = \begin{cases} 1, & \text{if } (\mathbb{S}_{\text{stg}})_{k,k'} \geq \alpha_{\text{stg}} \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
Applying the Leiden algorithm to $A^S$ yields communities of strategies. Let $C(S)_i$ denote the $i$-th community of strategies, where $i \in \{1, \dots, N_I\}$ and $N_I$ is the total number of detected communities in the Strategy network. The community membership of each strategy $S_k$ is represented by a vector $h_k$, forming the matrix $\mathbf{H}(S)$. The vector $h_k$ is defined as $h_k = [h_{k1}, h_{k2}, \dots, h_{kN_I}]^T$, where

$$h_{ki} = \begin{cases} 1, & \text{if strategy } S_k \in C(S)_i \\ -\frac{1}{N_I - 1}, & \text{otherwise} \end{cases} \tag{4}$$

for each strategy $k \in \{1, \dots, N_k\}$ and each community $i \in \{1, \dots, N_I\}$. This negative term serves a dual purpose: it acts as a regularizer to prevent parameter divergence during optimization and allows for proper adjustment of the probability distribution. Specifically, when the probability of sampling a successful strategy increases, this term ensures that the probabilities of other, less effective strategies decrease accordingly.
Figure [2](https://arxiv.org/html/2604.18976#S1.F2)(B) provides a visual representation: the upper panel depicts the Response network, while the lower panel shows the Strategy network. In these visualizations, nodes correspond to individual responses or strategies, and their colors signify distinct community affiliations. For instance, if a response node belongs to the second community out of five total communities (i.e., $N_J = 5$), its community membership vector would be $[0, 1, 0, 0, 0]^T$. The community structure described above can also be extended dynamically during red-teaming via a modularity-based expansion criterion; we defer the details to Section [4.6](https://arxiv.org/html/2604.18976#S4.SS6) and Appendix [J](https://arxiv.org/html/2604.18976#A10).
### 3.4 Probabilistic Optimization and Sampling For Strategy Retrieval
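The two membership encodings (Eq. 2 and Eq. 4) can be sketched as follows, assuming community labels have already been produced by a community-detection step such as Leiden. The function names and the 0-based label convention are assumptions of this sketch.

```python
from typing import List, Sequence


def response_membership(labels: Sequence[int], n_communities: int) -> List[List[float]]:
    """Rows of G(r): plain one-hot vectors (Eq. 2). Labels are 0-based."""
    return [
        [1.0 if label == j else 0.0 for j in range(n_communities)]
        for label in labels
    ]


def strategy_membership(labels: Sequence[int], n_communities: int) -> List[List[float]]:
    """Rows of H(S): 1 for the strategy's own community and -1/(N_I - 1)
    for every other community (Eq. 4). Labels are 0-based."""
    if n_communities < 2:
        raise ValueError("Eq. 4 needs at least two strategy communities")
    off_value = -1.0 / (n_communities - 1)
    return [
        [1.0 if label == i else off_value for i in range(n_communities)]
        for label in labels
    ]
```

Each row produced by `strategy_membership` sums to zero, which is one way to see why raising the weight of one strategy community necessarily lowers the weight of the others.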
Upon constructing the multiplex network, which comprises the Response community matrix $\mathbf{G}(r)$ and the Strategy community matrix $\mathbf{H}(S)$, we aim to sample relevant strategies based on the learned network topology. To achieve this, we formulate an energy function based on the Hamiltonian of the Inverse Ising Problem Nguyen et al. ([2017](https://arxiv.org/html/2604.18976#bib.bib25)). The energy $E(r_p, s_q)$ for a given response $r_p$ and strategy $s_q$ pair is defined as:

$$E(r_p, s_q) = -\sum_{ij} Z_{ij}\, \mathbf{O}^{ij}_{pq}. \tag{5}$$

We introduce $\mathbf{O}^{ij}_{pq}$ to express $\mathbf{G}(r_p)_j\, \mathbf{H}(s_q)_i$ concisely, where $\mathbf{G}(r_p)_j$ is the $j$-th component of the community vector for response $r_p$ (indicating membership in response community $j$) and $\mathbf{H}(s_q)_i$ is the $i$-th component of the community vector for strategy $s_q$ (indicating membership in strategy community $i$). $Z_{ij}$ represents the learned coupling strength (interaction parameter) between the $j$-th response community and the $i$-th strategy community. These parameters $Z = \{Z_{ij}\}$ are learned and subsequently used to determine strategy probabilities.
The probability of a response-strategy pair $(r_p, s_q)$ given $Z$ is modeled using a Boltzmann distribution $p(x) \propto \exp(-E(x))$. The likelihood $L(Z|D)$ of parameters $Z$ given the dataset $D$ of $(r_p, s_q)$ pairs is:

$$L(Z|D) \propto \prod_{(r_p, s_q) \in D} \exp\left(\sum_{ij} Z_{ij}\, \mathbf{O}^{ij}_{pq}\right) \tag{6}$$
The parameters $Z_{ij}$ are optimized by maximizing the log-likelihood:

$$Z^{*} = \underset{Z}{\text{argmax}}\, \log L(Z|D) \tag{7}$$

This optimization, an instance of the Inverse Ising Problem, can be solved efficiently using gradient ascent. The problem is typically formulated to be convex, leading to a unique solution for $Z$. Detailed proofs regarding the convexity and its foundation in the Maximum Entropy Principle are provided in Appendix [C](https://arxiv.org/html/2604.18976#A3). The resulting interaction matrix $Z$ has dimensions $N_I \times N_J$. This represents a significantly smaller parameter space compared to contemporary large-scale models, offering advantages in terms of computational efficiency and learning speed.
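For readers who want the intermediate step, the gradient of this log-likelihood can be worked out directly. The derivation below is a sketch that assumes the proportionality in Eq. 6 is normalized by the partition function $\mathcal{Z}(Z)$, summed over all response-strategy configurations $(r', s')$, and writes $N_D = |D|$ for the number of logged pairs.

```latex
\begin{aligned}
\log L(Z \mid D)
  &= \sum_{(r_p, s_q) \in D} \sum_{ij} Z_{ij}\, \mathbf{O}^{ij}_{pq}
     \;-\; N_D \log \mathcal{Z}(Z), \\
\mathcal{Z}(Z)
  &= \sum_{(r', s')} \exp\Big( \sum_{ij} Z_{ij}\, \mathbf{O}^{ij}_{r's'} \Big), \\
\frac{\partial \log \mathcal{Z}(Z)}{\partial Z_{ij}}
  &= \sum_{(r', s')} \frac{\exp\big(-E(r', s')\big)}{\mathcal{Z}(Z)}\, \mathbf{O}^{ij}_{r's'}
   = \big\langle \mathbf{O}^{ij} \big\rangle, \\
\frac{\partial \log L(Z \mid D)}{\partial Z_{ij}}
  &= \sum_{(r_p, s_q) \in D} \mathbf{O}^{ij}_{pq}
     \;-\; N_D\, \big\langle \mathbf{O}^{ij} \big\rangle .
\end{aligned}
```

The first term of $\log L$ is linear in $Z$, and $\log \mathcal{Z}(Z)$ is a log-sum-exp of linear functions and hence convex, so $\log L$ is concave in $Z$. This is the standard exponential-family argument and is consistent with the convexity statement above.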
The update rule for $Z_{ij}$ at iteration $t+1$ using gradient ascent on the log-likelihood is:

$$Z^{t+1}_{ij} = Z^{t}_{ij} + \text{lr} \cdot \frac{\partial \log L(Z^{t})}{\partial Z_{ij}} \tag{8}$$

$$\frac{\partial \log L(Z^{t})}{\partial Z_{ij}} = \sum_{(r_p, s_q) \in D} \mathbf{O}^{ij}_{pq} - N_D \left\langle \mathbf{O}^{ij} \right\rangle \tag{9}$$

where lr is the learning rate, and $N_D = |D|$ is the number of data pairs in the dataset. The first term on the right-hand side of Eq. 9 represents the empirical co-occurrence of response community $j$ and strategy community $i$ in the dataset $D$. The second term, $\left\langle \mathbf{O}^{ij} \right\rangle$, denotes the expected co-occurrence under the model distribution

$$P(Z^{t}) = \frac{\exp(-E(r', s' | Z^{t}))}{\mathcal{Z}(Z^{t})}, \tag{10}$$

where $\mathcal{Z}(Z^{t})$ is the partition function. The variable $r'$ in the final sampling step denotes an unseen response encountered during inference.
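The fitting procedure in Eqs. 8 to 10 can be sketched in pure Python. This is an illustration under stated assumptions rather than the authors' code: each logged pair is reduced to its `(response community, strategy community)` indices, the model distribution is taken over all $N_I \times N_J$ such community pairs, and the gradient is divided by $N_D$ for numerical stability (equivalent to rescaling the learning rate). All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def h(i: int, s: int, n_i: int) -> float:
    """Eq. 4 entry: 1 if community i is the strategy's community s, else -1/(N_I - 1)."""
    return 1.0 if i == s else -1.0 / (n_i - 1)


def model_expectation(z: List[List[float]], n_i: int, n_j: int) -> List[List[float]]:
    """<O^{ij}> under p(j', s') proportional to exp(sum_i Z[i][j'] * h(i, s'))."""
    logits = {
        (j, s): sum(z[i][j] * h(i, s, n_i) for i in range(n_i))
        for j in range(n_j)
        for s in range(n_i)
    }
    top = max(logits.values())  # subtract the max before exp for stability
    weights = {state: math.exp(v - top) for state, v in logits.items()}
    partition = sum(weights.values())  # only N_I * N_J terms
    expected = [[0.0] * n_j for _ in range(n_i)]
    for (j, s), w in weights.items():
        for i in range(n_i):
            expected[i][j] += (w / partition) * h(i, s, n_i)
    return expected


def fit_coupling(
    pairs: Sequence[Tuple[int, int]],  # (response community, strategy community)
    n_i: int,
    n_j: int,
    lr: float = 0.1,
    steps: int = 500,
) -> List[List[float]]:
    """Gradient ascent on the log-likelihood; returns Z with shape N_I x N_J."""
    n_d = len(pairs)
    empirical = [[0.0] * n_j for _ in range(n_i)]
    for j_r, i_s in pairs:
        for i in range(n_i):
            empirical[i][j_r] += h(i, i_s, n_i)  # sum of O^{ij} over the data
    z = [[0.0] * n_j for _ in range(n_i)]
    for _ in range(steps):
        expected = model_expectation(z, n_i, n_j)
        for i in range(n_i):
            for j in range(n_j):
                grad = empirical[i][j] - n_d * expected[i][j]  # Eq. 9
                z[i][j] += lr * grad / n_d                     # Eq. 8, rescaled by 1/N_D
    return z
```

On a toy log where response community 0 mostly co-occurs with strategy community 1 (and vice versa), the fitted `z[1][0]` ends up larger than `z[0][0]`, reflecting the stronger coupling.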
In summary, the described procedure yields the response community representations $\mathbf{G}(r)$, strategy community representations $\mathbf{H}(S)$, and the inter-layer interaction matrix $Z$. With these components learned from the log data, the model can sample or predict an optimal strategy (or its community representation $\mathbf{H}(S')$) when presented with a new response $r'$.
Given a new response $r'$, its community representation $\mathbf{G}(r')$ (a vector of length $N_J$) is determined by assigning it to the community of the most similar central node. Specifically, we compute the cosine similarity between the embedding of $r'$ and the embedding of each community's central node, and assign $r'$ to the community with the highest similarity score in a winner-takes-all fashion. The probability of selecting a strategy $S'$ that belongs to strategy community $k$ (represented by a one-hot vector $h_{S'}$ where $h_{S',k} = 1$), conditioned on $\mathbf{G}(r')$ and the learned $Z$ parameters, is given by:

$$P(\mathbf{H}(s_k) \mid \mathbf{G}(r'), Z) \propto \exp\left(\beta \sum_{j=1}^{N_J} Z_{kj}\, \mathbf{G}(r')_j\right) \tag{11}$$

The term $\sum_{j=1}^{N_J} Z_{kj}\, \mathbf{G}(r')_j$ is the score (logit) for strategy community $k$. The product $Z \cdot \mathbf{G}(r')$ can be interpreted as the vector of these scores for all $N_I$ strategy communities. Strategies can then be sampled based on these probabilities. Here, $\beta$ is an inverse-temperature parameter that controls the sharpness of the sampling distribution: larger $\beta$ concentrates probability mass on the top-scoring strategy communities (exploitation), while smaller $\beta$ yields a more uniform sampling distribution (exploration). Note that $\beta$ enters only at sampling time, not during the estimation of $Z$ in Eq. [12](https://arxiv.org/html/2604.18976#S3.E12); it is adaptively scheduled so that the top-3 strategies carry approximately 80% of the total probability mass, as detailed in Appendix [A](https://arxiv.org/html/2604.18976#A1) and Algorithm [2](https://arxiv.org/html/2604.18976#alg2).
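The inference path described here (winner-takes-all assignment, then temperature-scaled sampling as in Eq. 11) can be sketched as below. The function names are hypothetical. In particular, `calibrate_beta` is only one plausible way to hit a target top-3 mass, using bisection over $\beta$; it is an assumption of this sketch and not the schedule from the paper's Appendix A or Algorithm 2, which is not reproduced here.

```python
import math
import random
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_community(embedding: Sequence[float], centers: Sequence[Sequence[float]]) -> List[float]:
    """One-hot G(r'): the community whose central node is most similar wins."""
    sims = [cosine(embedding, c) for c in centers]
    best = max(range(len(sims)), key=sims.__getitem__)
    return [1.0 if j == best else 0.0 for j in range(len(sims))]


def community_probs(z: Sequence[Sequence[float]], g: Sequence[float], beta: float) -> List[float]:
    """Softmax over strategy communities with logits beta * (Z @ G(r')) (Eq. 11)."""
    logits = [beta * sum(z_kj * g_j for z_kj, g_j in zip(row, g)) for row in z]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def top_mass(probs: Sequence[float], k: int = 3) -> float:
    return sum(sorted(probs, reverse=True)[:k])


def calibrate_beta(z, g, target: float = 0.8, k: int = 3, hi: float = 64.0, iters: int = 60) -> float:
    """Hypothetical schedule: bisect beta so the top-k mass is about `target`."""
    lo = 0.0
    if top_mass(community_probs(z, g, lo), k) >= target:
        return lo
    if top_mass(community_probs(z, g, hi), k) < target:
        return hi  # target not reachable within the search range
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if top_mass(community_probs(z, g, mid), k) < target:
            lo = mid
        else:
            hi = mid
    return hi


def sample_community(probs: Sequence[float], rng: random.Random) -> int:
    """Draw one strategy-community index according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Bisection works here because the top-k mass of a softmax does not decrease as $\beta$ grows, so there is at most one crossing of the target.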
This model is very economical because it has very few parameters, only about $N_I \times N_J \approx O(10^3)$. Unlike the general inverse Ising problem, one-hot encoding reduces the configuration space from $O(2^N)$ to $O(N)$, so the partition function $\mathcal{Z}$ remains tractable even as the system size increases. Empirically, optimizing the mapping matrix takes less than a second, so its cost is negligible. Furthermore, by weighting updates with the attack score during training, as in the following reward gradient, the model can be maintained over its lifetime:

$$\frac{\partial \log L(Z^{t})}{\partial Z_{ij}} = f_{sc}(r^{t}) \left( \mathbf{O}^{ij}_{pq} - \left\langle \mathbf{O}^{ij}_{pq} \right\rangle \right) \tag{12}$$

where $r^t$ and $s^t$ are the response and strategy involved in the current MAS iteration, and $f_{sc}(r^t)$ is a function of the score for this attack. This learning mechanism, embedded in Equation [12](https://arxiv.org/html/2604.18976#S3.E12), is designed to learn from failures as well as successes. The scoring function $f_{sc}(r^t)$ is designed to be positive for successful attacks and negative for failed ones. When an attack fails, the score is negative, and thus the corresponding interaction strength $Z_{ij}$ (linking the ineffective strategy community and the refusal response community) is penalized and reduced during the gradient update.
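A single online step of Eq. 12 can be sketched as follows. The function name and the signed weight `f_sc` passed in by the caller are assumptions of this sketch (the text only states that the weight is positive for successes and negative for failures), and the model expectation is again taken over all community pairs.

```python
import math
from typing import List


def reward_update(
    z: List[List[float]],  # current coupling matrix, shape N_I x N_J
    j_r: int,              # response community observed in this iteration
    i_s: int,              # strategy community used in this iteration
    f_sc: float,           # signed score weight: > 0 success, < 0 failure
    lr: float = 0.05,
) -> List[List[float]]:
    """Return a copy of Z after one score-weighted gradient step (Eq. 12)."""
    n_i, n_j = len(z), len(z[0])
    off_value = -1.0 / (n_i - 1)

    def h(i: int, s: int) -> float:
        return 1.0 if i == s else off_value

    # Model distribution over all (response community, strategy community) states.
    logits = [
        [sum(z[i][j] * h(i, s) for i in range(n_i)) for s in range(n_i)]
        for j in range(n_j)
    ]
    top = max(max(row) for row in logits)
    weights = [[math.exp(v - top) for v in row] for row in logits]
    partition = sum(sum(row) for row in weights)

    new_z = [row[:] for row in z]
    for i in range(n_i):
        for j in range(n_j):
            observed = h(i, i_s) if j == j_r else 0.0
            expected = sum(weights[j][s] / partition * h(i, s) for s in range(n_i))
            new_z[i][j] += lr * f_sc * (observed - expected)
    return new_z
```

With a positive `f_sc` the coupling between the observed response community and the strategy community that was used goes up; with a negative `f_sc` the same entry goes down, which matches the penalization of failed attacks described above.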
## 4 Experimental Evaluation
### 4.1 Experimental Setup
Datasets. We evaluated STAR-Teaming against baselines using a total of 400 diverse malicious requests from HarmBench Mazeika et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib23)). Specifically, we utilized the text-only set of HarmBench, which comprises three distinct taxonomies: Contextual, Standard, and Copyright. Furthermore, we also evaluated STAR-Teaming and the baselines on the StrongReject dataset Souly et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib30)), which contains a total of 313 malicious requests.
Implementation. To evaluate STAR-Teaming, we conduct experiments by varying the configuration of attacker and target agents. An attack attempt is assessed by the judge model once either the maximum of 140 attack repetitions is reached or the attack score exceeds 8.5. For evaluating attack attempts on HarmBench, we utilize Llama-2-13b-cls as the judge model, which provides a binary output ('yes' for success, 'no' for failure) to determine the Attack Success Rate (ASR). For the StrongReject dataset, we perform evaluation using a fine-tuned Gemma-2b model, yielding a StrongREJECT score (0-1). For both metrics, a higher score indicates a more effective attack. For fair comparison, baseline ASR values on models included in the original HarmBench benchmark Mazeika et al. ([2024](https://arxiv.org/html/2604.18976#bib.bib23)) are taken directly from their reported results. For more recent target models (Gemma3, Qwen3, Llama-3.1, GPT-4o, and Claude-3.5-Sonnet), which were not covered by the original benchmark, we reproduce the most widely cited attack baselines using their respective official repositories: GCG, AutoDAN, and PAIR via the official HarmBench repository, and TAP and AutoDAN-Turbo via their own official implementations. We provide further details in Appendix [A](https://arxiv.org/html/2604.18976#A1).
| Model | GCG | GCG-M | GCG-T | PEZ | GBDA | UAT | AP | SFS | ZS | PAIR | TAP | TAP-T | AutoDAN | PAP-top5 | Human | Direct | AutoDAN-Turbo | STAR-Teaming (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2 7b chat | 32.5 | 21.2 | 19.7 | 1.8 | 1.4 | 4.5 | 15.3 | 4.3 | 2.0 | 9.3 | 9.3 | 7.8 | 0.5 | 2.7 | 0.8 | 0.8 | 36.6 | 71.0 |
| Llama-2 13b chat | 30.0 | 11.3 | 16.4 | 1.7 | 2.2 | 1.5 | 16.3 | 6.0 | 2.9 | 15.0 | 14.2 | 8.0 | 0.8 | 3.3 | 1.7 | 2.8 | 34.6 | 71.5 |
| Llama-2 70b chat | 37.5 | 10.8 | 22.1 | 3.3 | 2.3 | 4.0 | 20.5 | 7.0 | 3.0 | 14.5 | 13.3 | 16.3 | 2.8 | 4.1 | 2.2 | 2.8 | 42.6 | 61.0 |
| Vicuna 7b | 65.5 | 61.5 | 60.8 | 19.8 | 19.0 | 19.3 | 56.3 | 42.3 | 27.2 | 53.5 | 51.0 | 59.8 | 66.0 | 18.9 | 39.0 | 24.3 | 96.3 | 93.8 |
| Baichuan 2 7b | 61.5 | 40.7 | 46.4 | 32.3 | 29.8 | 28.5 | 48.3 | 26.8 | 27.9 | 37.3 | 51.0 | 58.5 | 53.3 | 19.0 | 27.2 | 18.8 | 83.8 | 91.5 |
| Qwen 7b chat | 59.2 | 52.5 | 38.3 | 13.2 | 12.7 | 11.0 | 49.7 | 31.8 | 15.6 | 50.2 | 53.0 | 59.0 | 47.3 | 13.3 | 24.6 | 13.0 | 82.7 | 90.8 |
| Solar 10.7B-Instruct | 57.5 | 61.6 | 58.9 | 56.1 | 54.5 | 54.0 | 54.3 | 58.3 | 54.9 | 56.8 | 66.5 | 65.8 | 72.5 | 31.3 | 61.2 | 61.3 | 95.7 | 93.8 |
| OpenChat 3.5 1210 | 66.3 | 54.6 | 57.3 | 38.9 | 44.5 | 40.8 | 57.0 | 52.5 | 43.3 | 52.5 | 63.5 | 66.1 | 73.5 | 26.9 | 51.3 | 46.0 | 96.3 | 93.5 |
| zephyr | 69.5 | 62.5 | 61.0 | 62.5 | 62.8 | 62.3 | 60.5 | 62.0 | 60.0 | 58.8 | 66.5 | 69.3 | 75.0 | 32.9 | 66.0 | 65.8 | 96.3 | 95.8 |
| Gemma3-4b-it | 28.2 | 15.8 | 13.6 | - | - | - | - | - | - | 59.1 | 60.4 | - | 87.8 | - | - | 5.8 | 34.0 | 64.7 |
| Gemma3-12b-it | 19.5 | 13.7 | 20.7 | - | - | - | - | - | - | 48.5 | 61.2 | - | 47.0 | - | - | 31.5 | 24.8 | 56.6 |
| Qwen3-4b | 32.0 | 20.2 | 38.4 | - | - | - | - | - | - | 25.8 | 51.8 | - | 12.3 | - | - | 12.4 | 47.8 | 72.5 |
| Qwen3-8b | 17.3 | 21.2 | 13.1 | - | - | - | - | - | - | 28.3 | 54.5 | - | 11.0 | - | - | 18.5 | 38.3 | 71.5 |
| GPT-4 Turbo 1106 | - | - | 22.3 | - | - | - | - | - | 13.9 | 33.0 | 36.4 | 58.5 | - | 11.1 | 2.6 | 9.3 | 83.8 | 78.1 |
| GPT-4o | - | - | - | - | - | - | - | - | - | 53.0 | 66.0 | - | - | - | - | - | 76.0 | 76.1 |
| Gemini Pro | - | - | 18.0 | - | - | - | - | - | 14.8 | 35.1 | 38.8 | 31.2 | - | 11.8 | 12.1 | 18.0 | 66.3 | 72.5 |
| Claude 3.5 Sonnet | - | - | - | - | - | - | - | - | - | 4.0 | 5.0 | - | - | - | - | - | 2.0 | 12.0 |
| Average | 44.3 | 34.4 | 33.8 | 25.5 | 25.5 | 25.1 | 42.0 | 32.3 | 24.1 | 37.3 | 44.8 | 45.5 | 42.3 | 15.9 | 26.2 | 22.1 | 61.0 | 74.5 |
Table 1: Red\-Teaming attack performance \(ASR\) on HarmBench \(Mazeika et al\., [2024](https://arxiv.org/html/2604.18976#bib.bib23)\)\.

Baselines\. We compare STAR\-Teaming with six baseline approaches\. For a consistent comparison, we run every baseline in its black\-box setting: GCG\-T Zou et al\. \([2023](https://arxiv.org/html/2604.18976#bib.bib38)\), PAIR Chao et al\. \([2023](https://arxiv.org/html/2604.18976#bib.bib7)\), TAP Mehrotra et al\. \([2024](https://arxiv.org/html/2604.18976#bib.bib24)\), PAP\-top5 Zeng et al\. \([2024](https://arxiv.org/html/2604.18976#bib.bib37)\), Rainbow Teaming Samvelyan et al\. \([2024](https://arxiv.org/html/2604.18976#bib.bib26)\), and AutoDAN\-Turbo Liu et al\. \([2024](https://arxiv.org/html/2604.18976#bib.bib20)\)\. GCG\-T generates universal jailbreak triggers using optimization\-based adversarial search\. PAIR iteratively refines prompts based on feedback from a judge LLM to achieve jailbreaks\. TAP extends PAIR with branching and pruning for more efficient prompt search\. PAP\-top5 selects from a pool of 40 human\-crafted strategies to generate adversarial prompts\. AutoDAN\-Turbo is the closest to our method: it uses multiple predefined strategies within a Multi\-Agent System to generate jailbreak prompts\.
Employed LLMs\. We select Gemma\-1\.1\-7b\-it Anil et al\. \([2024b](https://arxiv.org/html/2604.18976#bib.bib4)\) and Llama3\-8b\-Instruct AI \([2024](https://arxiv.org/html/2604.18976#bib.bib2)\) as attackers, and GPT\-4o\-mini Hurst et al\. \([2024](https://arxiv.org/html/2604.18976#bib.bib12)\) as the scorer and strategy extractor\. Our primary open\-source targets are Llama2 Touvron et al\. \([2023](https://arxiv.org/html/2604.18976#bib.bib33)\), Llama3\.1 Grattafiori et al\. \([2024](https://arxiv.org/html/2604.18976#bib.bib9)\), Gemma3 Team et al\. \([2025](https://arxiv.org/html/2604.18976#bib.bib32)\), and Qwen3 Yang et al\. \([2025](https://arxiv.org/html/2604.18976#bib.bib36)\)\. To ensure diversity in our target models, we also conduct experiments with GPT\-4 Hurst et al\. \([2024](https://arxiv.org/html/2604.18976#bib.bib12)\), Gemini Pro Team et al\. \([2023](https://arxiv.org/html/2604.18976#bib.bib31)\), and Claude\-3\.5 Anthropic \([2024](https://arxiv.org/html/2604.18976#bib.bib5)\) as closed\-source targets\.
### 4\.2 Comparison with state\-of\-the\-art
Table [1](https://arxiv.org/html/2604.18976#S4.T1) presents the primary results of our empirical evaluation on the HarmBench benchmark\. On average, STAR\-Teaming improves substantially over existing baselines\. Although it tends to reach a slightly lower ASR than its closest counterpart, AutoDAN\-Turbo, on less robust targets, it significantly outperforms AutoDAN\-Turbo on more challenging models, including the Llama family, Gemma3, and Qwen3\. Notably, STAR\-Teaming is the only method to exceed 10% ASR against Claude\-3\.5\-Sonnet, highlighting its effectiveness even against strongly aligned closed\-source models\. Overall, STAR\-Teaming attains an average ASR of 74\.5%, surpassing the next\-best baseline, AutoDAN\-Turbo \(61\.0%\), by 13\.5 percentage points\.
### 4\.3 Performance across attack time
Figure 3: Performance comparison by number of iterations\.

Figure [3](https://arxiv.org/html/2604.18976#S4.F3) shows the ASR of STAR\-Teaming and the baselines as a function of the number of iterations on HarmBench, for the Llama2\-7b\-chat target\. STAR\-Teaming consistently outperforms the baselines at every iteration budget, which indicates that it remains efficient even under low computational cost\. Furthermore, its performance continues to increase even when the number of iterations is high \(150\)\. In the next section, we discuss why STAR\-Teaming outperforms the other methods by such a large margin\.
### 4\.4 Effectiveness of Multiplex Network
First, we illustrate the sampling proportions of attack strategies in the HarmBench test process, both with and without a Multiplex Network\. Here, the “without Multiplex Network” condition refers to a setup in which the number of nodes for agents, system prompts, and strategies is the same as in the full STAR\-Teaming approach, but the response\-strategy Multiplex Network is not constructed and no probability optimization via a mapping matrix is performed\. Instead, strategy retrieval is based solely on embedding similarity\.
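To make the ablated condition concrete, the sketch below ranks strategies purely by cosine similarity between a query embedding and stored embeddings\. It is an illustration under assumptions: the text does not specify what is embedded, so treating the query as the embedding of the target's latest response and the keys as per\-strategy embeddings is our guess, and the names `retrieve_by_similarity`, `library`, and the strategy labels are hypothetical\.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_by_similarity(
    query: Sequence[float],
    library: Dict[str, Sequence[float]],
    k: int = 1,
) -> List[str]:
    """Return the k strategy names whose embeddings are closest to the query."""
    ranked = sorted(library, key=lambda name: cosine(query, library[name]), reverse=True)
    return ranked[:k]
```

Because the ranking depends only on proximity in embedding space, strategies whose embeddings sit near many queries are returned repeatedly regardless of how well they score, which is the kind of concentration this ablation is designed to expose\.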
Figure [4](https://arxiv.org/html/2604.18976#S4.F4) reveals that STAR\-Teaming without a Multiplex Network concentrates its samples on a limited set of strategies, with nearly 15% originating from a single strategy and over 30% from just four, whereas STAR\-Teaming with a Multiplex Network samples a wider variety\. We then analyze how this phenomenon impacts ASR performance, as illustrated in Figure [5](https://arxiv.org/html/2604.18976#S4.F5)\. Interestingly, the Multiplex Network enhances the sampling of high\-scoring attack strategies\. This indicates that while conventional embedding\-similarity\-based retrieval tends to oversample ineffective strategies, the Multiplex Network shifts sampling toward effective ones\. Thus, the advantage of STAR\-Teaming over other strategy\-based approaches appears largely attributable to how it models strategy retrieval, and in particular to its ability to retrieve effective strategies\.

Figure 4: Distribution of selected strategies by retrieval module \(A\) without Multiplex Network and \(B\) with Multiplex Network\. This distribution shows the degree of uniformity in the selection of attack strategies during attacks\.

Figure 5: The correlation between the average score obtained with a given strategy and the number of attacks using that strategy \(A\) without Multiplex Network and \(B\) with Multiplex Network\.

Table [2](https://arxiv.org/html/2604.18976#S4.T2) compares identical experimental conditions on the HarmBench dataset with and without the multiplex network\. Using the multiplex network increases ASR by 6\.0 percentage points \(65\.0% to 71\.0%\)\. It also improves attack diversity: without the network, the Self\-BLEU of the attack prompts is 0\.21 higher \(0\.46 versus 0\.25\), indicating more redundant attacks\. The table further reports the Gini coefficient of the sampled strategy distribution, which measures how heavily sampling concentrates on a few strategies, and the Pearson correlation between strategy scores and strategy usage, which indicates whether effective strategies are selected frequently\. With the multiplex network, the Gini coefficient falls from 0\.36 to 0\.19 and the Pearson correlation rises from \-0\.08 to 0\.81, so sampling is both more diverse and more strongly tilted toward higher\-performing strategies\. A community\-level breakdown of which strategies are disproportionately effective against each target model family is presented in Appendix [O](https://arxiv.org/html/2604.18976#A15)\.

| STAR\-Teaming | ASR | Self\-BLEU | Gini | Pearson |
| --- | --- | --- | --- | --- |
| w/ Multiplex Network | 71\.0 | 0\.25 | 0\.19 | 0\.81 |
| w/o Multiplex Network | 65\.0 | 0\.46 | 0\.36 | \-0\.08 |

Table 2: Comparative experiment with and without a multiplex network\.
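For readers who want to reproduce the two distributional metrics in Table 2 from their own logs, the following sketch computes a Gini coefficient over per\-strategy usage counts and a Pearson correlation between per\-strategy scores and usage\. It uses the standard textbook definitions; whether the paper applies any additional normalization is not stated, so the exact values may differ from the released code\.

```python
import math
from typing import Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient of non-negative counts (0 = perfectly uniform usage)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive count")
    # Closed form on ascending-sorted values: G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)
```

Passing the number of times each strategy was sampled to `gini`, and the per\-strategy average scores together with those counts to `pearson`, yields the two quantities the table reports\.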
### 4\.5 StrongReject

| Model | Llama2\-7b | Llama3\-8b | Gemma\-7b |
| --- | --- | --- | --- |
| GCG\-T | 0\.12 | 0\.10 | 0\.10 |
| PAIR | 0\.05 | 0\.12 | 0\.08 |
| TAP | 0\.04 | 0\.13 | 0\.16 |
| PAP\-top5 | 0\.10 | 0\.08 | 0\.06 |
| Rainbow Teaming | 0\.08 | 0\.09 | 0\.08 |
| Ours \(Gemma\-7b\) | 0\.57 | 0\.45 | 0\.55 |
| Ours \(Llama3\-8b\) | 0\.50 | 0\.43 | 0\.61 |

Table 3: Red\-Teaming attack performance \(StrongREJECT score\) on StrongReject\.

To evaluate STAR\-Teaming in various environments, we compare our method with baselines on the StrongReject dataset and also run experiments with a second attacker LLM\. Table [3](https://arxiv.org/html/2604.18976#S4.T3) shows that STAR\-Teaming outperforms all baselines\. Averaged over both attacker settings and all three targets, STAR\-Teaming achieves a score of 0\.52, which is 0\.41 points higher than the best\-performing baseline, TAP \(0\.11\)\. Interestingly, changing the attacker agent makes almost no difference to the score\. Ultimately, STAR\-Teaming consistently outperforms the baselines, demonstrating effectiveness across various targets, datasets, and attack agents\.
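The averages quoted above can be checked directly from the Table 3 values\. The short script below is only an arithmetic check; the dictionary layout and variable names are ours\.

```python
from statistics import mean

# StrongREJECT scores copied from Table 3, ordered (Llama2-7b, Llama3-8b, Gemma-7b).
scores = {
    "GCG-T": (0.12, 0.10, 0.10),
    "PAIR": (0.05, 0.12, 0.08),
    "TAP": (0.04, 0.13, 0.16),
    "PAP-top5": (0.10, 0.08, 0.06),
    "Rainbow Teaming": (0.08, 0.09, 0.08),
    "Ours (Gemma-7b)": (0.57, 0.45, 0.55),
    "Ours (Llama3-8b)": (0.50, 0.43, 0.61),
}

# Per-row averages over the three target models.
averages = {name: mean(vals) for name, vals in scores.items()}

# Average of STAR-Teaming over both attacker settings and all targets.
ours = mean(v for name, vals in scores.items() if name.startswith("Ours") for v in vals)

# Strongest baseline by average score, and the gap to STAR-Teaming.
baselines = [name for name in scores if not name.startswith("Ours")]
best_baseline = max(baselines, key=averages.get)
margin = ours - averages[best_baseline]
```

Rounded to two decimals, `ours` is 0\.52 and `margin` is 0\.41, with TAP as the strongest baseline\.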
### 4\.6 Effectiveness of Dynamic Network Expansion
While the default configuration of STAR\-Teaming operates on a static topology in which the community structure is fixed after initialization, the framework naturally admits an extension to dynamic network expansion through a modularity\-based criterion\. Specifically, whenever a new node \(i\.e\., a newly observed response or an extracted strategy\) is introduced, we determine its community assignment by evaluating the resulting change in network modularity \(Blondel et al\., [2008](https://arxiv.org/html/2604.18976#bib.bib6)\):

$$
\Delta M=\underbrace{\max_{c\in\mathcal{C}}\Delta Q(n,c)}_{\text{join existing community}}-\lambda\underbrace{\Delta Q(n,c_{\text{new}})}_{\text{form new community}} \tag{13}
$$

The first term quantifies the maximum modularity gain achievable by assigning the new node $n$ to an existing community $c\in\mathcal{C}$, whereas the second term captures the modularity change induced by instantiating a new singleton community $c_{\text{new}}$\. When $\Delta M<0$, the framework instantiates a new community; otherwise, the node is absorbed into the most compatible existing community\. The hyperparameter $\lambda$ serves as a regularization coefficient that modulates the bias toward merging into existing communities\. The complete derivation and implementation details are provided in Appendix [J](https://arxiv.org/html/2604.18976#A10)\.
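The decision rule in Eq\. \(13\) can be sketched as below\. The sketch assumes the modularity changes have already been computed elsewhere \(their derivation is in Appendix J\); `assign_community`, `gain_existing`, and `gain_new` are hypothetical names, and returning `None` to signal a new community is our convention\.

```python
from typing import Dict, Hashable, Optional


def assign_community(
    gain_existing: Dict[Hashable, float],
    gain_new: float,
    lam: float,
) -> Optional[Hashable]:
    """Apply Eq. (13): return the community to join, or None to open a new one.

    gain_existing maps each existing community c to Delta Q(n, c);
    gain_new is Delta Q(n, c_new); lam is the regularization coefficient.
    """
    if not gain_existing:
        return None  # nothing to join yet
    best = max(gain_existing, key=gain_existing.get)
    delta_m = gain_existing[best] - lam * gain_new
    return None if delta_m < 0 else best
```

For a positive `gain_new`, lowering `lam` shrinks the penalty term and therefore makes merging into an existing community more likely, which matches the role the text assigns to $\lambda$\.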
As a preliminary validation of the effectiveness of this expansion mechanism, we compare the default static configuration against the dynamic variant under identical conditions on HarmBench with Llama\-2\-7b\-chat as the target\. Table [4](https://arxiv.org/html/2604.18976#S4.T4) reports both the Attack Success Rate \(ASR\) and the average number of attack trials per seed \($\bar{N}(\mathcal{A})$; lower is better\)\.

| Method | ASR \(%\) | $\bar{N}(\mathcal{A})$ |
| --- | --- | --- |
| w/o Dynamic Network | 71\.0 | 61\.1 |
| w/ Dynamic Network | 77\.3 | 52\.4 |

Table 4: Effect of Dynamic Network Expansion on HarmBench with Llama\-2\-7b\-chat\. The dynamic variant yields both higher ASR and fewer attack trials per seed\.

The results demonstrate a dual improvement: enabling dynamic expansion raises ASR by 6\.3 percentage points \(71\.0% to 77\.3%\) while simultaneously reducing the average number of attack trials from 61\.1 to 52\.4\. This efficiency gain is notable because higher success rates typically require *more*, not fewer, iterations; the fact that dynamic expansion achieves both suggests that the mechanism effectively assimilates novel attack patterns that the static network cannot capture, converting what would have been failed iterations into successful ones\. In particular, the ability to instantiate new communities at runtime allows STAR\-Teaming to adapt to target\-specific defense behaviors that emerge only after deployment, rather than being constrained by the initial warm\-up logs\.
## 5 Conclusion
We demonstrate that by sampling optimal strategies, both the attack generation speed and the attack success rate can be significantly increased\. Moreover, a modularity\-guided dynamic expansion mechanism allows the underlying network to evolve alongside the attack process, further enhancing adaptability without compromising efficiency\. We believe that STAR\-Teaming contributes to AI safety by preemptively identifying potential risks in LLM development\. Future work will involve extending STAR\-Teaming to specialized red teaming for vision and multimodal domains\.
## Limitations
Prompt Engineering\. Since each agent in STAR\-Teaming is an LLM, the overall performance of the framework is intrinsically dependent on the capabilities of the underlying LLMs\. For example, the attacker must be proficient at generating creative and potent attacks, the scorer needs to make accurate judgments, and the strategy extractor must effectively extract strategies\. Consequently, improving the agents' performance relies on prompt engineering, which can demand significant human effort and time\.
Community Drift over Extended Deployment\. Although our Dynamic Network Expansion mechanism \(Section [4\.6](https://arxiv.org/html/2604.18976#S4.SS6)\) enables the network to incorporate new strategies and responses at runtime, the community centroids themselves are not retroactively re\-optimized\. Over sufficiently long deployment horizons, this may lead to gradual concept drift, where early centroids become less representative of the evolving attack\-defense landscape\. Periodic re\-initialization or a fully online community re\-detection scheme remains an open direction for future work\.
Reliance on Scorer Agent\. The framework’s effectiveness depends on the performance of a single scorer agent, a potential vulnerability shared by most recent automated red\-teaming systems that use an “LLM\-as\-a\-judge” approach\. The reliability of the entire system hinges on the scorer’s ability to provide accurate and consistent judgments\. To assess this concern, we empirically investigate an ensemble configuration that aggregates three heterogeneous LLM scorers; the detailed analysis in Appendix [L\.1](https://arxiv.org/html/2604.18976#A12.SS1) shows that while ensembling offers marginal gains in human agreement, the single\-scorer configuration remains preferable under realistic cost constraints\. Periodic human\-in\-the\-loop calibration remains a complementary direction for future work\.
## References
- Achiam et al\. \(2023\) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others\. 2023\. Gpt\-4 technical report\. *arXiv preprint arXiv:2303\.08774*\.
- AI \(2024\) Meta AI\. 2024\. Llama 3 technical report\. [https://llama\.meta\.com/llama3](https://llama.meta.com/llama3)\. Accessed: 2025\-05\-18\.
- Anil et al\. \(2024a\) Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, and 1 others\. 2024a\. Many\-shot jailbreaking\. *Advances in Neural Information Processing Systems*, 37:129696–129742\.
- Anil et al\. \(2024b\) Rohan Anil, Orpaz Goldstein, Yi Tay, Slav Petrov, Wenhan Xiong, Hyung Won Chung, Zhen Qin, Mostafa Dehghani, Aakanksha Chowdhery, Daphne Ippolito, Xuezhi Wang, Jiahui Yu, Jinsung Yoon, Hanxiao Liu, Alex Ku, Barret Zoph, William Fedus, Markus Freitag, Sebastian Gehrmann, and 8 others\. 2024b\. [Gemma: Open models based on gemini](https://arxiv.org/abs/2402.17764)\. *Preprint*, arXiv:2402\.17764\.
- Anthropic \(2024\) Anthropic\. 2024\. The claude 3 model family: Opus, sonnet, haiku\. [https://www\.anthropic\.com/claude\-3\-model\-card](https://www.anthropic.com/claude-3-model-card)\.
- Blondel et al\. \(2008\) Vincent D Blondel, Jean\-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre\. 2008\. [Fast unfolding of communities in large networks](https://doi.org/10.1088/1742-5468/2008/10/p10008)\. *Journal of Statistical Mechanics: Theory and Experiment*, 2008\(10\):P10008\.
- Chao et al\. \(2023\) Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong\. 2023\. Jailbreaking black box large language models in twenty queries\. *arXiv preprint arXiv:2310\.08419*\.
- Costa\-Jussà et al\. \(2022\) Marta R Costa\-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, and 1 others\. 2022\. No language left behind: Scaling human\-centered machine translation\. *arXiv preprint arXiv:2207\.04672*\.
- Grattafiori et al\. \(2024\) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others\. 2024\. [The llama 3 herd of models](https://arxiv.org/abs/2407.21783)\. *Preprint*, arXiv:2407\.21783\.
- Guo et al\. \(2024\) Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, and Bin Hu\. 2024\. Cold\-attack: Jailbreaking llms with stealthiness and controllability\. *arXiv preprint arXiv:2402\.08679*\.
- Han et al\. \(2024\) Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri\. 2024\. [Wildguard: Open one\-stop moderation tools for safety risks, jailbreaks, and refusals of llms](https://arxiv.org/abs/2406.18495)\. *Preprint*, arXiv:2406\.18495\.
- Hurst et al\. \(2024\) Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others\. 2024\. Gpt\-4o system card\. *arXiv preprint arXiv:2410\.21276*\.
- Jaynes \(1957\) E\. T\. Jaynes\. 1957\. [Information theory and statistical mechanics](https://doi.org/10.1103/PhysRev.106.620)\. *Phys\. Rev\.*, 106:620–630\.
- Ji et al\. \(2023\) Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang\. 2023\. [Beavertails: Towards improved safety alignment of llm via a human\-preference dataset](https://arxiv.org/abs/2307.04657)\. *Preprint*, arXiv:2307\.04657\.
- Jin et al\. \(2024\) Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, and Haohan Wang\. 2024\. Guard: Role\-playing to generate natural\-language jailbreakings to test guideline adherence of large language models\. *arXiv preprint arXiv:2402\.03299*\.
- Kivelä et al\. \(2014\) Mikko Kivelä, Alex Arenas, Marc Barthelemy, James P Gleeson, Yamir Moreno, and Mason A Porter\. 2014\. Multilayer networks\. *Journal of complex networks*, 2\(3\):203–271\.
- Li et al\. \(2023\) Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang\. 2023\. Chatdoctor: A medical chat model fine\-tuned on a large language model meta\-ai \(llama\) using medical domain knowledge\. *Cureus*, 15\(6\)\.
- Liao and Sun \(2024\) Zeyi Liao and Huan Sun\. 2024\. Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms\. *arXiv preprint arXiv:2404\.07921*\.
- Lin et al\. \(2025\) Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin, and 1 others\. 2025\. Against the achilles’ heel: A survey on red teaming for generative models\. *Journal of Artificial Intelligence Research*, 82:687–775\.
- Liu et al\. \(2024\) Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao\. 2024\. Autodan\-turbo: A lifelong agent for strategy self\-exploration to jailbreak llms\. *arXiv preprint arXiv:2410\.05295*\.
- Liu et al\. \(2023\) Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao\. 2023\. Autodan: Generating stealthy jailbreak prompts on aligned large language models\. *arXiv preprint arXiv:2310\.04451*\.
- Magnani et al\. \(2021\) Matteo Magnani, Obaida Hanteer, Roberto Interdonato, Luca Rossi, and Andrea Tagarelli\. 2021\. Community detection in multiplex networks\. *ACM Computing Surveys \(CSUR\)*, 54\(3\):1–35\.
- Mazeika et al\. \(2024\) Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, and 1 others\. 2024\. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal\. *arXiv preprint arXiv:2402\.04249*\.
- Mehrotra et al\. \(2024\) Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi\. 2024\. Tree of attacks: Jailbreaking black\-box llms automatically\. *Advances in Neural Information Processing Systems*, 37:61065–61105\.
- Nguyen et al\. \(2017\) H Chau Nguyen, Riccardo Zecchina, and Johannes Berg\. 2017\. Inverse statistical problems: from the inverse ising problem to data science\. *Advances in Physics*, 66\(3\):197–261\.
- Samvelyan et al\. \(2024\) Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker\-Holder, Jakob Foerster, and 1 others\. 2024\. Rainbow teaming: Open\-ended generation of diverse adversarial prompts\. *Advances in Neural Information Processing Systems*, 37:69747–69786\.
- Schick et al\. \(2023\) Timo Schick, Jane Dwivedi\-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\. 2023\. Toolformer: Language models can teach themselves to use tools\. *Advances in Neural Information Processing Systems*, 36:68539–68551\.
- Shen et al\. \(2024\) Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang\. 2024\. "do anything now": Characterizing and evaluating in\-the\-wild jailbreak prompts on large language models\. In *Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security*, pages 1671–1685\.
- Singhal et al\. \(2023\) Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole\-Lewis, Stephen Pfohl, and 1 others\. 2023\. Large language models encode clinical knowledge\. *Nature*, 620\(7972\):172–180\.
- Souly et al\. \(2024\) Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and 1 others\. 2024\. A strongreject for empty jailbreaks\. *arXiv preprint arXiv:2402\.10260*\.
- Team et al\. \(2023\) Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean\-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, and 1 others\. 2023\. Gemini: a family of highly capable multimodal models\. *arXiv preprint arXiv:2312\.11805*\.
- Team et al\. \(2025\) Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others\. 2025\. [Gemma 3 technical report](https://arxiv.org/abs/2503.19786)\. *Preprint*, arXiv:2503\.19786\.
- Touvron et al\. \(2023\) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others\. 2023\. Llama 2: Open foundation and fine\-tuned chat models\. *arXiv preprint arXiv:2307\.09288*\.
- Traag et al\. \(2019\) Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck\. 2019\. From louvain to leiden: guaranteeing well\-connected communities\. *Scientific reports*, 9\(1\):1–12\.
- Wei et al\. \(2023\) Alexander Wei, Nika Haghtalab, and Jacob Steinhardt\. 2023\. Jailbroken: How does llm safety training fail? *Advances in Neural Information Processing Systems*, 36:80079–80110\.
- Yang et al\. \(2025\) An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others\. 2025\. [Qwen3 technical report](https://arxiv.org/abs/2505.09388)\. *Preprint*, arXiv:2505\.09388\.
- Zeng et al\. \(2024\) Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi\. 2024\. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 14322–14350\.
- Zou et al\. \(2023\) Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson\. 2023\. Universal and transferable adversarial attacks on aligned language models\. *arXiv preprint arXiv:2307\.15043*\.
## Appendix A Implementation Details
To construct the strategy network, we employ gpt\-4o\-mini as the strategy extractor and text\-embedding\-3\-small as the embedding model, using warm\-up attack logs for 50 seed prompts that we selected from AdvBench Zou et al\. \([2023](https://arxiv.org/html/2604.18976#bib.bib38)\)\. This allowed us to extract 500 response\-strategy nodes on a limited budget, costing under \$1\.
The Leiden algorithm is used for community detection\. The resulting community counts were 50 for response communities and 15 for strategy communities, with parameters $\alpha_{r}=0.85$ and $\alpha_{stg}=0.9$\. To determine the central node of each community, we extract each community as a subgraph and select the node with the highest degree centrality within it\. Consequently, the mapping matrix \(which holds the parameters of the probabilistic model\) has 50 × 15 = 750 entries\. Because the matrix is this small, its training time is negligible\. The learning rate is set to $\text{lr}=0.5$, and the parameter $\beta$ is dynamically optimized so that the three highest\-probability strategies collectively account for 80% of the probability mass\.
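One way to realize the $\beta$ calibration described above is a bisection search, sketched below\. How $\beta$ enters the sampling distribution is not shown in this appendix, so the sketch assumes a softmax of the form $p_i \propto \exp(\beta x_i)$ over per\-strategy scores $x_i$; the function names and the search bounds are likewise hypothetical\.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], beta: float) -> List[float]:
    """Softmax with inverse temperature beta (assumed form: p_i ~ exp(beta * x_i))."""
    peak = max(scores)
    weights = [math.exp(beta * (s - peak)) for s in scores]  # shift avoids overflow
    total = sum(weights)
    return [w / total for w in weights]


def top_k_mass(scores: Sequence[float], beta: float, k: int = 3) -> float:
    """Probability mass held by the k most likely strategies."""
    return sum(sorted(softmax(scores, beta), reverse=True)[:k])


def calibrate_beta(
    scores: Sequence[float],
    target: float = 0.8,
    k: int = 3,
    hi: float = 1e3,
    tol: float = 1e-9,
) -> float:
    """Bisection on beta; top-k mass is non-decreasing in beta for beta >= 0."""
    lo = 0.0
    if top_k_mass(scores, lo, k) >= target:
        return lo
    if top_k_mass(scores, hi, k) < target:
        raise ValueError("target mass not reachable within the search range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if top_k_mass(scores, mid, k) < target:
            lo = mid
        else:
            hi = mid
    return hi
```

With 15 strategy communities, `calibrate_beta(scores)` returns the smallest $\beta$ in the search range for which the top three strategies hold at least 80% of the mass; when many scores are tied the target may be unreachable, which the sketch reports as an error\.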
## Appendix B Derivation of the Update Rule
To obtain Equation [9](https://arxiv.org/html/2604.18976#S3.E9), we first need to explicitly determine $\log L(Z^{t})$\.

$$
\begin{aligned}
\log L(Z)
&=\sum_{(r_{p},s_{q})\in D}\log p(E(r_{p},s_{q})\mid Z)\\
&=\sum_{(r_{p},s_{q})\in D}\log\frac{e^{-E(r_{p},s_{q})}}{\mathcal{Z}(Z)}\\
&=\sum_{(r_{p},s_{q})\in D}\log\left(\frac{\exp\!\left(\sum_{j=1}^{N_{J}}\sum_{i=1}^{N_{I}}Z_{ij}\mathbf{O}^{ij}_{pq}\right)}{\mathcal{Z}(Z)}\right)\\
&=\sum_{(r_{p},s_{q})\in D}\sum_{ij}Z_{ij}\mathbf{O}^{ij}_{pq}-\sum_{(r_{p},s_{q})\in D}\log\mathcal{Z}(Z)\\
&=\sum_{ij}\sum_{(r_{p},s_{q})\in D}Z_{ij}\mathbf{O}^{ij}_{pq}-\sum_{(r_{p},s_{q})\in D}\log\mathcal{Z}(Z)\\
&=\sum_{ij}Z_{ij}\sum_{(r_{p},s_{q})\in D}\mathbf{O}^{ij}_{pq}-M\log\mathcal{Z}(Z)
\end{aligned}
$$

Next, we provide the derivative expansion of $\log L$ with respect to $Z$ for gradient ascent\.

$$
\begin{aligned}
\frac{\partial\log L}{\partial Z_{ij}}
&=\frac{\partial}{\partial Z_{ij}}\sum_{ij}Z_{ij}\sum_{(r_{p},s_{q})\in D}\mathbf{O}^{ij}_{pq}-N_{D}\frac{\partial\log\mathcal{Z}(Z)}{\partial Z_{ij}}\\
&=\sum_{(r_{p},s_{q})\in D}\mathbf{O}^{ij}_{pq}-\frac{N_{D}}{\mathcal{Z}(Z)}\frac{\partial\mathcal{Z}(Z)}{\partial Z_{ij}}\\
&=\sum_{(r_{p},s_{q})\in D}\mathbf{O}^{ij}_{pq}-\frac{N_{D}}{\mathcal{Z}(Z)}\frac{\partial}{\partial Z_{ij}}\sum_{(r_{p},s_{q})\in\sigma}\exp(-E(r_{p},s_{q}))\\
&=\sum_{(r_{p},s_{q})\in D}\mathbf{O}^{ij}_{pq}-\frac{N_{D}}{\mathcal{Z}(Z)}\sum_{(r_{p},s_{q})\in\sigma}\frac{\partial}{\partial Z_{ij}}\exp(-E(r_{p},s_{q}))\\
&=\sum_{(r_{p},s_{q})\in D}\mathbf{O}^{ij}_{pq}-\frac{N_{D}}{\mathcal{Z}(Z)}\sum_{(r_{p},s_{q})\in\sigma}\exp(-E(r_{p},s_{q}))\,\mathbf{O}^{ij}_{pq}\\
&=\sum_{(r_{p},s_{q})\in D}\mathbf{O}^{ij}_{pq}-N_{D}\sum_{(r_{p},s_{q})\in\sigma}p(r_{p},s_{q}\mid Z)\,\mathbf{O}^{ij}_{pq}\\
&=\sum_{(r_{p},s_{q})\in D}\mathbf{O}^{ij}_{pq}-N_{D}\left\langle\mathbf{O}^{ij}_{pq}\right\rangle.
\end{aligned}
$$

Note that here, $\sigma$ refers to the entire configuration space, not just the data space, and $N_{D}=|D|$ is the number of observed pairs, i\.e\., the same quantity written as $M$ in the expansion of $\log L(Z)$\. The factor $N_{D}$ multiplies the $\log\mathcal{Z}(Z)$ term at every step because the partition function appears once per observed pair\.
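The resulting update can be illustrated with a small gradient\-ascent loop\. The sketch makes two assumptions that are not stated in this appendix: the observables are indicators of the \(response community, strategy community\) pair, so the model reduces to a softmax over the entries of $Z$, and the gradient is divided by the number of observations so that a learning rate of 0\.5 stays stable\. The names `fit_mapping` and `model_probs` are hypothetical\.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def model_probs(Z: List[List[float]]) -> List[List[float]]:
    """p(i, j | Z) = exp(Z_ij) / sum_kl exp(Z_kl), assuming indicator observables."""
    peak = max(max(row) for row in Z)
    weights = [[math.exp(z - peak) for z in row] for row in Z]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def fit_mapping(
    pairs: Sequence[Tuple[int, int]],
    n_resp: int,
    n_strat: int,
    lr: float = 0.5,
    steps: int = 500,
) -> List[List[float]]:
    """Gradient ascent on the log-likelihood of observed community pairs."""
    counts = Counter(pairs)
    n_data = len(pairs)
    Z = [[0.0] * n_strat for _ in range(n_resp)]
    for _ in range(steps):
        P = model_probs(Z)
        for i in range(n_resp):
            for j in range(n_strat):
                # (empirical count - N_D * model expectation), divided by N_D
                grad = counts[(i, j)] / n_data - P[i][j]
                Z[i][j] += lr * grad
    return Z
```

At convergence the model probabilities match the empirical pair frequencies, which is the moment\-matching condition obtained by setting the gradient above to zero\.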
## Appendix C Theoretical Justification
### C\.1 Boltzmann Distribution and Maximum Entropy
The choice of the Boltzmann distribution in our framework is grounded in the Principle of Maximum Entropy Jaynes \([1957](https://arxiv.org/html/2604.18976#bib.bib13)\)\. Our objective is to infer a probability distribution that matches the observed interactions \(correlations\) between strategy communities and response patterns, while making minimal assumptions about unobserved data\. As detailed in the review by Nguyen et al\. \([2017](https://arxiv.org/html/2604.18976#bib.bib25)\), the distribution that maximizes Shannon entropy under constraints on the first\- and second\-order moments is uniquely the Boltzmann distribution\. Thus, it serves as the statistically optimal model for capturing these interactions without introducing arbitrary bias\.
### C\.2 Proof of Convexity
Let $\ell(Z):=-\log L(Z\mid D)$ denote the negative log\-likelihood of the Boltzmann distribution\. The convexity of our optimization objective can be established by analyzing the Hessian of $\ell(Z)$\. The gradient of $\ell(Z)$ with respect to the interaction parameter $Z_{ij}$ is the difference between the model’s expected observables and the empirical values\.
Differentiating $\ell(Z)$ once more yields the Hessian matrix:

$$
\frac{\partial^{2}\ell(Z)}{\partial Z_{ij}\partial Z_{kl}}=\langle\mathbf{O}^{ij}\mathbf{O}^{kl}\rangle-\langle\mathbf{O}^{ij}\rangle\langle\mathbf{O}^{kl}\rangle \tag{14}
$$

where $\langle\cdot\rangle$ denotes the expectation under the model distribution\. This expression is exactly the covariance matrix of the observables under the model distribution\. Since covariance matrices are always positive semi\-definite, $\ell(Z)$ is convex; it is strictly convex whenever the covariance matrix is positive definite, i\.e\., when no non\-trivial linear combination of the observables is constant\. Convexity ensures that our gradient\-based optimization \(Inverse Ising Problem\) has no spurious local minima, so every stationary point it reaches is a global optimum, and this optimum is unique in the strictly convex case Nguyen et al\. \([2017](https://arxiv.org/html/2604.18976#bib.bib25)\)\.
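The positive semi\-definiteness used above follows from a one\-line identity, written here in the notation of Eq\. \(14\) as a worked step; the coefficients $v_{ij}$ are arbitrary real numbers introduced only for this illustration\.

```latex
\sum_{ij}\sum_{kl} v_{ij}\, v_{kl}\,
\frac{\partial^{2}\ell(Z)}{\partial Z_{ij}\,\partial Z_{kl}}
= \Big\langle \Big(\sum_{ij} v_{ij}\mathbf{O}^{ij}\Big)^{2} \Big\rangle
- \Big\langle \sum_{ij} v_{ij}\mathbf{O}^{ij} \Big\rangle^{2}
= \operatorname{Var}\Big(\sum_{ij} v_{ij}\mathbf{O}^{ij}\Big) \;\ge\; 0
```

Equality holds exactly when the combination $\sum_{ij} v_{ij}\mathbf{O}^{ij}$ is constant under the model distribution\. For example, if the observables are indicators that always sum to one, the choice $v_{ij}=1$ gives zero variance, so the Hessian has a flat direction along which $Z$ can be shifted without changing the distribution\.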
## Appendix D Computational Resource Requirement
The memory requirements of the STAR\-Teaming framework depend on the memory consumption of the target Large Language Model \(LLM\)\. For instance, operating a 70\-billion\-parameter LLM requires 160GB of VRAM, whereas models in the 7 to 13\-billion\-parameter range require 80GB of VRAM\. All computations associated with the Multiplex Network run on the CPU and impose no additional demand on GPU resources\. In our experiments, constructing the Multiplex Network took 0\.37 seconds, optimizing the mapping matrix via Maximum Likelihood Estimation \(MLE\) took 0\.02 seconds, and each strategy sampling instance took approximately 0\.1 seconds\. Consequently, the computational overhead introduced by the Multiplex Network is negligible\.
## Appendix E Pseudo Code
STAR\-Teaming comprises two primary algorithmic components: \(1\) Multiplex Network Initialization and Mapping Network Optimization, and \(2\) Probabilistic Strategy Sampling\. To facilitate reproducibility and provide a granular view of our implementation, we present the pseudocode for these core components\. Algorithm 1 details the process of constructing the multiplex network and optimizing the mapping matrix through the Inverse Ising Problem formulation\. Subsequently, Algorithm 2 outlines the dynamic strategy retrieval mechanism, which employs adaptive temperature scheduling to balance exploration and exploitation\.
Algorithm 1 Multiplex Network Initialization and Mapping Optimization
1: Input: Attack logs $D$, Thresholds $\alpha$, Learning rate $\eta$
2: Output: Network structures $\mathcal{G}$, Mapping matrices $\mathcal{Z}$
3: Initialize empty structures for graph $\mathcal{G}$ and centers $C$
4: for each column $c\in\{\text{Response}, \text{Strategy}\}$ do
5: $E_{c}\leftarrow$ ExtractEmbeddings\($D[c]$\)
6: $G_{c}\leftarrow$ BuildGraph\($E_{c},\alpha_{c}$\)
7: $\text{Comm}_{c}\leftarrow$ DetectCommunities\($G_{c}$\) $\triangleright$ Leiden Algorithm
8: $C_{c}\leftarrow$ FindCentralNodes\($G_{c},\text{Comm}_{c}$\)
9: Store $G_{c},\text{Comm}_{c},C_{c}$ into $\mathcal{G}$
10: end for
11: Optimization \(Inverse Ising Problem\):
12: $N\leftarrow|C_{\text{Response}}|$, $M\leftarrow|C_{\text{Strategy}}|$
13: $Data_{enc}\leftarrow$ IndexEncode\($D$, $\text{Comm}_{\text{Response}}$, $\text{Comm}_{\text{Strategy}}$\)
14: Initialize mapping matrix $Z\in\mathbb{R}^{N\times M}$
15: while not converged do
16: Compute gradient $\nabla\mathcal{L}(Z)$ based on Eq\. \(9\)
17: $Z\leftarrow Z+\eta\cdot\nabla\mathcal{L}(Z)$
18: end while
19: return $\mathcal{G},Z$
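The network\-construction half of Algorithm 1 can be sketched in a few lines\. This is a minimal illustration under stated assumptions: edges are assumed to connect nodes whose cosine similarity is at least $\alpha$, connected components are used as a simple stand\-in for the Leiden algorithm, and the two\-dimensional embeddings are made\-up toy values\. Function names mirror the pseudocode but are otherwise hypothetical\.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def build_graph(embeddings, alpha):
    """Adjacency sets; an edge links nodes with cosine similarity >= alpha (assumed rule)."""
    adj = {i: set() for i in range(len(embeddings))}
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= alpha:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def detect_communities(adj):
    """Stand-in for Leiden: connected components found by depth-first search."""
    seen, communities = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        communities.append(sorted(members))
    return communities

def find_central_nodes(adj, communities):
    """Highest-degree node of each community; ties go to the lowest index."""
    return [max(c, key=lambda n: (len(adj[n]), -n)) for c in communities]

# Toy 2-D embeddings (illustrative only).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]]
graph = build_graph(embeddings, alpha=0.95)
communities = detect_communities(graph)
centers = find_central_nodes(graph, communities)
```

On the toy data, nodes 0 to 2 form one community with node 1 as its center, and nodes 3 and 4 form a second one\. A real run would replace the component search with an actual Leiden implementation\.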
Algorithm 2 Probabilistic Strategy Sampling with Adaptive Temperature
1: Input: Current Response $r$, Mapping Matrix $Z$, Initial $\beta$, Centers $C_{stg}$
2: Output: Selected Strategy $s^{*}$
3: $j\leftarrow$ Index of community for $r$
4: $M\leftarrow$ Total number of strategy communities
5: Define Energy $E(j,k)=-\sum_{ij}Z_{ij}O_{pq}^{ij}$ for all $k\in\{1\dots M\}$
6: $E_{stable}\leftarrow$ Compute energy vector for all strategy candidates
7: Adaptive Temperature Scheduling \(Top\-3 Logic\):
8: loop
9: $P\propto\exp(-\beta\cdot E_{stable})$
10: $\alpha\leftarrow\sum\text{Top3}(P)$
11: if $|\alpha-0.8|<0.1$ or max\_iter reached then
12: break
13: else
14: Update $\beta$ to adjust distribution sharpness
15: end if
16: end loop
17: Sample strategy index $k^{*}\sim P$
18: Retrieve center node text $s^{*}\leftarrow C_{stg}[k^{*}]$
19: return $s^{*}$
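The temperature loop of Algorithm 2 can be illustrated as follows\. The pseudocode does not specify how $\beta$ is updated, so the multiplicative rule below \(sharpen when the top\-3 mass is too low, flatten when it is too high\) is an assumption, as are the function names, the default step size, and the toy energy vector\.

```python
import math
import random

def boltzmann(energies, beta):
    """P(k) proportional to exp(-beta * E_k), computed in a numerically stable way."""
    shifted = [-beta * e for e in energies]
    top = max(shifted)
    weights = [math.exp(s - top) for s in shifted]
    total = sum(weights)
    return [w / total for w in weights]

def top3_mass(probs):
    """Total probability carried by the three most likely candidates."""
    return sum(sorted(probs, reverse=True)[:3])

def sample_strategy(energies, beta=0.1, target=0.8, tol=0.1,
                    step=1.1, max_iter=200, rng=random):
    """Tune beta until the top-3 mass is within tol of target, then sample an index."""
    probs = boltzmann(energies, beta)
    for _ in range(max_iter):
        mass = top3_mass(probs)
        if abs(mass - target) < tol:
            break
        # Assumed update rule: a larger beta sharpens the distribution.
        beta = beta * step if mass < target else beta / step
        probs = boltzmann(energies, beta)
    index = rng.choices(range(len(energies)), weights=probs, k=1)[0]
    return index, beta, top3_mass(probs)

# Toy energy vector over eight strategy communities (illustrative only).
energies = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.3, 0.5, 1.0]
index, beta, mass = sample_strategy(energies, rng=random.Random(0))
```

Starting from a nearly flat distribution, the loop raises $\beta$ until roughly 80% of the probability sits on the three lowest\-energy communities, which keeps sampling focused without making it deterministic\.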
## Appendix F Strategy Exploration and Refinement
To ensure broad strategy exploration, we employ a staged \(tempering\) approach that divides the attack process for each seed into at most three stages\. The initial 20 attacks are conducted without any specific strategy\. Following this, strategic attacks are iteratively performed up to 100 times\. If the attack score still does not exceed 8\.5 at this point, the process enters an exploration phase\.
To mitigate the risk of converging to a suboptimal set of strategies and to overcome the cold\-start problem, our framework incorporates a dedicated exploration phase\. If an attack on a given seed fails to achieve a success threshold after a set number of iterations \(e\.g\., 100 attempts\), the system prompts the attacker LLM to generate a novel attack\. The prompt explicitly instructs the model to devise a creative approach while avoiding the previously failed strategies: “The strategies you have attempted so far \(A, B, C, …\) have been unsuccessful\. Please devise a creative new attack, avoiding these previous strategies\.” If this new, exploratory attack succeeds, the LLM\-based strategy extractor is then employed to analyze the successful log and distill the novel strategy, which is then integrated into our framework\. This ensures a continuous expansion of the strategic repertoire\.
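The staged schedule described above can be summarized as a small control loop\. This is only a sketch: `attack` is a hypothetical callable standing in for one attacker/target/scorer round, success is taken to be a score of at least 8\.5, and the exploration budget of 30 attempts is an assumption, since the text here does not state one\.

```python
def run_seed(attack, warmup=20, strategic=100, exploratory=30, threshold=8.5):
    """Run up to three stages for one seed and report where the attack succeeded.

    `attack(stage, attempt)` is a hypothetical callable returning a score in [1, 10].
    `exploratory=30` is an assumed budget, not a value stated in this appendix.
    """
    stages = [("warmup", warmup), ("strategic", strategic), ("exploration", exploratory)]
    attempts = 0
    for stage, budget in stages:
        for _ in range(budget):
            attempts += 1
            # Stop as soon as the scorer reports a successful jailbreak.
            if attack(stage, attempts) >= threshold:
                return stage, attempts
    return None, attempts

def never_succeeds(stage, attempt):
    """Stub attack that always receives a low score."""
    return 1.0

def succeeds_when_exploring(stage, attempt):
    """Stub attack that only succeeds once the exploration stage is reached."""
    return 9.0 if stage == "exploration" else 3.0

failed = run_seed(never_succeeds)
explored = run_seed(succeeds_when_exploring)
```

With the stub that only succeeds during exploration, the first success is recorded at attempt 121, i\.e\. immediately after the 20 warm\-up and 100 strategic attempts are exhausted\.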
The LLM\-based strategy extractor is a critical component for dynamically identifying and cataloging effective attack patterns\. The extraction process is initiated during the warm\-up phase using pre\-defined seed prompts\. For a given seed, we identify two attack logs, i and j, that yielded different scores\. The extractor, an LLM itself, is then prompted to perform a comparative analysis with the following instruction: “Given the two attack attempts i and j with their respective prompts, responses, and scores, explain why attack j was more successful than attack i from a strategic perspective\. Based on this analysis, extract the strategy employed in attack j and provide its name and a concise definition\.” This method allows the system to build a rich, dynamically updated database of named strategies, which forms the basis for constructing the Strategy Network\.


Figure 6: Illustration of Response Network with \(UP\) $\alpha_{r}=0.75$ and \(DOWN\) $\alpha_{r}=0.85$\.

Figure 7: Illustration of Strategy Network with \(UP\) $\alpha_{stg}=0.85$ and \(DOWN\) $\alpha_{stg}=0.9$\.

Figure 8: Number of communities \(Top\), average degree \(Middle\), and clustering coefficient \(Bottom\) as a function of the hyperparameters $\alpha_{stg}$ and $\alpha_{r}$\.
## Appendix G Network Analysis
Figures [6](https://arxiv.org/html/2604.18976#A6.F6) and [7](https://arxiv.org/html/2604.18976#A6.F7) visualize the multiplex network constructed in this study, while Figure [8](https://arxiv.org/html/2604.18976#A6.F8) provides a quantitative analysis of its structural properties\. This analysis reveals key characteristics that validate our community\-based approach\. Figure [6](https://arxiv.org/html/2604.18976#A6.F6) visualizes the Response Network\. Each network visualization consists of 500 nodes, where the large red nodes represent the central node of each community \(determined by degree centrality\)\. The top panel, with a similarity threshold of $\alpha_{r}=0.75$, shows a relatively dense structure composed of a few large communities\. In contrast, the bottom panel, with a higher threshold of $\alpha_{r}=0.85$, illustrates a more fragmented network with a greater number of smaller communities, as weaker edges have been pruned\. The histograms on the right clearly depict this shift in community size distribution\. This demonstrates that by tuning $\alpha_{r}$, we can effectively control the number and granularity of communities, allowing for the systematic management of the search space over LLM response patterns\.
Figure [7](https://arxiv.org/html/2604.18976#A6.F7) presents the Strategy Network, which exhibits markedly different characteristics\. Even at a very high threshold of $\alpha_{stg}=0.9$, the network maintains a highly clustered structure dominated by a few large hubs\. This structure strongly suggests a high degree of semantic redundancy among the strategies generated by our strategy extractor; that is, many strategies with distinct names likely correspond to very similar underlying attack patterns\. If these strategies were treated individually and sampled based on embedding similarity alone, it would lead to inefficient exploration by repeatedly sampling functionally identical strategies\. Therefore, our community\-based approach is a critical mechanism for resolving this redundancy, enabling a more diverse and efficient search for effective attack strategies\.
Figure [8](https://arxiv.org/html/2604.18976#A6.F8) offers a macroscopic, quantitative view of these properties as a function of the threshold $\alpha$\. As expected, increasing $\alpha$ decreases the Average Degree and increases the Number of Communities in both networks\. However, the significant difference in the y\-axis scales confirms that the Strategy Network is inherently denser, supporting the strategy redundancy hypothesis\. The most critical insight comes from the Average Clustering Coefficient\. While both networks exhibit a sharp transition around $\alpha\approx 0.9$, the Strategy Network sustains a high clustering coefficient over a wider range of $\alpha$ values\. This indicates the presence of exceptionally dense and tightly\-knit core communities within the Strategy Network\. Consequently, as the threshold increases, these core clusters do not easily disintegrate\. The observed increase in the number of communities is likely due to peripheral, weakly\-connected nodes breaking away from these core structures, rather than the fragmentation of the cores themselves\. This provides strong network\-theoretic evidence for the existence of redundant strategy groups and underscores the necessity of a community\-detection framework like STAR\-Teaming\.
To quantitatively validate the semantic coherence of the detected communities, we introduce the Intra\-Community Cosine Similarity metric, denoted as $\mathcal{K}$\. This metric measures the average pairwise cosine similarity among node embeddings within a given community, defined as follows:
$$\mathcal{K}=\frac{1}{|\mathcal{G}|}\sum_{C\in\mathcal{G}}\left(\frac{2}{|C|(|C|-1)}\sum_{\{n_{i}<n_{j}\}\subseteq C}\cos(I(n_{i}),I(n_{j}))\right) \tag{15}$$
where $I(\cdot)$ represents the text embedding function \(utilizing text\-embedding\-3\-small from OpenAI\), $\mathcal{G}$ denotes the set of all communities in the network, $C$ represents an individual community, and $n_{i},n_{j}$ denote distinct nodes within that community\.
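Equation 15 translates directly into code\. The sketch below takes precomputed embeddings instead of calling an embedding API, and it skips communities with fewer than two nodes, for which the inner average is undefined; that handling of singletons is an assumption and is not specified in the text\.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def intra_community_similarity(communities, embed):
    """Mean over communities of the mean pairwise cosine similarity (Eq. 15).

    `communities` is a list of lists of node ids; `embed` maps a node id to a vector.
    Communities with fewer than two nodes are skipped (assumed convention).
    """
    per_community = []
    for members in communities:
        pairs = list(combinations(members, 2))
        if not pairs:
            continue
        # 2 / (|C| (|C| - 1)) times the pair sum is the mean over unordered pairs.
        per_community.append(
            sum(cosine(embed[a], embed[b]) for a, b in pairs) / len(pairs))
    if not per_community:
        raise ValueError("no community with at least two nodes")
    return sum(per_community) / len(per_community)

# Toy embeddings (illustrative only): two tight groups pointing in orthogonal directions.
embed = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [0.0, 1.0], "d": [0.0, 3.0]}
coherent = intra_community_similarity([["a", "b"], ["c", "d"]], embed)
shuffled = intra_community_similarity([["a", "c"], ["b", "d"]], embed)
```

Grouping the parallel vectors together gives $\mathcal{K}=1$, whereas a shuffled assignment that pairs orthogonal vectors gives $\mathcal{K}=0$, which is the contrast that the random\-community baseline is designed to expose\.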
| Community Algorithm | Network | $\mathcal{K}$ |
|---|---|---|
| Random Community | Strategy | 0\.778 |
| Random Community | Response | 0\.562 |
| Leiden Algorithm | Strategy | 0\.998 |
| Leiden Algorithm | Response | 0\.882 |

Table 5: Intra Cosine Similarity of Community Algorithms\.

Table [5](https://arxiv.org/html/2604.18976#A7.T5) presents the computed $\mathcal{K}$ values for both the Strategy and Response networks\. To establish a baseline, we compared our Leiden\-based community detection against a Random Community assignment method\. The empirical results indicate that the Leiden algorithm yields significantly higher $\mathcal{K}$ values \(0\.998 and 0\.882\) compared to the random baseline \(0\.778 and 0\.562\)\. This substantial margin confirms that the nodes within the communities identified by STAR\-Teaming exhibit a high degree of semantic homogeneity, effectively grouping similar strategies and responses rather than clustering them arbitrarily\.
Figure 9: Illustration of \(A\) Initial mapping matrix and \(B\) Updated mapping matrix\.
## Appendix H Explanation of Mapping Network
Figure [9](https://arxiv.org/html/2604.18976#A7.F9) visualizes the probability of selecting Strategy community $j$ given Response community $i$, where brighter colors indicate a probability closer to 1 and darker colors represent a probability closer to 0\. The left panel of the figure shows the probabilities in the initial state\. It is observable that, initially, the distribution is population\-centric, with a few strategies accounting for the majority of the sampling probability\. The right panel displays the probabilities in the final updated state\. Although some characteristics of the initial state persist, it is evident that a more diverse range of strategies now has a higher probability of being sampled\.
## Appendix I Ablation Study on Network Composition
In this section, we evaluate the impact of network composition on the performance of STAR\-Teaming\. As detailed in Section 4\.1, the network is constructed from pre\-existing STAR\-Teaming logs\. The quality of this network is therefore contingent upon four primary factors: the attacker model, the target model, the source of warm\-up seeds, and the network size \(i\.e\., the number of nodes\)\. We conducted an ablation study to measure ASR variations by systematically altering these factors\. For each configuration, we first generated corresponding logs, then constructed a network using the strategy extractor, and finally evaluated the resulting ASR\.
For this ablation study, all experiments were conducted on the HarmBench benchmark, using ‘Gemma\-1\.1\-7b\-it‘ as the default attacker model and ‘Llama2\-7b\-chat‘ as the target model\.
| | Attacker | Target | Warm\-up Source | Node Size | ASR |
|---|---|---|---|---|---|
| G1 | gemma\-1\.1\-7b | llama2\-7b | AdvBench | 500 | 71\.0 |
| G2 | gemma\-1\.1\-7b | llama2\-7b | AdvBench | 250 | 73\.0 |
| G3 | gemma\-1\.1\-7b | llama2\-7b | AdvBench | 125 | 72\.5 |
| G4 | gemma\-1\.1\-7b | llama2\-7b | AdvBench | 1500 | 68\.8 |
| G5 | llama2\-7b | llama2\-7b | AdvBench | 500 | 61\.3 |
| G6 | gemma\-1\.1\-7b | llama3\-8b | AdvBench | 500 | 71\.0 |
| G7 | gemma\-1\.1\-7b | llama2\-7b | HarmBench | 500 | 73\.8 |
| G8 | gemma\-1\.1\-7b | llama2\-7b | StrongReject | 500 | 67\.3 |

Table 6: Ablation study on the impact of network composition on ASR\. Each row \(G1\-G8\) represents a network constructed with different components, showing how variations in attacker, target, seed source, and node size affect the final attack success rate\. Each of G2\-G8 differs from the default configuration G1 in a single component \(underlined in the original table\)\.

The results, presented in Table [6](https://arxiv.org/html/2604.18976#A9.T6), indicate the existence of an optimal network size, with peak performance observed at 250 nodes \(G2\)\. This suggests a trade\-off: an excessively large network may introduce noise or complexity into the optimization process, whereas a network that is too small may lack a sufficient diversity of effective strategies\. We hypothesize that the optimal network size is correlated with the complexity of the target dataset\.
Furthermore, our findings reveal that while changing the target model \(from ‘Llama2\-7b’ to ‘Llama3\-8b’ in G6\) left the ASR unchanged at 71\.0, the ASR was highly sensitive to the choice of the attacker model \(61\.3 in G5\)\. We attribute this sensitivity to the nature of the strategy extractor, which distills generalizable attack principles by analyzing and comparing the relative success of different prompts within the logs\.
Regarding the warm\-up source, the highest ASR was achieved when using seeds from HarmBench \(G7\), the same dataset used for testing\. As this configuration could be construed as data leakage, we clarify that ‘AdvBench’ was used as the warm\-up source in all other experiments to ensure a fair and rigorous evaluation\.
## Appendix J Dynamic Network Expansion
While the proposed STAR\-Teaming framework adaptively samples strategies via the mapping matrix $Z$ and the learning rule in Equation [8](https://arxiv.org/html/2604.18976#S3.E8), the underlying network topology itself remains static after the initial construction phase\. In principle, however, the community structure need not be fixed\. In this appendix, we provide a detailed exposition of a modularity\-based mechanism that enables the multiplex network to incrementally expand during the red\-teaming process\. Such structural plasticity is essential for discovering strategies that were absent in the initial warm\-up logs and for adapting to target\-specific defense behaviors that emerge only at runtime\.
### J\.1 Node Insertion Protocol
The expansion mechanism is invoked whenever a new node is introduced into either layer of the multiplex network\. The trigger conditions differ between the two layers:
#### Response Network\.
A new node is added to the Response Network whenever the target model produces a response whose embedding similarity to all existing nodes falls below the construction threshold $\alpha_{r}$\. This ensures that only genuinely novel responses, rather than paraphrases of known responses, trigger structural changes\. The embedding of the new response is computed via the same encoder used during initial network construction, and edges are established according to Equation [1](https://arxiv.org/html/2604.18976#S3.E1)\.
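The novelty test for the Response Network amounts to a maximum\-similarity check\. The following sketch covers only that trigger condition; the function name and the list\-based node store are hypothetical, and edge creation is left out because Equation 1 is not reproduced in this appendix\.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def insert_response(embedding, nodes, alpha_r):
    """Append `embedding` as a new node only if it is dissimilar to every existing node.

    Returns the index of the new node, or None when the response is treated as a
    paraphrase of a known response and no structural change is made.
    """
    if any(cosine(embedding, other) >= alpha_r for other in nodes):
        return None
    nodes.append(embedding)
    return len(nodes) - 1

# Toy 2-D embeddings (illustrative only).
nodes = [[1.0, 0.0]]
paraphrase = insert_response([0.99, 0.05], nodes, alpha_r=0.75)
novel = insert_response([0.0, 1.0], nodes, alpha_r=0.75)
```

Here the near\-duplicate response is rejected, while the orthogonal one is added as node 1\.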
#### Strategy Network\.
New strategy nodes are generated through the exploration mechanism described in Appendix [F](https://arxiv.org/html/2604.18976#A6): when existing strategies repeatedly fail to elicit a successful jailbreak for a given seed, the attacker LLM is prompted to devise a novel attack, and the strategy extractor distills the resulting pattern into a named strategy\. Once synthesized, the new strategy is embedded, inserted into the network, and connected to existing nodes via the same thresholding procedure based on $\alpha_{\text{stg}}$\.
### J\.2 Modularity\-Based Community Assignment
Once a new node $n_{i}$ is inserted, its community membership must be determined\. We formulate this as a local decision problem governed by the change in network modularity \(Blondel et al\., [2008](https://arxiv.org/html/2604.18976#bib.bib6)\), which measures the density of connections within communities relative to what would be expected in a random graph of the same degree sequence\.
The modularity gain when assigning $n_{i}$ to an existing community $c_{j}$ is:
$$\Delta Q(n_{i},c_{j})=\frac{k_{i,j}}{2m}-\frac{\sum_{\text{tot}}\cdot k_{i}}{(2m)^{2}} \tag{16}$$
where $k_{i,j}$ denotes the sum of edge weights between $n_{i}$ and nodes in community $c_{j}$, $k_{i}$ is the degree of $n_{i}$, $\sum_{\text{tot}}$ is the sum of degrees of nodes in $c_{j}$, and $m$ is the total edge weight in the network\. The first term rewards strong connectivity between the new node and the target community, while the second term penalizes assignments that inflate the community’s already\-large degree volume\.
Analogously, the modularity change induced by instantiating a new singleton community $c_{\text{new}}$ for $n_{i}$ is:
$$\Delta Q(n_{i},c_{\text{new}})=-\frac{k_{i}^{2}}{(2m)^{2}} \tag{17}$$
This quantity is always non\-positive, reflecting the structural cost of fragmenting the network into smaller components\.
### J\.3 The Dynamism Coefficient $\lambda$
To balance these two forces in a controllable manner, we combine Equations [16](https://arxiv.org/html/2604.18976#A10.E16) and [17](https://arxiv.org/html/2604.18976#A10.E17) into a single decision criterion parameterized by a dynamism coefficient $\lambda\geq 0$:
$$\Delta M(n_{i};\lambda)=\max_{c_{j}\in\mathcal{C}}\Delta Q(n_{i},c_{j})-\lambda\cdot\Delta Q(n_{i},c_{\text{new}}) \tag{18}$$
The assignment rule is then:
$$n_{i}\in\begin{cases}c_{\text{new}},&\text{if }\Delta M(n_{i};\lambda)<0,\\ \arg\max_{c_{j}\in\mathcal{C}}\Delta Q(n_{i},c_{j}),&\text{otherwise}.\end{cases} \tag{19}$$
The coefficient $\lambda$ admits a clean interpretation\. Since $\Delta Q(n_{i},c_{\text{new}})\leq 0$, the term $-\lambda\cdot\Delta Q(n_{i},c_{\text{new}})$ in Equation [18](https://arxiv.org/html/2604.18976#A10.E18) is non\-negative and acts as a merging bias toward existing communities, whose magnitude is controlled by $\lambda$\. Two limiting regimes are instructive:
- $\lambda\to 0$ \(highly plastic regime\): The merging bias vanishes, and $\Delta M$ reduces to the maximum modularity gain over existing communities\. Whenever no existing community yields a non\-negative gain, $\Delta M$ becomes negative and the new node spawns its own singleton community, making the network structurally adaptive to novelty\.
- $\lambda\to\infty$ \(static regime\): The merging bias dominates, driving $\Delta M$ strongly positive for nearly any incoming node\. Each new node is almost always absorbed into the most compatible existing community, and the network recovers its original static behavior in which no new communities are formed\.

Intermediate values of $\lambda$ therefore trace a continuous spectrum between these two extremes, and selecting an appropriate value is critical for preserving the semantic coherence of communities while admitting genuine novelty\.
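Equations 16 to 19 can be written out as a short decision function\. The sketch below is a direct transcription under the assumption that the degree of the new node equals the sum of its edge weights to the existing communities; the function names, the dictionary\-based inputs, and the numeric example are illustrative\.

```python
def delta_q_existing(k_ij, sigma_tot, k_i, m):
    """Eq. 16: gain from placing node i in an existing community."""
    return k_ij / (2 * m) - sigma_tot * k_i / (2 * m) ** 2

def delta_q_new(k_i, m):
    """Eq. 17: change from opening a singleton community for node i."""
    return -(k_i ** 2) / (2 * m) ** 2

def assign(links, sigma_tot, m, lam):
    """Eqs. 18-19: return (chosen community or "new", Delta M).

    `links` maps a community id to the edge weight between the new node and it;
    `sigma_tot` maps every community id to the sum of degrees of its nodes;
    `m` is the total edge weight and `lam` the dynamism coefficient.
    """
    k_i = sum(links.values())  # assumed: all edges of the new node go to known communities
    gains = {c: delta_q_existing(links.get(c, 0.0), sigma_tot[c], k_i, m)
             for c in sigma_tot}
    best = max(gains, key=gains.get)
    delta_m = gains[best] - lam * delta_q_new(k_i, m)
    return ("new" if delta_m < 0 else best), delta_m

# Worked example with made-up numbers: 2m = 100, one edge to each of A and B.
community, delta_m = assign(links={"A": 1.0, "B": 1.0},
                            sigma_tot={"A": 60.0, "B": 38.0},
                            m=50.0, lam=1.0)
```

In this example the gains are $-0.002$ for A and $0.0024$ for B, and the merging bias adds $0.0004$, so $\Delta M=0.0028$ and the node joins community B\.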
## Appendix K Comparison of Sampling Baselines
To validate the effectiveness of our multiplex network, we compared STAR\-Teaming’s strategy sampling mechanism against two baselines\. The first is a Cosine Similarity\-based Retrieval method, which uses the text\-embedding\-3\-small model to retrieve a strategy associated with the most similar past response, but without any network structure or probabilistic optimization\. The second is a Multi\-Armed Bandit \(MAB\) that employs an epsilon\-greedy algorithm \($\epsilon=0.1$\), using the scorer’s score as the reward signal\.
All comparative experiments were conducted under the same setting as in the main text, using Llama2\-7b\-chat\-hf as the target model to ensure a fair comparison\.
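For reference, an epsilon\-greedy bandit of the kind used as the second baseline can be sketched as follows\. Only $\epsilon=0.1$ and the use of the scorer’s score as reward come from the text; pulling every arm once before acting greedily, the incremental\-mean update, and the deterministic toy rewards are assumptions made for illustration\.

```python
import random

def epsilon_greedy(reward_fn, n_arms, pulls, epsilon=0.1, rng=random):
    """Run an epsilon-greedy bandit and return per-arm pull counts and value estimates."""
    counts = [0] * n_arms
    values = [0.0] * n_arms
    for t in range(pulls):
        if t < n_arms:
            arm = t  # assumed initialization: try every strategy once
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore a uniformly random strategy
        else:
            arm = max(range(n_arms), key=values.__getitem__)  # exploit the best estimate
        reward = reward_fn(arm)
        counts[arm] += 1
        # Incremental update of the running mean reward of the chosen arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

# Toy rewards standing in for scorer outputs of three strategies (illustrative only).
toy_scores = [2.0, 5.0, 9.0]
counts, values = epsilon_greedy(lambda arm: toy_scores[arm], n_arms=3,
                                pulls=500, rng=random.Random(0))
```

Unlike the multiplex network, this baseline keeps one estimate per strategy and ignores which response community the current response belongs to\.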
| Sampling Method | ASR \(%\) |
|---|---|
| Cosine Similarity | 65\.0 |
| Multi\-Armed\-Bandit | 69\.0 |
| STAR\-Teaming | 71\.0 |

Table 7: Comparison of ASR on Llama2\-7b\-chat\-hf across different strategy sampling methods\.

As shown in Table [7](https://arxiv.org/html/2604.18976#A11.T7), STAR\-Teaming achieved the highest performance with an ASR of 71\.0%\. This is 6\.0 percentage points higher than Cosine Similarity\-based Retrieval and 2\.0 percentage points higher than the MAB baseline\. This performance gain suggests that explicitly modeling the structural relationships between communities of strategies and responses leads to more efficient vulnerability exploration than relying on the similarity of individual responses or a general adaptive learning algorithm\.
## Appendix L Scorer Human Agreement
As discussed in the Limitations \(Section 5\), STAR\-Teaming relies heavily on an LLM\-based Scorer to evaluate attack success and update the strategy network\. To validate the reliability of this automated evaluation, we assessed the alignment between our Scorer’s judgments and human annotations\. We utilized two high\-quality safety datasets: BeaverTails Ji et al\. \([2023](https://arxiv.org/html/2604.18976#bib.bib14)\) and WildGuard Han et al\. \([2024](https://arxiv.org/html/2604.18976#bib.bib11)\), both of which provide ground\-truth labels \(Safe vs\. Unsafe\) for prompt\-response pairs\.
For this evaluation, we fed the prompt\-response pairs from the datasets into our Scorer and categorized the resulting scalar scores \(1\-10\) into three intervals: Low \($<5$\), Medium \($5\leq\text{score}<8.5$\), and High \($\geq 8.5$\)\. We then calculated the distribution of human ground\-truth labels within each interval\. A robust scorer should assign low scores to human\-labeled ‘Safe’ responses \(typically refusals\) and high scores to ‘Unsafe’ responses \(successful jailbreaks\)\.
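The binning procedure described above is straightforward to reproduce\. The sketch below uses the interval boundaries from the text; the function names and the six toy records are illustrative and are not drawn from BeaverTails or WildGuard\.

```python
from collections import Counter, defaultdict

def score_bin(score):
    """Map a scalar score in [1, 10] to the Low / Medium / High interval."""
    if score < 5:
        return "Low"
    if score < 8.5:
        return "Medium"
    return "High"

def label_distribution(records):
    """Return {interval: {human_label: fraction}} for (score, human_label) pairs."""
    counts = defaultdict(Counter)
    for score, label in records:
        counts[score_bin(score)][label] += 1
    return {interval: {label: n / sum(c.values()) for label, n in c.items()}
            for interval, c in counts.items()}

# Toy records (illustrative only): (scorer output, human label).
records = [(2, "Safe"), (3, "Safe"), (4, "Unsafe"),
           (6, "Safe"), (8.5, "Unsafe"), (9, "Unsafe")]
distribution = label_distribution(records)
```

Each row of the agreement table corresponds to one entry of `distribution`, for example the share of human\-labeled Unsafe responses among those scored 8\.5 or higher\.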
Table [8](https://arxiv.org/html/2604.18976#A12.T8) presents the agreement analysis\. The results demonstrate a strong correlation between our Scorer’s ratings and human judgments:
First, in the high\-score interval \($\geq 8.5$\), which STAR\-Teaming classifies as a successful attack, our Scorer exhibits high precision\. On the BeaverTails dataset, 95\.1% of the responses in this range were labeled as Unsafe by humans\. Similarly, on WildGuard, 82\.2% were confirmed as Unsafe\. This indicates that when our system claims a jailbreak, it is highly likely to be a genuine vulnerability\. Second, regarding refusal detection, the Scorer effectively identifies failed attacks in the low\-score interval \($<5$\)\. On WildGuard, which explicitly focuses on refusal detection, 96\.7% of the low\-scoring responses were human\-labeled as Safe\. Finally, we compared our Scorer against the scoring mechanism used in AutoDAN\-Turbo\. As shown in the table, STAR\-Teaming achieves comparable or superior alignment, particularly in distinguishing clear safety violations \(High score range\) and safe refusals \(Low score range\)\.
These findings quantitatively support that our automated Scorer serves as a reliable proxy for human evaluation, minimizing false positives in jailbreak detection\.
| Model | Score | BeaverTails Safe | BeaverTails Unsafe | WildGuard Safe | WildGuard Unsafe |
|---|---|---|---|---|---|
| STAR\-Teaming | $\text{score}<5$ | 70\.7% | 29\.3% | 96\.7% | 3\.3% |
| STAR\-Teaming | $5\leq\text{score}<8.5$ | 19\.0% | 81\.0% | 65\.9% | 34\.1% |
| STAR\-Teaming | $8.5\leq\text{score}$ | 4\.9% | 95\.1% | 17\.8% | 82\.2% |
| AutoDAN\-Turbo | $\text{score}<5$ | 63\.0% | 37\.0% | 93\.7% | 6\.3% |
| AutoDAN\-Turbo | $5\leq\text{score}<8.5$ | 24\.1% | 75\.9% | 78\.8% | 21\.2% |
| AutoDAN\-Turbo | $8.5\leq\text{score}$ | 5\.2% | 94\.8% | 40\.4% | 59\.6% |

Table 8: Human agreement analysis of the Scorer\. We categorized the Scorer’s outputs into three ranges and measured the proportion of Human Safe and Unsafe labels within each range using the BeaverTails and WildGuard datasets\. Higher percentages of Unsafe in the high\-score range \($\geq 8.5$\) and Safe in the low\-score range \($<5$\) indicate better alignment with human judgment\.
### L\.1 Ensemble Scorer Analysis
| Scorer | Score Range | BeaverTails Safe | BeaverTails Unsafe | WildGuard Safe | WildGuard Unsafe | Cost / Query |
|---|---|---|---|---|---|---|
| Single Scorer \(Ours\) | Low \($<5$\) | 70\.7% | 29\.3% | 96\.7% | 3\.3% | $6.45\times 10^{-5}$ |
| Single Scorer \(Ours\) | Mid \($5\sim 8.4$\) | 19\.0% | 81\.0% | 65\.9% | 34\.1% | |
| Single Scorer \(Ours\) | High \($\geq 8.5$\) | 4\.9% | 95\.1% | 17\.8% | 82\.2% | |
| Ensemble Scorer \(3 LLMs\) | Low \($<5$\) | 72\.5% | 27\.5% | 95\.3% | 4\.7% | $23.97\times 10^{-5}$ |
| Ensemble Scorer \(3 LLMs\) | Mid \($5\sim 8.4$\) | 15\.1% | 84\.9% | 52\.2% | 47\.8% | |
| Ensemble Scorer \(3 LLMs\) | High \($\geq 8.5$\) | 3\.2% | 96\.8% \(\+1\.7\) | 10\.5% | 89\.5% \(\+7\.3\) | |

Table 9: Comparison between the default Single Scorer and an Ensemble Scorer that aggregates claude\-3\-haiku, gemini\-2\.5\-flash, and gpt\-4\.1\-mini via majority voting\. Parenthesized values indicate the absolute improvement in the Unsafe agreement rate within the High score range\. Costs are reported in USD per query\.

Although the single\-scorer reliability analysis in Table [8](https://arxiv.org/html/2604.18976#A12.T8) already demonstrates close alignment with human judgments, STAR\-Teaming ultimately depends on a single LLM evaluator, which raises a natural concern about single\-point failure\. To examine whether scorer ensembling can mitigate this risk, we implement a 3\-LLM Ensemble Scorer based on majority voting over claude\-3\-haiku, gemini\-2\.5\-flash, and gpt\-4\.1\-mini, drawn from distinct model families to diversify judgment behavior\. Evaluation is conducted under the same protocol used for the Single Scorer, using BeaverTails Ji et al\. \([2023](https://arxiv.org/html/2604.18976#bib.bib14)\) and WildGuard Han et al\. \([2024](https://arxiv.org/html/2604.18976#bib.bib11)\) as references, and we additionally report the monetary cost per query to characterize the practical trade\-off \(Table [9](https://arxiv.org/html/2604.18976#A12.T9)\)\.
The Ensemble Scorer yields a modest but consistent gain in the decision\-critical high\-score range \($\geq 8.5$\), improving Unsafe agreement from 95\.1% to 96\.8% on BeaverTails and from 82\.2% to 89\.5% on WildGuard\. However, this comes at roughly $3.7\times$ the per\-query cost, which compounds rapidly across the hundreds of scoring calls required per seed in an iterative red\-teaming run\. Given that the Single Scorer already achieves $>95\%$ agreement in the critical range on BeaverTails \(and 82\.2% on WildGuard\), we retain it as the default configuration for its favorable cost\-effectiveness, while viewing the Ensemble Scorer as a complementary option for deployments prioritizing judgment robustness over throughput, such as final safety audits preceding model release\.
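The cost ratio quoted above follows directly from the two per\-query costs in the ensemble comparison\. The snippet below recomputes it and also sketches a majority vote; how the three scalar scores are combined is not specified in the text, so voting over binned verdicts is an assumption, and `majority_vote` is a hypothetical helper\.

```python
from collections import Counter

def majority_vote(verdicts):
    """Return the verdict chosen by most scorers (ties go to the first one seen)."""
    return Counter(verdicts).most_common(1)[0][0]

# Per-query costs in USD, as reported for the single and ensemble scorers.
single_cost = 6.45e-5
ensemble_cost = 23.97e-5
cost_ratio = ensemble_cost / single_cost

# Hypothetical binned verdicts from three scorers for one response.
verdict = majority_vote(["High", "Low", "High"])
```

The ratio evaluates to about 3\.7, matching the figure in the text\.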
## Appendix M Stability Experiments
To ensure the reproducibility and robustness of STAR\-Teaming, we evaluated its performance stability across different random initializations\. Since our framework involves probabilistic sampling of strategies, it is crucial to verify that the high Attack Success Rate \(ASR\) is not a result of a fortuitous random seed but a consistent outcome of the optimization process\.
### M\.1 Random Seed Stability
We conducted independent attack runs using three distinct random seeds for three representative target models: Llama\-2\-7b\-chat, Qwen3\-8b, and Gemma3\-12b\-it\. Table 10 reports the mean ASR and the standard deviation \($\sigma_{\text{ASR}}$\) derived from these trials\.

| Target Model | Mean ASR \(%\) | $\sigma_{\text{ASR}}$ |
|---|---|---|
| Llama2\-7b\-chat | 71\.0 | 2\.4 |
| Qwen3\-8b | 71\.5 | 2\.4 |
| Gemma3\-12b\-it | 56\.6 | 4\.1 |

Table 10: Stability analysis of STAR\-Teaming across three random seeds\. The table shows the mean Attack Success Rate \(ASR\) and the standard deviation \($\sigma_{\text{ASR}}$\) for each target model, indicating consistent performance\.

The results demonstrate a high degree of stability\. The standard deviations are low, ranging from 2\.4 to 4\.1 percentage points\. For instance, on Llama\-2\-7b\-chat, the ASR fluctuated only slightly around the mean of 71\.0% \($\sigma=2.4$\)\. This small variance indicates that STAR\-Teaming consistently converges to effective attack strategies regardless of the initial seed, supporting the reliability of our multiplex network\-based optimization\.
### M\.2 Cross\-Model Stability
We evaluate the transferability of adversarial prompts generated by STAR\-Teaming across different target models, as shown in Table [11](https://arxiv.org/html/2604.18976#A13.T11)\. The results highlight the strong generalization of attacks generated from Llama\-2\-7b\-chat, which achieved high ASRs on all targets and even surpassed direct optimization on Gemma3\-12b\-it \(78\.4 versus 56\.6\)\. Conversely, Llama\-2\-7b\-chat remained robust against attacks transferred from others\. Within the Qwen3 family, attacks from the larger 8b model transferred effectively to the smaller 4b model \(71\.5\), while the reverse was less successful \(55\.0\)\. Overall, transferred prompts yielded higher ASRs than the direct baseline in all but one case \(Qwen3\-4b to Gemma3\-12b\-it, 45\.5 versus 46\.6\), confirming that STAR\-Teaming produces potent, generalized attacks without requiring additional optimization during inference\.

| Original Target $\downarrow$ / Transfer Target $\rightarrow$ | Llama2\-7b\-chat \(transfer\) | Gemma3\-12b\-it \(transfer\) | Qwen3\-4b \(transfer\) | Qwen3\-8b \(transfer\) |
|---|---|---|---|---|
| Llama2\-7b\-chat | 71\.0 | 78\.4 | 72\.3 | 67\.7 |
| Gemma3\-12b\-it | 33\.8 | 56\.6 | 74\.6 | 64\.4 |
| Qwen3\-4b | 24\.2 | 45\.5 | 72\.5 | 55\.0 |
| Qwen3\-8b | 30\.0 | 54\.5 | 71\.5 | 72\.0 |
| Direct | 0\.8 | 46\.6 | 18\.8 | 17\.6 |

Table 11: Cross\-model transferability of attack prompts generated by STAR\-Teaming\. Rows indicate the source model used for optimization, and columns indicate the target model being attacked\. Diagonal values represent direct STAR\-Teaming performance\. The Direct row shows the success rate of unoptimized harmful prompts\.
## Appendix N Time Complexity Experiments
As discussed in the main text, STAR\-Teaming operates as a strategy\-based multi\-agent system that iteratively refines attacks up to a maximum of 150 attempts\. Given that such iterative processes can be computationally intensive, evaluating efficiency is as critical as evaluating attack success\. Table [12](https://arxiv.org/html/2604.18976#A14.T12) presents a comparative analysis of computational costs on the HarmBench dataset against the Llama\-2\-7b\-chat target\. We report the Attack Success Rate \(ASR\), the average number of attack iterations \($\bar{N}(\mathcal{A})$\), and the average number of tokens consumed per attack\.

| Model | $\bar{N}(\mathcal{A})$ | Used Tokens | ASR | Attacker Model |
|---|---|---|---|---|
| TAP | 78\.4 | 438\.8 | 9\.3 | Gemma\-1\.1\-7b\-it |
| AutoDAN\-Turbo | 137\.2 | 230\.8 | 36\.6 | Gemma\-1\.1\-7b\-it |
| STAR\-Teaming | 61\.1 | 164\.0 | 71\.0 | Gemma\-1\.1\-7b\-it |

Table 12: Comparison of time complexity and attack performance\. $\bar{N}(\mathcal{A})$ denotes the average number of attack trials\.

To ensure a fair comparison, all methods were evaluated using the identical Attacker Model, Gemma\-1\.1\-7b\-it\. For the TAP baseline, each single inference of the attack model was counted as one iteration in $\bar{N}(\mathcal{A})$\. The results demonstrate that STAR\-Teaming is not only superior in terms of ASR but also the most efficient in terms of computational cost, requiring fewer tokens and trials than the baselines to achieve higher success rates\.
## Appendix OCross\-Model Strategy Profile
To substantiate STAR\-Teaming’s interpretability claim, we compute the empirical success rate of each strategy community against each target model, using the best attempt per seed to eliminate early\-stopping bias\. Figure [10](https://arxiv.org/html/2604.18976#A15.F10) sorts communities by their cross\-model average success rate, placing broadly effective strategies on the left\. The resulting profile reveals that model families exhibit qualitatively distinct vulnerabilities rather than a shared robustness ordering: even the most universally potent communities succeed only marginally against Llama\-2\-7b and Claude\-3\.5\-Sonnet while exceeding 20% on Gemma3\-12b, Qwen3\-8b, and GPT\-4o\. This community\-level resolution, which is obscured when strategies are treated atomically, offers direct guidance for both attackers, who can identify minimal effective subsets per target, and defenders, who can prioritize adversarial training on the communities where their model is most susceptible\.
Figure 10: Cross\-model strategy profile\.
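The profile computation described above can be sketched as follows\. This is one plausible reading, not the authors' implementation: the tuple schema, the success threshold, the toy data, and the choice to take the best attempt per \(model, community, seed\) triple are all assumptions made for illustration\.

```python
from collections import defaultdict

# Hypothetical attempt log: (target_model, seed_id, community_id, judge_score).
TOY_ATTEMPTS = [
    ("model_a", "s1", "c1", 0.2),
    ("model_a", "s1", "c1", 0.9),
    ("model_a", "s2", "c1", 0.1),
    ("model_a", "s1", "c2", 0.3),
    ("model_b", "s1", "c1", 0.8),
    ("model_b", "s1", "c2", 0.7),
]


def community_profile(attempts, threshold):
    """Return per-(model, community) success rates and communities sorted
    by cross-model average success rate (most effective first)."""
    # Keep only the best score per (model, community, seed), so seeds that
    # received many attempts do not outweigh seeds that stopped early.
    best = {}
    for model, seed, community, score in attempts:
        key = (model, community, seed)
        best[key] = max(score, best.get(key, float("-inf")))

    # A seed counts as a success if its best score reaches the threshold
    # (the threshold value is an assumption; the paper's judge scale may differ).
    outcomes = defaultdict(list)
    for (model, community, _seed), score in best.items():
        outcomes[(model, community)].append(score >= threshold)
    rates = {key: sum(flags) / len(flags) for key, flags in outcomes.items()}

    # Average each community over the models on which it was attempted.
    per_community = defaultdict(list)
    for (_model, community), rate in rates.items():
        per_community[community].append(rate)
    order = sorted(
        per_community,
        key=lambda c: sum(per_community[c]) / len(per_community[c]),
        reverse=True,
    )
    return rates, order


RATES, ORDER = community_profile(TOY_ATTEMPTS, threshold=0.5)
print(ORDER)
print(RATES)
```

Collapsing to the best attempt before computing rates gives every seed equal weight, which is the property the early\-stopping correction is meant to provide\.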
## Appendix P Qualitative Results
This section presents recorded examples of the attack, response, and scoring pipeline within the STAR\-Teaming framework\.
Figure 11: Illustration of attack pipeline\.

Figure 12: Illustration of attack pipeline\.

Figure 13: Illustration of attack pipeline\.

Figure 14: Illustration of attack pipeline\.

Figure 15: Illustration of attack pipeline\.