Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents
Summary
This paper introduces Base Sequence Analysis, a framework that encodes LLM agent runtime behavior into compact sequences, revealing high-risk patterns like the 'P-X-P' trigram and a verification deficit. It presents Governor, a runtime intervention system that improves task success by 6.2% and reduces token consumption by 44%.
# Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents
Source: [https://arxiv.org/html/2606.15579](https://arxiv.org/html/2606.15579)
\(April 2026\)
###### Abstract
We propose Base Sequence Analysis, a framework that encodes the runtime behavior of LLM-powered autonomous agents into compact symbolic sequences using a four-letter alphabet: X (Explore), E (Execute), P (Plan), and V (Verify). Drawing an analogy to genomic sequence analysis, we apply n-gram pattern mining, Markov transition matrices, and point-biserial correlation to 347 real-world execution traces collected from a production ReAct agent system over 8 days. Our analysis reveals that (1) the trigram P-X-P is the only statistically significant high-risk pattern, lowering success rate by 10.4%; (2) P-ratio is the strongest negative predictor of success ($r = -0.256$, $p < 0.0001$); and (3) the E→V transition probability is only 2.1%, indicating a systemic verification deficit. Based on these findings, we design Governor, a three-layer runtime intervention system comprising a rule engine, a statistical accumulator, and a chi-square-based threshold adaptor. Governor’s rules are not hand-coded heuristics: they emerged from systematic data analysis and continue to evolve through online chi-square testing, with thresholds that self-correct when initial assumptions prove wrong. Governor operates at the sequence level with zero LLM overhead, injecting corrective prompts when high-risk base patterns are detected. In a natural before/after deployment evaluation ($N = 101$ vs. $N = 246$), Governor achieves a +6.2% absolute increase in task success rate while simultaneously reducing average token consumption by 44%. To validate cross-system generality, we define an adapter interface for porting the XEPV encoding to other agent frameworks and apply it to 2,000 public SWE-agent trajectories on SWE-bench, confirming that two of three core findings (exploration spirals and the E→V verification deficit) replicate in an independent system with a structurally different action space. 
The analysis further reveals model\-level behavioral fingerprints: larger models exhibit naturally higher verification rates, suggesting that base sequence profiles can serve as behavioral identity signatures\. Building on these results, we outline six research directions—base sequence language models, base\-conditioned decoding, sequence anomaly detection, dual\-stream agent architectures, base sequence reward models, and base sequence fingerprinting—that chart a path from interpretable rules to learned behavioral governance\. We conclude by arguing that base sequence governance represents a “cerebellum” for agent systems—a coordination layer between the LLM brain and the tool\-execution body—whose full potential requires community\-scale data far beyond what any individual deployment can generate\.
Keywords: LLM Agents, ReAct, Behavioral Analysis, Sequence Mining, Runtime Governance, Base Sequence
## 1 Introduction
Large language model (LLM) powered autonomous agents have emerged as a dominant paradigm for complex task execution (Yao et al., [2023](https://arxiv.org/html/2606.15579#bib.bib14); Shinn et al., [2023](https://arxiv.org/html/2606.15579#bib.bib6); Wang et al., [2023](https://arxiv.org/html/2606.15579#bib.bib9)). These systems interleave reasoning and action in a ReAct loop (Yao et al., [2023](https://arxiv.org/html/2606.15579#bib.bib14)): the LLM selects tools, observes results, and iterates until task completion. While substantial progress has been made in agent architecture design, our understanding of *what agents actually do at runtime* remains surprisingly shallow. Existing evaluation frameworks (Liu et al., [2024](https://arxiv.org/html/2606.15579#bib.bib3); Yang et al., [2024](https://arxiv.org/html/2606.15579#bib.bib13)) focus on outcome metrics (pass rate, accuracy) without analyzing the *behavioral trajectory* that leads to success or failure.
This gap matters\. Consider two agents that both achieve 90% success rate: one may reach it through efficient explore\-then\-execute sequences, while the other oscillates between planning and exploration before stumbling into correct actions\. These agents have identical outcome metrics but fundamentally different behavioral profiles—and the second is far more fragile to distribution shift\.
We address this gap with a bioinformatics-inspired approach. Just as genomic analysis extracts meaning from sequences of four nucleotide bases (A, T, C, G), we encode each step of an agent’s execution into one of four base types:
- X (Explore): Information gathering (reading files, web searches, directory listing)
- E (Execute): State-changing actions (writing files, running commands, API calls)
- P (Plan): Reasoning and strategy (task decomposition, Reflexion, re-planning)
- V (Verify): Validation (running tests, checking outputs, re-reading written files)

A task execution thus becomes a base sequence such as X-X-P-E-E-V-E, which can be analyzed using the rich toolkit of sequence analysis: n-grams, transition matrices, correlation studies, and pattern mining.
#### Contributions\.
This paper makes four contributions:
1. Base Sequence Abstraction (§[3](https://arxiv.org/html/2606.15579#S3)): A formal encoding scheme that maps heterogeneous agent tool calls to a four-letter alphabet, along with an 8-dimensional feature vector, a co-designed execution trace format, and an adapter interface (§[3.4](https://arxiv.org/html/2606.15579#S3.SS4)) that enables cross-system portability.
2. Empirical Behavioral Analysis (§[4](https://arxiv.org/html/2606.15579#S4)): A comprehensive analysis of 347 production execution traces revealing actionable patterns: P-X-P oscillation as the sole high-risk trigram, P-ratio as the strongest failure predictor, and a systemic 2.1% E→V verification deficit.
3. Governor (§[5](https://arxiv.org/html/2606.15579#S5), §[6](https://arxiv.org/html/2606.15579#S6)): A three-layer runtime intervention system whose rules emerge from data analysis (§[5.5](https://arxiv.org/html/2606.15579#S5.SS5)) and evolve through online chi-square testing, achieving +6.2% success rate and −44% token consumption with zero LLM overhead.
4. Cross-System Validation (§[6.7](https://arxiv.org/html/2606.15579#S6.SS7)): Application of the XEPV encoding to 2,000 public SWE-agent trajectories on SWE-bench, confirming that exploration spirals and the E→V deficit replicate across systems, while revealing model-level behavioral fingerprints.
## 2 Related Work
#### LLM Agent Architectures\.
The ReAct framework (Yao et al., [2023](https://arxiv.org/html/2606.15579#bib.bib14)) established the interleaved reasoning-action paradigm. Subsequent work enriched this loop: Reflexion (Shinn et al., [2023](https://arxiv.org/html/2606.15579#bib.bib6)) adds verbal self-reflection on failure; Tree of Thoughts (Yao et al., [2024](https://arxiv.org/html/2606.15579#bib.bib15)) and LATS (Zhou et al., [2024](https://arxiv.org/html/2606.15579#bib.bib16)) introduce search over reasoning paths; Voyager (Wang et al., [2023](https://arxiv.org/html/2606.15579#bib.bib9)) adds a persistent skill library for lifelong learning; Toolformer (Schick et al., [2023](https://arxiv.org/html/2606.15579#bib.bib5)) teaches models to invoke tools autonomously. CoALA (Sumers et al., [2024](https://arxiv.org/html/2606.15579#bib.bib7)) provides a unifying cognitive architecture taxonomy. Our work is orthogonal to architecture design: we analyze the *behavioral output* of any ReAct-style agent, regardless of its internal architecture.
#### Agent Evaluation and Benchmarking\.
AgentBench (Liu et al., [2024](https://arxiv.org/html/2606.15579#bib.bib3)) evaluates LLMs across 8 agent environments; SWE-bench and SWE-agent (Yang et al., [2024](https://arxiv.org/html/2606.15579#bib.bib13)) focus on software engineering tasks; OpenHands (Wang et al., [2024](https://arxiv.org/html/2606.15579#bib.bib10)) provides a platform for reproducible agent evaluation. These works measure *what* agents achieve (pass rate) but not *how* they achieve it. Our base sequence analysis fills this gap by providing a behavioral lens on agent execution trajectories.
#### Process Mining\.
The idea of extracting patterns from execution logs has deep roots in business process mining (van der Aalst, [2016](https://arxiv.org/html/2606.15579#bib.bib8)). Process mining discovers, monitors, and improves processes by analyzing event logs. Our base sequence framework can be viewed as process mining applied to LLM agent traces, with the key difference that our “process” is not pre-defined but emerges from LLM decision-making, and our intervention (Governor) operates in real time rather than offline.
#### LLM Safety and Guardrails\.
Constitutional AI (Bai et al., [2022](https://arxiv.org/html/2606.15579#bib.bib2)) governs LLM behavior through principles embedded in training. NeMo Guardrails (Rebedea et al., [2023](https://arxiv.org/html/2606.15579#bib.bib4)) provides a programmable toolkit for constraining LLM outputs. These approaches operate at the *semantic level*, analyzing what the model says or intends. Governor operates at the *sequence level*, analyzing the pattern of actions over time without interpreting semantic content. This makes it complementary to semantic guardrails and significantly cheaper to compute.
#### Agent Self\-Improvement\.
RAGEN (Wang et al., [2025](https://arxiv.org/html/2606.15579#bib.bib11)) trains agents via multi-turn reinforcement learning to improve their action selection. Our approach is non-learned: Governor uses hand-crafted rules derived from empirical analysis, with only the thresholds adapted via chi-square testing. This design choice reflects our data regime ($N = 347$), where learned approaches would overfit. We discuss the path from rules to learned models in §[7](https://arxiv.org/html/2606.15579#S7).
## 3 The Base Sequence Framework
### 3.1 Base Encoding
We define a base classifier function $\mathcal{C}: (\text{tool}, \text{args}, \text{ctx}) \to \{E, P, V, X\}$ that maps each tool invocation to a base type. The classifier is deterministic, stateful, and operates with $O(1)$ amortized complexity per call.
#### Classification Rules\.
The classifier follows a priority chain:
1. V (highest priority): Triggered when (a) a read operation targets a resource that was recently written (write-then-read verification), (b) the same tool is retried immediately after an error, or (c) a compile/test/lint command follows a write operation.
2. X: Triggered when (a) the tool is a known read/search tool accessing a previously unseen resource, (b) the tool is a web search or fetch, or (c) an unknown tool’s parameter signature suggests read intent.
3. P: Assigned by the LLM itself via structured metadata when it performs reasoning, task decomposition, or Reflexion. P cannot be reliably inferred from tool calls alone.
4. E (default): All remaining tool calls (file writes, command execution, API mutations).

The stateful context ctx tracks successfully accessed resources and recent write operations (sliding window of 10), enabling the write-then-read verification detection for V.
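
The priority chain can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the production classifier: the tool names, the check-command prefixes, and the `is_plan` flag standing in for the LLM's structured metadata are all hypothetical, and rule X(c) (inferring read intent from an unknown tool's parameters) is omitted. The source does not say how a repeated read of an already seen, unwritten resource is classified; here it falls through to the default E.

```python
from collections import deque

# Hypothetical tool names and command prefixes, for illustration only.
READ_TOOLS = {"read_file", "list_dir", "grep"}
WEB_TOOLS = {"web_search", "web_fetch"}
WRITE_TOOLS = {"write_file", "edit_file"}
CHECK_PREFIXES = ("pytest", "npm test", "tsc", "eslint")


class BaseClassifier:
    def __init__(self, window=10):
        self.seen = set()                          # successfully accessed resources
        self.recent_writes = deque(maxlen=window)  # sliding window of written resources
        self.last = None                           # (tool, failed) of the previous call

    def classify(self, tool, resource=None, command="", is_plan=False, failed=False):
        base = self._decide(tool, resource, command, is_plan)
        # Update state only after the decision, and only for successful calls.
        if not failed and resource is not None:
            self.seen.add(resource)
            if tool in WRITE_TOOLS:
                self.recent_writes.append(resource)
        self.last = (tool, failed)
        return base

    def _decide(self, tool, resource, command, is_plan):
        # 1. V: write-then-read, immediate retry after an error, or a check after a write.
        if tool in READ_TOOLS and resource in self.recent_writes:
            return "V"
        if self.last == (tool, True):
            return "V"
        if self.recent_writes and command.startswith(CHECK_PREFIXES):
            return "V"
        # 2. X: read of a previously unseen resource, or any web search/fetch.
        if tool in WEB_TOOLS or (tool in READ_TOOLS and resource not in self.seen):
            return "X"
        # 3. P: only when the LLM flags the step itself (assumed metadata flag).
        if is_plan:
            return "P"
        # 4. E: default for everything else.
        return "E"
```

Feeding it a directory listing, a file read, a write to that file, and a test command yields the four bases X, X, E, V in order.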
#### Formal Representation\.
Given a task execution with $n$ tool calls, the base sequence is:

$$S = b_1\text{-}b_2\text{-}\cdots\text{-}b_n, \quad b_i \in \{E, P, V, X\} \tag{1}$$

For example, a task that reads a directory, reads a file, writes a fix, and runs tests produces $S =$ X-X-E-V.
### 3.2 Feature Extraction
From each base sequence $S$, we extract an 8-dimensional feature vector $\mathbf{f} \in \mathbb{R}^8$:

Table 1: Eight-dimensional feature vector extracted from base sequences. All features are computable in $O(n)$, where $n$ is the sequence length (typically $< 25$).

The first four features capture immediate behavioral signals (exploration inertia, sequence length, local exploration density, behavioral stability). The latter four, introduced in v2 based on empirical findings, capture structural patterns (late planning, verification coverage, execution momentum, exploration dominance).
### 3.3 Trace Co-Design: Linking Bases to Execution Metadata
A key design principle is that base sequences do not exist in isolation. Each base is embedded in a richly annotated execution trace record that co-stores:
- Per-tool token cost: (prompt_tokens, completion_tokens) for every tool call, enabling per-base-type token attribution.
- Context injection metadata: For each task, the system records *what* was injected into the LLM context: how many memory entries were retrieved (and their similarity scores), which skills were injected (and their semantic match scores), and the total character budget consumed by each partition.
- Turn-level metadata: Each ReAct turn records whether it emitted a P base, how many tool calls it made, whether it was a Reflexion turn, and the tool base types for that turn.
- Governor intervention records: When a rule fires, the trace stores the rule name, step index, full 8-dimensional feature snapshot, and a counterfactual success rate estimate.

This co\-design enables cross\-cutting analyses that would be impossible with base sequences alone:
#### Skill Injection Optimization\.
By correlating base patterns with contextInjectionMeta.skills, we can identify which skill injection configurations lead to shorter, more E-heavy sequences. Tasks where the top-ranked injected skill has a semantic score $> 0.8$ produce sequences that are 28% shorter on average, with P-ratio reduced by 3.1 percentage points.
#### Memory Retrieval Quality\.
The contextInjectionMeta.memory fields reveal that 77% of searchMemory calls (encoded as X bases) return no results. Each empty retrieval adds an X step to the sequence without information gain. By tracking l0AvgScore alongside base sequences, we can quantify the “wasted X” cost: empty memory retrievals account for an estimated 11% of total token consumption in memory-using tasks.
#### Token Attribution\.
The per-tool tokenCost field enables precise attribution: in our dataset, X bases consume 41% of total tokens, E bases 35%, P bases 19%, and V bases 5%. Combined with Governor intervention data, this reveals that Governor’s primary token savings come from reducing wasteful X chains: tasks where X-Brake fires show 38% lower X-base token consumption.
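
Per-base token attribution amounts to a grouped sum over trace steps. The sketch below assumes each step has already been flattened to a `(base, prompt_tokens, completion_tokens)` tuple; that tuple layout and the function name are illustrative choices, not the trace schema itself.

```python
from collections import Counter


def token_share_by_base(steps):
    """steps: iterable of (base, prompt_tokens, completion_tokens) tuples.

    Returns each base type's share of total tokens, or {} if there are none.
    """
    totals = Counter()
    for base, prompt_tokens, completion_tokens in steps:
        totals[base] += prompt_tokens + completion_tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {base: tokens / grand_total for base, tokens in totals.items()}


# Made-up step costs, chosen only so the shares are easy to read.
steps = [("X", 300, 100), ("E", 250, 100), ("X", 10, 0), ("P", 190, 0), ("V", 40, 10)]
shares = token_share_by_base(steps)
```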
### 3.4 Adapter Interface for Cross-System Portability
The XEPV encoding is defined at the semantic level (explore, execute, plan, verify) rather than the syntactic level (specific tool names). To apply it to a new agent system, an adapter must be provided that maps the system’s action vocabulary to the four base types:
###### Definition 1 (XEPV Adapter).
An adapter is a function $\mathcal{A}: \mathcal{T} \to \{\text{X}, \text{E}, \text{P}, \text{V}\}$ where $\mathcal{T}$ is the system’s action space. $\mathcal{A}$ must satisfy:
1. Completeness: every action in $\mathcal{T}$ maps to exactly one base type.
2. Semantic consistency: actions that gather information without modifying state map to X; actions that modify artifacts map to E; actions that validate outcomes map to V; turns with no tool call map to P.

In this paper, we implement two adapters: a DunCrew adapter (20+ tools, described in §[4](https://arxiv.org/html/2606.15579#S4)) and an SWE-agent adapter (command-line actions, described in §[6.7](https://arxiv.org/html/2606.15579#S6.SS7)). The SWE-agent adapter illustrates a key design consideration: SWE-agent’s forced-action architecture produces P = 0%, which is architecturally correct (every turn contains a command) rather than an adapter error. Adapters should preserve such structural properties rather than artificially mapping actions to P.
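
A minimal adapter for a command-line agent might look like the following. The command-to-base table is hypothetical and is not the SWE-agent adapter used in the paper; defaulting unknown commands to E is likewise an assumption, made so that the function is total and the completeness requirement holds by construction.

```python
BASES = {"X", "E", "P", "V"}

# Hypothetical mapping from a command's first word to a base type.
COMMAND_TO_BASE = {
    "open": "X", "find_file": "X", "search_dir": "X",
    "edit": "E", "create": "E",
    "pytest": "V",
}


def adapt(action):
    """Map one action to a base; None stands for a turn with no tool call."""
    if action is None:
        return "P"
    parts = action.split()
    head = parts[0] if parts else ""
    return COMMAND_TO_BASE.get(head, "E")  # assumption: unknown commands count as E


def encode(actions):
    """Encode a trajectory (list of actions) as a base sequence string."""
    return "-".join(adapt(a) for a in actions)
```

In a forced-action system no turn is ever `None`, so this adapter never emits P for it, which matches the P = 0% property described above.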
## 4 Empirical Analysis
### 4.1 Dataset
We analyze 347 execution traces collected from a production ReAct agent system over 8 days \(March 27–April 3, 2026\), used for diverse real\-world tasks including software development, web search, document generation, and data analysis\.
#### Experimental Platform\.
All data were collected from DunCrew ([https://duncrew.com](https://duncrew.com/)), an LLM agent operating system that runs on the user’s local device. DunCrew implements a ReAct execution engine in TypeScript, supporting the OpenAI-compatible function-calling protocol with 20+ registered tools (file I/O, command execution, web search, memory retrieval, etc.). The underlying LLM for all 347 traces is Qwen-3.6-plus-preview (Alibaba Cloud), accessed via function-calling mode throughout the data collection period. The system architecture includes three integration points directly relevant to base sequence analysis:
1. Base classifier embedding point: The classifier (baseClassifier, ~300 lines of TypeScript) is mounted as middleware on the tool-call return path. After each tool invocation completes, the classifier determines the base type from tool type, argument signature, and stateful context (a sliding window of the 10 most recent write operations), writing the result into the current trace’s baseSequence field. Classification latency is $< 1$ ms with no perceptible impact on execution flow.
2. Governor mounting point: Governor (~920 lines of TypeScript) executes synchronously after each base classification. It reads the current base sequence, computes the 8-dimensional feature vector, evaluates 7 rules, and, if triggered, injects a natural-language corrective prompt into the LLM’s next-turn system message. The entire process completes within the ReAct loop without introducing additional LLM calls.
3. Trace recording layer: Upon task completion, the system serializes the full execution trace (including the base sequence, per-tool token costs, context injection metadata, and Governor intervention records) as JSONL to a local exec_traces/ directory. All analyses in this paper are based on these raw trace files.

DunCrew also integrates a skill system \(extensible capability templates defined in Markdown\) and a two\-layer memory system \(ephemeral \+ persistent\), both linked to base sequences via context injection metadata as described in §[3\.3](https://arxiv.org/html/2606.15579#S3.SS3)\.
#### Prompt Freeze\.
To ensure that observed behavioral changes are attributable to Governor rather than prompt engineering, the system prompt was frozen throughout the data collection period \(March 27–April 3\)\. Git history confirms the last prompt modification occurred on March 24—three days before data collection began\. No prompt, tool definition, or skill template was modified during the 8\-day collection window\.
Figure 1: DunCrew system architecture with three base sequence integration points: ① the base classifier on the tool-call return path, ② Governor executing synchronously after each classification, and ③ the trace recorder serializing full execution traces to JSONL.

Table 2: Dataset summary statistics.

Figure 2: Base sequence analysis overview dashboard. Four panels show: (top-left) base distribution pie chart; (top-right) success rate by sequence length bucket; (bottom-left) monthly base distribution comparison; (bottom-right) base ratio box plots for successful vs. failed tasks.
### 4.2 Base Distribution
The global base distribution is heavily skewed toward exploration and execution:
Table 3: Global base distribution ($N = 347$ traces, 3,015 total bases).

The X+E ratio of 84.1% confirms that the agent operates primarily in an explore-execute loop, with planning and verification as minority activities. The 3.3% V ratio is a systemic concern addressed in §[5](https://arxiv.org/html/2606.15579#S5).

Figure 3: Sequence length vs. task duration scatter plot. Green circles indicate successful tasks; red crosses indicate failures. A quadratic fit shows the expected cost growth. Notably, failures cluster at short-to-medium lengths rather than at long sequences.
### 4.3 N-gram Pattern Analysis
We extract all 2\-grams and 3\-grams from the 347 sequences and compute per\-pattern success rates\.
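
As a sketch of this computation, the functions below extract n-grams from a hyphen-separated base sequence and compute a success rate per pattern. The source does not state whether a pattern is counted once per trace or once per occurrence; this version assumes once per trace, and the three example traces are made up.

```python
from collections import defaultdict


def ngrams(seq, n):
    """All contiguous n-grams of a base sequence such as 'X-X-P-E'."""
    bases = seq.split("-")
    return ["-".join(bases[i:i + n]) for i in range(len(bases) - n + 1)]


def pattern_success_rates(traces, n=3):
    """traces: iterable of (base_sequence, succeeded).

    Returns {pattern: (traces_containing_it, success_rate)}; a trace counts
    once per pattern even if the pattern occurs several times in it.
    """
    hits = defaultdict(lambda: [0, 0])  # pattern -> [count, successes]
    for seq, succeeded in traces:
        for pattern in set(ngrams(seq, n)):
            hits[pattern][0] += 1
            hits[pattern][1] += int(succeeded)
    return {p: (count, wins / count) for p, (count, wins) in hits.items()}


# Made-up traces for illustration.
rates = pattern_success_rates([
    ("P-X-P-E", False),
    ("X-X-X-E", True),
    ("P-X-P-X", True),
])
```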
#### High\-Risk Patterns\.
Table 4: High-risk trigrams. P-X-P (plan-explore-plan oscillation) is the only pattern with statistically significant negative impact on success rate.

P-X-P represents a “planning oscillation” where the agent plans, explores, then plans again without executing. This pattern indicates that the planning step failed to incorporate exploration results, triggering a re-plan. With an 83.3% success rate (−10.4% vs. global 92.5%), it is the *only* trigram that significantly degrades performance.

Notably, X-X-X (continuous exploration) has a success rate of 94.4%, *above* the global average. This contradicts the intuition that “too much exploration is bad”: sustained exploration is neutral or mildly positive; it is the *failure to transition from exploration to execution* that causes problems.
#### High\-Efficiency Patterns\.
Table 5: High-efficiency patterns. Any pattern containing V achieves 100% success rate (though with small sample sizes due to the 3.3% V rate).

(a) High-risk vs. high-efficiency pattern success rates.

(b) Top 15 trigrams by frequency, colored by success rate.

Figure 4: N-gram pattern analysis. (a) P-X-P is the only pattern significantly below the global success rate (dashed line). (b) Among the 15 most frequent trigrams, P-X-P (42 occurrences, 83.3%) stands out as the lowest.
### 4.4 Transition Probability Matrix
We compute the first\-order Markov transition matrix over all adjacent base pairs:
Figure 5: First-order Markov transition probability heatmaps. Left: global ($N = 347$). Center: March subset. Right: April subset. The dominant self-loops (E→E, X→X) and the near-zero E→V transition are consistent across both months.

Table 6: First-order Markov transition matrix. Bold values indicate the most probable next state. E→V is only 2.1%: the agent almost never verifies after executing.

Three structural insights emerge: (1) Strong self-loops: E→E (68.0%) and X→X (57.2%) create execution chains and exploration spirals; (2) P→X dominance (56.8%): after planning, the agent explores rather than executes, potentially contributing to P-X-P oscillation; (3) E→V deficit (2.1%): the agent almost never verifies after executing, the most significant structural weakness.
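
A first-order transition matrix of this kind can be estimated by counting adjacent base pairs and normalizing each row. The sketch below assumes hyphen-separated sequences and returns a nested dict; rows with no outgoing transitions are left as zeros, which is a choice made here for simplicity.

```python
from collections import Counter

BASES = "XEPV"


def transition_matrix(sequences):
    """Row-normalized counts of adjacent base pairs: matrix[a][b] = P(next=b | current=a)."""
    counts = Counter()
    for seq in sequences:
        bases = seq.split("-")
        counts.update(zip(bases, bases[1:]))  # adjacent pairs within one trace only
    matrix = {}
    for a in BASES:
        row_total = sum(counts[(a, b)] for b in BASES)
        matrix[a] = {b: (counts[(a, b)] / row_total if row_total else 0.0) for b in BASES}
    return matrix


# Two made-up sequences for illustration.
m = transition_matrix(["X-X-E-E-V", "P-X-E"])
```

Pairs are never formed across trace boundaries, so the last base of one trace does not transition into the first base of the next.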
### 4.5 Feature Correlation with Success
We compute point\-biserial correlations between extracted features and binary success/failure\. In addition to the 8\-dimensional feature vector used for Governor’s online rule evaluation \(§[3\.2](https://arxiv.org/html/2606.15579#S3.SS2)\), we include base\-type ratio features \(P\_ratio, E\_ratio, X\_ratio\) and a binary has\_V indicator, which are directly derivable from the sequence and provide more intuitive correlation targets for behavioral analysis:
Figure 6: Point-biserial correlation coefficients between base sequence features and task success. P_ratio ($r = -0.256$, $p < 0.0001$) is the strongest predictor. Features are sorted by absolute correlation; stars indicate significance level.

Table 7: Point-biserial correlations with task success. P_ratio is the strongest predictor; higher planning ratio strongly predicts failure.

The strongest finding: P_ratio ($r = -0.256$, $p < 0.0001$) is by far the most powerful predictor. This does not mean planning is harmful per se, but that *excessive planning relative to execution* is the clearest behavioral signature of failure. X_ratio shows no significant correlation ($r = -0.033$), confirming that exploration volume is behaviorally neutral; it is the *pattern* of exploration (X-X-X vs. P-X-P) that matters.
### 4.6 Failure Characterization
Failed tasks ($N = 26$) exhibit a distinct “base genome”:

Figure 7: Failure analysis. Left: error type distribution across failed tasks. Right: monthly success rate trend showing stable performance between March and April.

Table 8: Base distribution comparison: failed vs. succeeded tasks. Failed tasks have doubled P-ratio and halved E-ratio.

The failure “base genome” is: high P + low E + short sequence. Failed tasks get trapped in plan-explore oscillation without reaching sufficient execution steps. This is consistent with the P_ratio correlation ($r = -0.256$) and the P-X-P risk pattern (−10.4%).
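
For reference, the point-biserial coefficient is the Pearson correlation between a continuous feature and a 0/1 outcome, and can be written as $r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}}$, where $M_1$ and $M_0$ are the feature means over successes and failures, $s_n$ is the population standard deviation of the feature, and $n_1$, $n_0$ are the group sizes. The sketch below computes it with the standard library; the helper `p_ratio` and the toy inputs are illustrative, and no significance test is included.

```python
import math


def point_biserial(values, outcomes):
    """values: one feature value per trace; outcomes: 1 = success, 0 = failure."""
    n = len(values)
    ones = [v for v, y in zip(values, outcomes) if y == 1]
    zeros = [v for v, y in zip(values, outcomes) if y == 0]
    if not ones or not zeros:
        raise ValueError("need at least one success and one failure")
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population SD
    if sd == 0:
        raise ValueError("feature is constant")
    m1 = sum(ones) / len(ones)
    m0 = sum(zeros) / len(zeros)
    return (m1 - m0) / sd * math.sqrt(len(ones) * len(zeros) / n ** 2)


def p_ratio(seq):
    """Fraction of P bases in a hyphen-separated base sequence."""
    bases = seq.split("-")
    return bases.count("P") / len(bases)
```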
## 5 Governor: Runtime Sequence-Level Intervention
Governor translates the empirical findings of §[4](https://arxiv.org/html/2606.15579#S4) into a runtime intervention system. It evaluates the agent’s base sequence after each tool call and injects corrective prompts when high-risk patterns are detected.
### 5.1 Architecture
Governor employs a three\-layer architecture:
- Layer 1 (Online Rule Engine): Evaluates the current base sequence against 7 rules using the 8-dimensional feature vector. Pure if/else logic with $O(n)$ complexity ($n \leq 25$ typically). Zero LLM calls. When triggered, returns a natural-language prompt injection.
- Layer 2 (Statistical Accumulator): After each task completes, records the outcome (success/failure) partitioned by a 4D bucket key derived from features, and tracks per-rule intervention vs. control group statistics.
- Layer 3 (Threshold Adaptor): Every $N$ traces (default 50), performs Yates-corrected chi-square tests ($\alpha = 0.05$, $\chi^2_{\text{crit}} = 3.841$) on each rule’s intervention vs. control success rates. Tightens thresholds when intervention helps; loosens when it hurts. Requires minimum 20 samples per group.
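
The Layer 3 test can be illustrated with the standard Yates-corrected statistic for a 2×2 table, $\chi^2 = \frac{n\,(|ad - bc| - n/2)^2}{(a+b)(c+d)(a+c)(b+d)}$, compared against 3.841. The sketch below is one plausible reading of the description, not the deployed adaptor: the function names and the three-way "tighten / loosen / hold" return value are assumptions.

```python
CHI2_CRIT = 3.841   # alpha = 0.05, 1 degree of freedom
MIN_SAMPLES = 20    # minimum tasks per group before testing


def yates_chi_square(a, b, c, d):
    """2x2 table: intervention (a successes, b failures), control (c successes, d failures)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    diff = max(abs(a * d - b * c) - n / 2, 0.0)  # continuity correction, floored at 0
    return n * diff ** 2 / denom


def adapt_decision(a, b, c, d):
    """Return 'tighten', 'loosen', or 'hold' for one rule's threshold."""
    if a + b < MIN_SAMPLES or c + d < MIN_SAMPLES:
        return "hold"                      # not enough data in one of the groups
    if yates_chi_square(a, b, c, d) < CHI2_CRIT:
        return "hold"                      # difference not significant
    return "tighten" if a / (a + b) > c / (c + d) else "loosen"
```

With made-up counts of 45/50 successes under intervention and 30/50 in the control group, the statistic is about 10.45, which exceeds the critical value, so this sketch would tighten the threshold.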
### 5.2 Rule Design
Each rule is derived from a specific empirical finding:
Table 9: Governor rules. Each is derived from empirical findings. step_fuse was disabled after data showed long sequences have *higher* success rates, validating data-driven rule management.

Notably, step_fuse (terminate long sequences) was *disabled* after deployment data showed that tasks reaching $> 15$ steps have a 97.4% success rate. This demonstrates the value of the Layer 3 feedback loop: incorrect priors are identified and corrected.
### 5.3 Intervention Mechanism
When Layer 1 triggers, the prompt injection is appended to the LLM’s context before the next turn\. The injection is natural language, e\.g\.:
> \[Base Sequence Warning\] You have performed 14 consecutive exploration operations without progress\. Please stop the current approach, re\-analyze the problem, and devise a completely different strategy\.
This is a “soft” intervention: it does not modify the execution flow, block tool calls, or override LLM decisions\. It adds information to the LLM’s context that the LLM may choose to follow or ignore\. The cost is negligible—typically 50–100 tokens per injection\.
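
A consecutive-exploration rule of this kind could be sketched as follows. The threshold value of 8 is a placeholder (the actual x_brake threshold is not given in this section), and appending the warning to the system prompt with a blank line is an assumed injection format.

```python
def trailing_run(bases, base):
    """Length of the run of `base` at the end of the sequence."""
    run = 0
    for b in reversed(bases):
        if b != base:
            break
        run += 1
    return run


def x_brake(bases, threshold=8):
    """Return a corrective prompt if the sequence ends in a long X run, else None."""
    run = trailing_run(bases, "X")
    if run < threshold:
        return None
    return (
        f"[Base Sequence Warning] You have performed {run} consecutive exploration "
        "operations without progress. Please stop the current approach, re-analyze "
        "the problem, and devise a completely different strategy."
    )


def next_system_message(system_prompt, bases):
    """Soft intervention: only adds text to the context; never blocks a tool call."""
    warning = x_brake(bases)
    return system_prompt if warning is None else f"{system_prompt}\n\n{warning}"
```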
### 5.4 Counterfactual Estimation
For each intervention event, Governor records a counterfactual success rate estimate by querying the Layer 2 bucket table: “given the current feature vector, what is the historical success rate of tasks that received *no* intervention?” This enables Layer 3 to assess whether intervention improved outcomes beyond what would have happened naturally.
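
The lookup can be pictured as a table keyed by a discretized feature tuple. The source does not specify which four features form the bucket key or how they are binned; the sketch below assumes four features scaled to [0, 1] and three equal-width bins each, purely for illustration.

```python
from collections import defaultdict


def bucket_key(features, bins=(3, 3, 3, 3)):
    """Discretize four features (assumed to lie in [0, 1]) into a 4D key."""
    return tuple(min(int(f * b), b - 1) for f, b in zip(features, bins))


class BucketTable:
    def __init__(self):
        # key -> [tasks without intervention, successes among them]
        self.stats = defaultdict(lambda: [0, 0])

    def record(self, features, succeeded, intervened):
        if intervened:
            return  # only non-intervened tasks inform the counterfactual
        entry = self.stats[bucket_key(features)]
        entry[0] += 1
        entry[1] += int(succeeded)

    def counterfactual(self, features):
        """Historical success rate without intervention, or None if the bucket is empty."""
        key = bucket_key(features)
        if key not in self.stats:
            return None
        total, wins = self.stats[key]
        return wins / total
```

Returning `None` for an unseen bucket keeps sparse regions of feature space from producing a misleading estimate.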
### 5.5 Rule Discovery and Online Evolution
A potential criticism is that Governor’s rules are hand-crafted heuristics. We address this by documenting the rule *discovery* process: all seven rules emerged from a systematic analysis of 92 pre-deployment traces (documented in our analysis report), not from intuition. The pipeline is:
1. Feature extraction: Compute 8-dimensional feature vectors from raw base sequences (X ratio, E ratio, P ratio, V ratio, switch rate, max X-run, E→V transition probability, P-in-late-half indicator).
2. Correlation analysis: Rank features by Pearson $r$ with binary success/failure outcome. Features with $|r| > 0.1$ become rule candidates.
3. Threshold selection: For each candidate, sweep threshold values and select the one maximizing $\Delta$ success-rate between above-threshold and below-threshold groups.
4. Deployment and adaptation: After deployment, Layer 3 (the $\chi^2$ threshold adaptor) adjusts thresholds based on accumulating production data.
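
The threshold-selection step can be sketched as a sweep over observed feature values. The minimum group size and the use of the absolute difference as the selection criterion are assumptions added here to keep the sweep from picking degenerate splits; the example data are made up.

```python
def sweep_threshold(values, outcomes, min_group=5):
    """Pick the cut that maximizes |success(below) - success(at or above)|.

    values: one feature value per trace; outcomes: 1 = success, 0 = failure.
    Returns (cut, delta) or None if no cut leaves min_group traces on each side.
    """
    best = None
    for cut in sorted(set(values)):
        above = [y for v, y in zip(values, outcomes) if v >= cut]
        below = [y for v, y in zip(values, outcomes) if v < cut]
        if len(above) < min_group or len(below) < min_group:
            continue
        delta = sum(below) / len(below) - sum(above) / len(above)
        if best is None or abs(delta) > abs(best[1]):
            best = (cut, delta)
    return best


# Made-up max X-run values and outcomes for eight traces.
x_runs = [1, 2, 3, 4, 5, 6, 7, 8]
success = [1, 1, 1, 1, 0, 1, 0, 0]
best = sweep_threshold(x_runs, success, min_group=2)
```

With so few samples the selected cut is unstable, which is one reason a post-deployment correction step such as Layer 3 is useful.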
Table [10](https://arxiv.org/html/2606.15579#S5.T10) shows three threshold revisions driven by this feedback loop:

Table 10: Threshold evolution from V1 (initial deployment) to V4 (current). Each revision is driven by production data through the Layer 3 feedback loop, demonstrating that Governor rules self-correct when the data contradicts initial assumptions.

The step_fuse case is particularly instructive: the initial assumption, that long sequences indicate failure, was *wrong*. The Layer 3 adaptor detected that the rule was hurting performance and flagged it for human review, leading to its deactivation. This error-correction capability distinguishes Governor from static rule sets: while the rules are interpretable (not learned by gradient descent), the *thresholds* evolve from data, placing the system on a spectrum between pure heuristics and learned controllers.
## 6 Experiments
### 6.1 Setup
Governor was deployed to the production system on March 31, 2026, creating a natural temporal split:
- Pre-Governor (March 27–30): 101 traces, no intervention
- Post-Governor (March 31+): 246 traces, with intervention

Post\-Governor traces are further split by whether any rule fired:
- Triggered: 193 traces (at least one rule fired)
- Not triggered: 53 traces (no rules fired)
#### Limitations\.
This is a before/after deployment study, not a randomized controlled trial\. Confounds include temporal effects \(task distribution may shift over time\), learning effects \(the user may adapt behavior\), and system improvements unrelated to Governor\. We address these in §[7](https://arxiv.org/html/2606.15579#S7)\.
### 6.2 Main Results
Figure 8: Governor deployment effect analysis (6 dimensions). Top row: success rate comparison, token consumption, and base distribution across pre-Governor, post-triggered, and post-not-triggered groups. Bottom row: per-rule trigger frequency, per-rule success rates, and sequence length distributions.
Table 11: Main experimental results. Governor deployment is associated with +6.2 pp success rate and −44% token consumption.
#### Key findings:
1. Success rate +6.2 pp (88.1% → 94.3%): A meaningful improvement on an already-high baseline.
2. Token consumption −44% (275K → 154K): Governor’s primary mechanism is preventing wasteful exploration spirals, which are the dominant source of token cost.
3. Triggered group outperforms both (96.4%): Tasks where rules fired have the *highest* success rate, not the lowest. Governor’s interventions are therefore positively correlated with good outcomes.
4. Not-triggered group has lowest success (86.8%): These are predominantly short, simple tasks (mean 5.6 steps) whose failures occur in the first few steps before any rule condition is met.
5. Sequence length −37% (22.2 → 14.0 steps): Shorter sequences with higher success indicate more efficient execution paths.
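The headline numbers can be cross-checked from the group sizes and rates reported in this section. The following is a minimal sketch of that arithmetic; the helper name `pooled_success` is ours and not part of the paper's tooling.

```python
# Group sizes and success rates as reported in Sections 6.1 and 6.2.
groups = {
    "triggered": (193, 0.964),
    "not_triggered": (53, 0.868),
}


def pooled_success(groups):
    """Size-weighted mean success rate across groups."""
    total = sum(n for n, _ in groups.values())
    return sum(n * rate for n, rate in groups.values()) / total


pre = 0.881
post = pooled_success(groups)

print(f"post-Governor pooled success: {post:.3f}")
print(f"absolute gain over pre-Governor: {100 * (post - pre):.1f} pp")
print(f"token change: {100 * (154 / 275 - 1):.0f}%")
print(f"sequence length change: {100 * (14.0 / 22.2 - 1):.0f}%")
```

Pooling the triggered and not-triggered groups recovers the reported post-Governor rate, which confirms that the two subgroups partition the 246 post-deployment traces.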
### 6.3 Per-Rule Ablation
Table 12: Per-rule statistics. All active rules achieve ≥93.8% success rate; none degrades performance.
`x_brake` (consecutive exploration brake) is the most impactful rule: it fires in 146/246 post-Governor tasks with 98.6% success rate and the lowest average token cost (171K vs. 275K pre-Governor). This single rule accounts for the majority of Governor’s benefit, confirming that exploration spirals are the dominant source of inefficiency.
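A consecutive-exploration brake can be expressed as a check on the trailing run of X bases. The sketch below is illustrative only: the function names are hypothetical, the default of 12 follows the relaxed threshold reported in §7.2, and firing at "greater than or equal to" the threshold is our assumption rather than a documented detail of Governor.

```python
def trailing_run(sequence: str, base: str) -> int:
    """Length of the uninterrupted run of `base` at the end of `sequence`."""
    run = 0
    for symbol in reversed(sequence):
        if symbol != base:
            break
        run += 1
    return run


def x_brake_fires(sequence: str, threshold: int = 12) -> bool:
    """True once the agent has explored `threshold` times in a row (assumed >=)."""
    return trailing_run(sequence, "X") >= threshold


print(x_brake_fires("PE" + "X" * 12))  # spiral still in progress
print(x_brake_fires("X" * 12 + "E"))   # spiral already broken by an E step
```

Checking only the trailing run keeps the rule stateless: it can be evaluated on the base sequence after every step without calling the LLM.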
### 6.4 Token Efficiency
Table 13: Token cost per successful task. Governor reduces cost-per-success by 41% in the triggered group.
### 6.5 Interaction with Reflexion
Figure 9: Reflexion intervention analysis. Left: success rate with vs. without Reflexion across error counts. Center: error recovery rate by error frequency. Right: token cost and sequence length comparison between Reflexion and non-Reflexion tasks.
Reflexion (Shinn et al., [2023](https://arxiv.org/html/2606.15579#bib.bib6)) is the system’s built-in error recovery mechanism. In our dataset, 163/347 tasks trigger Reflexion (47.0%), with a recovery rate of 90.2% vs. 66.7% without Reflexion (+23.5 pp). Governor and Reflexion are complementary:
- Reflexion handles *individual tool failures*: when a specific action fails, it reflects and retries.
- Governor handles *sequence-level pathologies*: when the overall behavioral pattern is drifting toward failure, regardless of whether any individual tool has failed.
Governor reduces unnecessary Reflexion triggers by preventing the exploration spirals that often lead to tool errors in the first place\. Post\-Governor tasks trigger Reflexion 12% less frequently while maintaining higher success rates\.
#### Controlled Comparison\.
To isolate the contributions of Governor and Reflexion, we compare three configurations on the same task distribution \(Table[14](https://arxiv.org/html/2606.15579#S6.T14)\):
Table 14: Comparison of intervention strategies. Governor + Reflexion recovers 53% of the gap between Reflexion-only and the no-error baseline, while reducing token cost by 43%.
The no-error baseline (tasks with zero tool failures) achieves 97.0% success at 119K tokens, representing the ceiling when no recovery is needed. Reflexion alone recovers from errors at 89.6% success but incurs 2.3× the token cost due to reflection and retry cycles. Adding Governor narrows this gap substantially: Governor + Reflexion achieves 94.3% success (+4.7 pp over Reflexion alone) at 154K tokens (43% reduction vs. Reflexion alone). The improvement comes from Governor preventing the sequence-level pathologies (exploration spirals, planning oscillations) that trigger many of the tool errors Reflexion must then recover from; in effect, Governor reduces the *demand* for Reflexion rather than replacing it.
### 6.6 Trace-Level Optimization Insights
The co\-designed trace format \(§[3\.3](https://arxiv.org/html/2606.15579#S3.SS3)\) reveals optimization opportunities beyond Governor:
\(a\)Skill system analysis: binding effects, skill counts, success rate ranking, and monthly trends\.
\(b\)Memory system analysis: retrieval hit rates, gene pool usage, capsule triggers, and file trends\.
Figure 10: Trace-level subsystem analysis. (a) Skill-bound tasks achieve 96.6% success with 30% shorter sequences. (b) Memory retrieval has 77% miss rate, generating wasted X bases.
#### Skill Injection.
Tasks with high semantic match between the query and injected skills (`avgSemanticScore` > 0.8) produce sequences with 3.1% lower P-ratio. The base sequence pattern X-E-E-E (explore once, then execute) appears 2.4× more frequently when the right skill is injected, vs. X-P-X-E (explore, plan, explore again, then execute) when skills are poorly matched. This suggests that skill injection quality strongly influences whether the agent enters a P-X-P oscillation.
#### Memory Retrieval\.
Of 74 `searchMemory` calls across 51 tasks, only 23% return useful results. Each empty retrieval adds a wasted X base to the sequence. Memory-using tasks have a 90.2% success rate vs. 92.9% for non-memory tasks, *not* because memory is harmful, but because empty X steps increase sequence length and token cost without contributing information. The 4.4 pp success rate advantage of memory *hits* over memory *misses* (93.3% vs. 88.9%) suggests that improving retrieval precision would convert wasted X bases into productive ones.
#### Per\-Base Token Attribution\.
Disaggregating `tokenCost` by base type reveals that X bases are the most expensive per step (average 12.3K tokens/step) due to web search and file reading returning large contexts. E bases average 8.7K, P bases 9.1K, and V bases 5.2K. This explains why reducing X chains (via `x_brake`) has outsized impact on total token consumption.
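To make the effect of shortening X chains concrete, the per-base averages above can be used for a rough cost estimate of a whole sequence. This is a back-of-the-envelope sketch: the function name and the two example sequences are hypothetical, and real per-step costs vary around these means.

```python
# Average tokens per step by base type, as reported in the paragraph above.
AVG_TOKENS_PER_STEP = {"X": 12_300, "E": 8_700, "P": 9_100, "V": 5_200}


def estimated_tokens(sequence: str) -> int:
    """Rough token estimate: sum of the per-base averages over the sequence."""
    return sum(AVG_TOKENS_PER_STEP[base] for base in sequence)


spiral = "XXXXXXEV"   # hypothetical trace with a six-step exploration chain
trimmed = "XXEV"      # same task if the chain stops after two X steps

saved = estimated_tokens(spiral) - estimated_tokens(trimmed)
print(f"{estimated_tokens(spiral)} -> {estimated_tokens(trimmed)} tokens, saving {saved}")
```

Because X is the most expensive base, each exploration step that is avoided saves more than skipping any other step type.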
### 6.7 Cross-System Validation: SWE-agent on SWE-bench
To assess whether the behavioral patterns in §[6.2](https://arxiv.org/html/2606.15579#S6.SS2) generalize beyond DunCrew, we apply the XEPV encoding to 2,000 public SWE-agent trajectories (`nebius/SWE-agent-trajectories` on HuggingFace (Yang et al., [2024b](https://arxiv.org/html/2606.15579#bib.bib17); Jimenez et al., [2024](https://arxiv.org/html/2606.15579#bib.bib18))), spanning three Llama model sizes (70B: $n = 1{,}793$; 8B: $n = 167$; 405B: $n = 40$). The overall resolution rate is 16.9% (338/2,000). The adapter (§[3.4](https://arxiv.org/html/2606.15579#S3.SS4)) maps file navigation commands (`search_dir`, `find_file`, `open`, `goto`, `scroll_*`, `ls`) to X; modification commands (`edit`, `create`) to E; and testing/submission commands (`pytest`, `submit`) to V. SWE-agent’s forced-action architecture produces P = 0% across all trajectories, as every turn must contain a tool command.
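The command-to-base mapping described above can be sketched as a small lookup. This is not the released adapter: the function names are hypothetical, only the commands named in the text are mapped, and silently skipping unmapped commands is our assumption.

```python
from typing import Iterable, Optional

# Command groups exactly as listed in the text; scroll_* is matched by prefix.
EXPLORE = {"search_dir", "find_file", "open", "goto", "ls"}
EXECUTE = {"edit", "create"}
VERIFY = {"pytest", "submit"}


def command_to_base(command: str) -> Optional[str]:
    """Map one SWE-agent command line to X, E, or V (None if unmapped)."""
    parts = command.split()
    if not parts:
        return None
    name = parts[0]
    if name in EXPLORE or name.startswith("scroll_"):
        return "X"
    if name in EXECUTE:
        return "E"
    if name in VERIFY:
        return "V"
    return None  # assumption: commands outside the listed groups are skipped


def encode(commands: Iterable[str]) -> str:
    """Encode a trajectory as a base sequence, dropping unmapped commands."""
    bases = (command_to_base(c) for c in commands)
    return "".join(b for b in bases if b is not None)


trajectory = ["find_file utils.py", "open utils.py", "scroll_down",
              "edit 10:12", "pytest tests/", "submit"]
print(encode(trajectory))
```

No command maps to P, which mirrors the observation that the forced-action architecture yields no planning bases.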
#### Pattern Replication\.
Table [15](https://arxiv.org/html/2606.15579#S6.T15) compares resolved and unresolved trajectories. Two core DunCrew findings replicate with large effect sizes:
Table 15: SWE-agent behavioral comparison: resolved vs. unresolved ($N = 2{,}000$). All metrics show clear separation, confirming that verification frequency and exploration control are cross-system success predictors.
1. E→V verification deficit. Resolved instances transition from Edit to Verify at nearly double the rate of unresolved ones (54.2% vs. 28.1%). DunCrew’s pre-Governor E→V probability was only 2.1%, reflecting architectural differences (DunCrew relies on implicit Critic verification rather than explicit test commands), but the *direction* of the effect is identical: more frequent verification correlates with higher success.
2. Exploration spirals. Unresolved instances show mean max X-run of 11.0 vs. 4.8 for resolved, and X self-loop probability of 84.8% vs. 74.6%. Governor’s `x_brake` rule targets exactly this pattern; the SWE-agent data confirms that exploration spirals are a general failure mode of tool-augmented agents, not a DunCrew artifact.
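Both metrics in the list above are simple functions of the base sequences. The sketch below shows one way to compute a transition probability and a maximum run length; the function names and the two toy sequences are ours, not taken from the toolkit.

```python
from collections import Counter


def transition_prob(sequences, src, dst):
    """Pr(dst | src): share of transitions leaving `src` that land on `dst`."""
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))
    leaving = sum(c for (a, _), c in pair_counts.items() if a == src)
    return pair_counts[(src, dst)] / leaving if leaving else 0.0


def max_run(seq, base):
    """Longest uninterrupted run of `base` in `seq`."""
    best = run = 0
    for symbol in seq:
        run = run + 1 if symbol == base else 0
        best = max(best, run)
    return best


toy = ["XXEVEV", "XXXXEEX"]  # hypothetical sequences for illustration
print(transition_prob(toy, "E", "V"))   # E -> V rate
print(transition_prob(toy, "X", "X"))   # X self-loop rate
print([max_run(s, "X") for s in toy])   # max X-run per trajectory
```

Computing these per outcome group (resolved vs. unresolved) yields the kind of comparison summarized in Table 15.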
#### System\-Specific Patterns\.
The encoding also reveals failure modes unique to SWE-agent. Table [16](https://arxiv.org/html/2606.15579#S6.T16) shows the most discriminative 4-gram patterns:
Table 16: Discriminative 4-grams in SWE-agent trajectories. Frequency is mean count per trajectory. Lift > 1: associated with resolution; lift < 1: associated with failure. EEEE and VVVV are system-specific pathologies absent from DunCrew.
The EEEE pattern (lift = 0.17) represents a “blind-edit spiral” (consecutive file modifications without testing), a failure mode absent from DunCrew, whose Critic mechanism provides implicit verification after edits. The VVVV pattern (lift = 0.00, *never* appearing in resolved instances) represents repeated test execution without intervening edits. These system-specific pathologies demonstrate that while the encoding transfers, effective intervention rules must be tailored per system (§[3.4](https://arxiv.org/html/2606.15579#S3.SS4)).
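The exact lift formula is not restated in this section. One reading consistent with the caption (frequency as mean count per trajectory, lift of 0.00 for a pattern that never appears in resolved instances) is the ratio of mean counts between the two groups. The sketch below implements that assumed definition on hypothetical toy sequences.

```python
def ngram_count(seq: str, pattern: str) -> int:
    """Count overlapping occurrences of `pattern` in `seq`."""
    n = len(pattern)
    return sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern)


def mean_count(seqs, pattern) -> float:
    """Mean number of occurrences of `pattern` per trajectory."""
    return sum(ngram_count(s, pattern) for s in seqs) / len(seqs)


def lift(resolved, unresolved, pattern) -> float:
    """Assumed definition: mean count in resolved / mean count in unresolved."""
    numerator = mean_count(resolved, pattern)
    denominator = mean_count(unresolved, pattern)
    if denominator == 0:
        return float("inf") if numerator > 0 else float("nan")
    return numerator / denominator


resolved = ["XEVEV", "XXEV"]        # hypothetical resolved trajectories
unresolved = ["XEEEEE", "XXEEEEV"]  # hypothetical unresolved trajectories
print(lift(resolved, unresolved, "EEEE"))
print(lift(resolved, unresolved, "EV"))
```

Under this definition a pattern that only occurs in failed trajectories has lift 0, matching the VVVV row, while a pattern concentrated in resolved trajectories has lift above 1.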
#### Model\-Level Fingerprints\.
Table [17](https://arxiv.org/html/2606.15579#S6.T17) reveals that different model sizes produce distinguishable base sequence profiles on the same task distribution:
Table 17: Model-level base sequence profiles on SWE-bench. Larger models show higher V ratios and lower X ratios, producing distinctive behavioral fingerprints (cf. Direction 6, §[7.3](https://arxiv.org/html/2606.15579#S7.SS3)).
Llama-405B allocates 26.1% of its steps to verification vs. 17.0% for both smaller models, while spending proportionally less time exploring (X = 34.0% vs. ∼43–45%). This correlation between verification frequency and resolution rate (42.5% vs. ∼16–18%) is consistent with the E→V deficit hypothesis: larger models may have learned to verify more frequently through training, producing a measurably different behavioral fingerprint (§[7.3](https://arxiv.org/html/2606.15579#S7.SS3)).
## 7 Discussion
### 7.1 Limitations
#### Non\-randomized design\.
Our before/after comparison cannot fully exclude confounds\. Task distribution, user behavior, and system improvements may all contribute to the observed gains\. A randomized A/B test with concurrent control would provide stronger causal evidence\.
#### Single system\.
All data comes from one agent system \(DunCrew\)\. While the base encoding scheme is general \(any ReAct agent produces tool call sequences that can be classified\), the specific patterns \(P\-X\-P risk, X self\-loop rates\) may differ across systems, models, and task domains\.
We offer two structural arguments for partial universality, supported by our cross-system validation (§[6.7](https://arxiv.org/html/2606.15579#S6.SS7)). First, the E→V verification deficit is a product of autoregressive generation: without an explicit verification prompt, the model’s next-token distribution naturally favors continued editing over switching to a test action. This bias is architecture-level, not system-specific: SWE-agent exhibits $\Pr(\text{V} \mid \text{E}) = 0.281$ for resolved instances vs. only $0.135$ for unresolved ones, closely mirroring DunCrew’s pattern. Second, exploration spirals (consecutive X runs) arise whenever a search action returns ambiguous results that prompt further search; this is inherent to any tool-augmented agent navigating a file system. SWE-agent’s mean max X-run length is 4.8 for resolved and 11.0 for unresolved instances, confirming the pattern.
Importantly, the encoding also reveals *system-specific* failure modes: SWE-agent produces P = 0% (its forced-action architecture eliminates pure planning steps) and exhibits a distinctive EEEE “blind-edit spiral” (lift = 0.17, a strong negative predictor) not observed in DunCrew. These differences underscore that while the *encoding scheme* transfers, the *intervention rules* require per-system calibration, which is exactly the role of the adapter interface proposed in §[3.4](https://arxiv.org/html/2606.15579#S3.SS4).
#### Sample size\.
With $N = 347$, our n-gram analysis is limited to 2-grams and 3-grams. Higher-order patterns (4-grams, 5-grams) require substantially more data.
### 7.2 Scaling Laws of Base Sequence Analysis
The combinatorial space of base sequences grows exponentially: 2-grams have $4^2 = 16$ combinations, 3-grams have $4^3 = 64$, 4-grams have $4^4 = 256$, and 5-grams have $4^5 = 1{,}024$. For statistically reliable analysis (e.g., ≥20 occurrences per pattern at $p < 0.05$), the required dataset sizes scale accordingly:
Table 18: Data requirements for statistically reliable n-gram analysis at different orders.
At the current rate of ∼10 tasks/day for a single user, 4-gram analysis requires roughly one year of continuous usage, and 5-gram analysis is entirely infeasible for any individual.
#### Preliminary Scaling Characteristics\.
Our experience across three dataset sizes—92 traces \(initial deployment\), 347 traces \(this paper\), and 2,000 traces \(SWE\-agent cross\-validation\)—reveals early scaling characteristics that, while not yet constituting a formal scaling law, provide directional evidence:
1. Threshold convergence. Governor thresholds stabilize as data increases. The `consecutiveXBrake` threshold relaxed from 8 to 12 after the initial 92-trace analysis showed that consecutive X runs of 8–12 had <1 pp success rate difference. The `step_fuse` rule was entirely disabled when 347 traces revealed that long sequences correlate with *higher* success, an assumption reversal impossible to detect at $N = 92$.
2. Pattern emergence. Core patterns (P-X-P risk, E→V deficit, X spirals) were identifiable at $N = 92$ but only became statistically significant at $N = 347$. Higher-order patterns like EEEE (lift = 0.17) and XEVE (lift = 2.05) required $N = 2{,}000$ to achieve reliable discrimination, consistent with the combinatorial scaling in Table [18](https://arxiv.org/html/2606.15579#S7.T18).
3. Cross-system stability. The E→V deficit and X-spiral patterns replicated from DunCrew (347 traces) to SWE-agent (2,000 traces) despite architectural differences, suggesting these are fundamental properties of autoregressive agents rather than artifacts of small samples.
These observations suggest that base sequence analysis exhibits data\-scaling properties analogous to those in language modeling: more data enables detection of rarer but meaningful patterns, threshold estimates converge toward stable optima, and certain patterns are “emergent” at specific scale thresholds\. Formalizing these relationships into predictive scaling laws requires datasets spanning multiple orders of magnitude—a key motivation for the open\-source toolkit release\.
### 7.3 The Cerebellum Hypothesis and Research Directions
#### The Cerebellum Hypothesis\.
We propose an architectural analogy for thinking about the role of base sequence governance in agent systems:
- The LLM is the *cerebrum*: responsible for reasoning, creativity, and semantic understanding.
- The tool framework (ReAct loop, skills, memory) provides the *limbs*: the ability to act on the world.
- Base sequence governance is the *cerebellum*: responsible for motor coordination and temporal sequencing.
The human cerebellum contains approximately 69 billion neurons—more than the cerebral cortex—because motor coordination requires learning from vast amounts of movement experience\. Analogously, a mature base sequence governance system would need to learn from millions of execution traces to discover the complex, high\-order sequential patterns that determine success in diverse agent tasks\.
Our current Governor, with its 7 hand\-crafted rules operating on 347 traces, is analogous to a primitive reflex arc: effective for the most common failure modes, but far from a true cerebellum\. The path from reflex to cerebellum requires progressing through concrete research directions, which we outline below\.
#### Direction 1: Base Sequence Language Model\.
The first-order Markov transition matrix presented in §[4](https://arxiv.org/html/2606.15579#S4) is, in effect, a bigram language model over the XEPV alphabet. A natural extension is to train explicit n-gram or recurrent models (RNN/LSTM) on base sequences as a “language,” with three output heads: (a) *next-base prediction*: given a prefix $b_1\text{-}\cdots\text{-}b_t$, predict the distribution over $b_{t+1}$; (b) *success probability estimation*: predict task success from the partial sequence; and (c) *perplexity scoring*: flag when the agent’s actual next base deviates significantly from the model’s expectation, indicating anomalous behavior. Even with $N = 347$, a smoothed trigram model can be fit; scaling to $N > 3{,}000$ would enable reliable 4-gram models. This would replace Governor’s hand-crafted rules with a learned statistical model while retaining full interpretability. We note that in our preliminary analysis (Figure [11](https://arxiv.org/html/2606.15579#S7.F11)b), the perplexity difference between successful and failed tasks is not yet statistically significant ($p = 0.51$), underscoring that this direction requires substantially larger datasets to yield reliable discrimination.
Figure 11: Preliminary base sequence language model analysis on $N = 347$ traces. (a) Bigram surprisal matrix showing information content of each transition. (b) Perplexity distribution for successful vs. failed tasks. (c) Early warning power (AUC) as a function of observed prefix length. (d) Bigram model next-base prediction accuracy by current base type.
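A bigram model over the XEPV alphabet with perplexity scoring fits in a few lines. The sketch below uses additive (add-alpha) smoothing, which is our choice for illustration; the paper does not state which smoothing method its preliminary analysis used, and the training sequences here are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "XEPV"


def fit_bigram(sequences, alpha=1.0):
    """Estimate Pr(next | current) with additive smoothing (assumed method)."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    model = {}
    for a in ALPHABET:
        row_total = sum(counts[(a, b)] for b in ALPHABET) + alpha * len(ALPHABET)
        model[a] = {b: (counts[(a, b)] + alpha) / row_total for b in ALPHABET}
    return model


def perplexity(model, seq):
    """Per-transition perplexity of `seq`; NaN if there are no transitions."""
    if len(seq) < 2:
        return float("nan")
    log_prob = sum(math.log(model[a][b]) for a, b in zip(seq, seq[1:]))
    return math.exp(-log_prob / (len(seq) - 1))


train = ["XEEEV", "XEEV", "XXEEV"]  # hypothetical successful traces
model = fit_bigram(train)
print(perplexity(model, "XEEV"))  # familiar pattern
print(perplexity(model, "PXPX"))  # pattern never seen in training
```

Sequences that resemble the training data score lower perplexity than sequences built from unseen transitions, which is the signal head (c) would threshold on.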
#### Direction 2: Base\-Conditioned Decoding\.
When the base sequence language model predicts that a particular next-base type (e.g., P) would lower success probability below a threshold, this signal can be used as a *soft constraint* on the LLM’s tool-call generation. Concretely, at inference time, a bias term is applied to the LLM’s output logits to down-weight tool calls that would be classified as the high-risk base type. This is a form of sequence-level guided decoding, analogous to KL-constrained reward maximization in RLHF (Bai et al., [2022](https://arxiv.org/html/2606.15579#bib.bib2)), but operating on the behavioral sequence layer rather than the token layer. The key distinction is that the constraint is defined over an abstract action alphabet (XEPV) rather than over natural language tokens, making it model-agnostic and applicable to any LLM backend. To our knowledge, this formulation, in which decoding guidance comes from a behavioral sequence model, has no direct precedent in the literature.
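The soft constraint can be illustrated at the level of candidate tool calls. The sketch below is a simplification: it assumes one scalar logit per candidate tool (real LLM logits are per token), and the tool names, the `penalty` value, and both function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a dict of scores."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def bias_logits(tool_logits, tool_to_base, risky_base, penalty=2.0):
    """Subtract `penalty` from every tool whose base is the high-risk type."""
    return {
        tool: (logit - penalty) if tool_to_base[tool] == risky_base else logit
        for tool, logit in tool_logits.items()
    }


# Hypothetical candidate tools with one logit each, plus their base labels.
tool_logits = {"make_plan": 1.2, "read_file": 1.0, "run_tests": 0.4}
tool_to_base = {"make_plan": "P", "read_file": "X", "run_tests": "V"}

before = softmax(tool_logits)
after = softmax(bias_logits(tool_logits, tool_to_base, risky_base="P"))
print(before["make_plan"], after["make_plan"])
```

The penalty lowers the probability of the risky base without forbidding it, which is what distinguishes a soft constraint from a hard action mask.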
#### Direction 3: Sequence Anomaly Detection\.
Governor’s rules detect *known* pathological patterns. A complementary approach is to detect *unknown* anomalies by learning the distribution of successful base sequences and flagging deviations. Standard sequence anomaly detection methods (e.g., variational autoencoders over discrete sequences, or isolation forests on the 8-dimensional feature space) can be trained on the success-labeled dataset. Preliminary analysis on our $N = 347$ data shows that the 8-dimensional feature vectors of failed tasks occupy a visually distinct region of the feature space (high P-ratio, low E-ratio, short length), suggesting directional promise. However, at this sample size the best anomaly detector achieves only F1 = 0.17 (Figure [12](https://arxiv.org/html/2606.15579#S7.F12)d), reflecting the severe class imbalance (only 26 failures out of 347) and the need for >1K traces to train reliable detectors.
Figure 12: Preliminary sequence anomaly detection on $N = 347$ traces. (a) PCA projection of the 8-dimensional feature space; failure cases (red crosses) cluster separately from success cases. (b) Anomaly score distribution with optimal threshold. (c) Per-feature deviation of failure vs. success groups. (d) Detection precision, recall, and F1 as a function of threshold.
#### Direction 4: Dual\-Stream Agent Architecture\.
Current agent architectures are *single-stream*: the same LLM is responsible for both semantic reasoning (understanding the task, generating plans) and action selection (choosing which tool to call). The base sequence framework suggests a dual-stream architecture:
- Semantic stream (LLM): responsible for task understanding, reasoning, and tool parameter generation.
- Sequence stream (base model): responsible for behavioral coordination, pacing, and pattern monitoring.
At each step, the semantic stream proposes an action; the sequence stream evaluates whether this action is “reasonable” given the current base sequence context \(e\.g\., “another P after P\-X would create a P\-X\-P pattern”\), and applies a gating mechanism to approve, modify, or suggest an alternative action type\. This is the cerebellum hypothesis made concrete—from metaphor to architecture\. The sequence stream can be orders of magnitude smaller than the LLM \(a few million parameters suffice for sequence modeling over a 4\-letter alphabet\), adding negligible latency\.
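The gating step can be sketched as a lookup on the trigram that a proposed action would complete. This is a deliberately minimal illustration: the function name, the two verdict strings, and the rule set containing only P-X-P are assumptions, and a real sequence stream would use a learned model rather than a fixed set.

```python
# High-risk trigrams the gate refuses to complete; only P-X-P is taken from the text.
HIGH_RISK_TRIGRAMS = {"PXP"}


def gate(history: str, proposed: str) -> str:
    """Return 'approve' or 'redirect' for the base proposed by the semantic stream."""
    candidate = (history + proposed)[-3:]
    if candidate in HIGH_RISK_TRIGRAMS:
        return "redirect"  # e.g. ask the semantic stream for an E or V action
    return "approve"


print(gate("EPX", "P"))  # another P after P-X would create P-X-P
print(gate("EPX", "E"))  # executing after P-X is fine
```

Because the check only inspects the last three bases, it runs in constant time per step and adds no LLM calls.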
#### Direction 5: Base Sequence Reward Model\.
Current agent reinforcement learning approaches (e.g., RAGEN (Wang et al., [2025](https://arxiv.org/html/2606.15579#bib.bib11))) rely on sparse, task-level reward signals (success/failure). Base sequences can provide dense, step-level rewards: each base transition $b_t \to b_{t+1}$ can be assigned an immediate reward based on its historical association with task success. For example, E→V (execute-then-verify) receives positive reward; the subsequence P-X-P receives negative reward. This transforms the reward shaping problem in agent RL into a base sequence statistics problem. From our data, we can already compute the empirical reward landscape: E→V transitions appear in tasks with 100% success rate, while P→X→P subsequences appear in tasks with 83.3% success rate. At $N = 347$, however, the mean cumulative reward difference between successful and failed traces is not statistically significant ($p = 0.25$, Figure [13](https://arxiv.org/html/2606.15579#S7.F13)d), indicating that the reward signal is directionally correct but too noisy at this scale to serve as a standalone training signal. Scaling this to larger datasets would yield increasingly fine-grained reward signals, potentially enabling sample-efficient agent RL without requiring millions of full task rollouts.
Figure 13: Empirical reward landscape from $N = 347$ traces. (a) Bigram reward matrix showing per-transition success rate deviation from global baseline (92.5%). (b) Top and bottom trigram rewards. (c) Cumulative reward trajectories for sampled successful (green) and failed (red) traces. (d) Total sequence reward vs. token consumption.
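Following the description of panel (a), a transition's reward can be taken as the success rate of traces containing it minus the global baseline. The sketch below implements that reading on hypothetical traces; counting each transition once per trace and summing rewards along a sequence are our assumptions.

```python
from collections import defaultdict


def transition_rewards(traces):
    """Reward per bigram: success rate of traces containing it minus the baseline.

    `traces` is a list of (sequence, succeeded) pairs.
    """
    baseline = sum(ok for _, ok in traces) / len(traces)
    outcomes = defaultdict(list)
    for seq, ok in traces:
        for pair in set(zip(seq, seq[1:])):  # count each bigram once per trace
            outcomes[pair].append(ok)
    return {a + b: sum(oks) / len(oks) - baseline for (a, b), oks in outcomes.items()}


def sequence_reward(seq, rewards):
    """Cumulative reward of a sequence; unseen transitions contribute zero."""
    return sum(rewards.get(a + b, 0.0) for a, b in zip(seq, seq[1:]))


# Hypothetical labelled traces for illustration.
traces = [("XEV", True), ("XEEV", True), ("PXPX", False), ("XPE", True)]
rewards = transition_rewards(traces)
print(rewards["EV"], rewards["PX"])
print(sequence_reward("XEV", rewards), sequence_reward("PXP", rewards))
```

Each step of a new rollout can then be scored immediately, which is what turns the sparse task-level outcome into a dense signal.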
#### Direction 6: Base Sequence Fingerprinting\.
Our cross-system analysis (§[6.7](https://arxiv.org/html/2606.15579#S6.SS7)) reveals that base sequence profiles are not only diagnostic tools but also behavioral identity signatures. Different models and agent frameworks exhibit distinctive base distributions, transition patterns, and per-base token costs that form recognizable “fingerprints.” Within the SWE-agent framework alone, we observe striking model-level differences: `llama-405b` exhibits a naturally high V-ratio (20.2%) correlating with its superior resolution rate (42.5%), while `llama-8b` is markedly more X-heavy (48.8%) and requires longer trajectories (39.0 steps on average). At the framework level, DunCrew allocates 13.4% of bases to explicit planning (P) while SWE-agent produces exactly 0% P bases due to its forced-action architecture.
We formalize this as a behavioral fingerprint vector $\mathbf{f} \in \mathbb{R}^d$ composed of: (a) base distribution: the 4-dimensional $[X\%, E\%, P\%, V\%]$ profile; (b) transition probabilities: the $4 \times 4$ Markov matrix flattened to 16 dimensions; (c) efficiency metrics: per-base token cost $[\text{tok}/X, \text{tok}/E, \text{tok}/P, \text{tok}/V]$; and (d) structural features: $[\text{avg\_length}, \text{max\_X\_run}, \text{first\_V\_position}, \text{switch\_rate}]$. This yields a ∼28-dimensional fingerprint that can be computed from as few as 50 traces.
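The four components can be assembled into a single vector as sketched below. Several details are assumptions on our part because the text does not fix them: structural features are averaged across traces, `first_V_position` falls back to the sequence length when no V occurs, and `switch_rate` is the fraction of adjacent steps whose base changes.

```python
from collections import Counter
from itertools import groupby

BASES = "XEPV"


def max_run(seq, base):
    """Longest uninterrupted run of `base` in `seq`."""
    return max((len(list(g)) for k, g in groupby(seq) if k == base), default=0)


def fingerprint(traces):
    """Build a 28-dim vector from (sequence, per_step_token_costs) pairs."""
    base_counts, pair_counts, token_sums = Counter(), Counter(), Counter()
    lengths, x_runs, first_vs, switch_rates = [], [], [], []
    for seq, costs in traces:
        pairs = list(zip(seq, seq[1:]))
        base_counts.update(seq)
        pair_counts.update(pairs)
        for base, cost in zip(seq, costs):
            token_sums[base] += cost
        lengths.append(len(seq))
        x_runs.append(max_run(seq, "X"))
        first_vs.append(seq.index("V") if "V" in seq else len(seq))
        switch_rates.append(sum(a != b for a, b in pairs) / max(len(pairs), 1))

    def mean(values):
        return sum(values) / len(values)

    total = sum(base_counts.values())
    distribution = [base_counts[b] / total for b in BASES]            # (a) 4 dims
    transitions = []                                                  # (b) 16 dims
    for a in BASES:
        row = sum(pair_counts[(a, b)] for b in BASES)
        transitions += [pair_counts[(a, b)] / row if row else 0.0 for b in BASES]
    token_cost = [token_sums[b] / base_counts[b] if base_counts[b] else 0.0
                  for b in BASES]                                     # (c) 4 dims
    structure = [mean(lengths), mean(x_runs), mean(first_vs), mean(switch_rates)]  # (d)
    return distribution + transitions + token_cost + structure


# Two hypothetical traces with per-step token costs.
traces = [("XXEV", [12000, 13000, 9000, 5000]), ("XEEV", [11000, 8000, 9000, 5500])]
fp = fingerprint(traces)
print(len(fp))
```

Two fingerprints built this way can be compared with any vector distance, although the token-cost entries would need rescaling first since they are orders of magnitude larger than the ratios.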
Potential applications include:*model identification*from behavioral traces alone \(without API access\),*standardized agent benchmarking*\(“how do GPT\-4 and Claude\-3 differ in base sequence profiles on equivalent tasks?”\),*behavioral drift detection*\(flagging when a model’s fingerprint deviates from its baseline after an update\), and*product differentiation*\(different agent products using the same LLM will produce different fingerprints due to framework design choices\)\. Unlike the other research directions, fingerprinting can be implemented at Level 0 with only a base classifier and adapter—no large\-scale data required\.
#### Maturity Levels\.
These six directions map onto a maturity progression:
- Level 0 (this paper): Hand-crafted rules and behavioral fingerprinting. 7 rules, 347 traces.
- Level 1 (Directions 1, 3, 6): Statistical/learned models replacing hand-crafted rules, plus cross-system fingerprint databases. ∼3K–5K traces.
- Level 2 (Directions 2, 5): Integration with LLM inference and RL training. ∼50K traces.
- Level 3 (Direction 4): Full dual-stream architecture with a dedicated sequence coordination model. ∼1M+ traces, requiring community-scale data.
Level 3 cannot be achieved by any single deployment\. It requires a community\-wide effort to collect, standardize, and share base sequence data\. The XEPV encoding is universal—any ReAct\-style agent can produce base sequences—making this a feasible, if ambitious, goal\. We release our encoding specification and analysis tools to facilitate this\.
### 7.4 Generalizability
The base encoding scheme assumes a ReAct-style agent with observable tool calls. It does not apply to:
- Pure chat models without tool use
- End-to-end RL agents where actions are not decomposable into discrete tool calls
- Multi-agent systems where coordination between agents introduces a fifth dimension
For multi\-agent systems, one could extend the alphabet \(e\.g\., addingCfor Coordinate\) or analyze each agent’s sequence independently\. We leave this to future work\.
## 8 Conclusion
We have shown that encoding LLM agent behavior as base sequences (compact symbolic representations using a four-letter alphabet) enables powerful behavioral analysis and practical runtime intervention. Our framework reveals that planning oscillation (P-X-P) is the dominant failure mode, that agents almost never verify their work (E→V = 2.1%), and that excessive planning is the strongest predictor of failure ($r = -0.256$). Governor, a three-layer runtime system built on these findings, achieves simultaneous improvement in success rate (+6.2%) and token efficiency (−44%) with zero LLM overhead. Crucially, Governor’s rules are not static heuristics: they emerged from data-driven analysis of 92 initial traces (§[5.5](https://arxiv.org/html/2606.15579#S5.SS5)) and continue to evolve through online chi-square testing, with thresholds that have already adapted and one rule disabled after the Layer 3 adaptor flagged it for human review.
Cross-system validation on 2,000 publicly available SWE-agent trajectories (§[6.7](https://arxiv.org/html/2606.15579#S6.SS7)) demonstrates that two of our three core findings, exploration spirals (X→X self-loop) and the E→V verification deficit, replicate in an independent agent framework with a structurally different action space. The analysis further reveals model-level behavioral fingerprints: larger models exhibit naturally higher verification rates, suggesting that base sequence profiles may serve as a universal behavioral metric across models and systems.
More broadly, we argue that base sequence analysis opens a new dimension of agent observability\. Just as genomics transformed biology by providing a symbolic language for life’s instructions, base sequences provide a symbolic language for agent behavior\. The six research directions outlined in §[7\.3](https://arxiv.org/html/2606.15579#S7.SS3)—from base sequence language models to behavioral fingerprinting—chart a concrete path from the hand\-crafted rules of this proof of concept toward learned behavioral governance\. Realizing this vision requires data at a scale that no individual can generate alone\.
Your agent has a genome\. It is time we learned to read it\.
#### Data and System Availability\.
The DunCrew system used for data collection is available at [https://duncrew.com](https://duncrew.com/). The XEPV base encoding specification, adapter interface, 8-dimensional feature extraction, and Governor rule definitions are described in sufficient detail in this paper for independent reimplementation. We release the Base Sequence Toolkit, including the base classifier, SWE-agent adapter, and analysis pipeline, as open-source software at [https://github.com/FatBy/base-sequence-toolkit](https://github.com/FatBy/base-sequence-toolkit) (Base Sequence Toolkit, [2025](https://arxiv.org/html/2606.15579#bib.bib1)), enabling community-scale cross-system behavioral analysis.
## References
- Base Sequence Toolkit [2025] Base Sequence Toolkit: XEPV analysis framework for AI agent execution traces. [https://github.com/FatBy/base-sequence-toolkit](https://github.com/FatBy/base-sequence-toolkit), 2025.
- Bai et al. [2022] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. *arXiv preprint arXiv:2212.08073*, 2022.
- Liu et al. [2024] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, et al. AgentBench: Evaluating LLMs as agents. In *ICLR*, 2024.
- Rebedea et al. [2023] T. Rebedea, R. Dinu, M. Sreedhar, C. Parisien, and J. Cohen. NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails. In *EMNLP System Demonstrations*, 2023.
- Schick et al. [2023] T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. In *NeurIPS*, 2023.
- Shinn et al. [2023] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. In *NeurIPS*, 2023.
- Sumers et al. [2024] T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths. Cognitive architectures for language agents. *Transactions on Machine Learning Research (TMLR)*, 2024.
- van der Aalst [2016] W. M. P. van der Aalst. *Process Mining: Data Science in Action*. Springer, 2nd edition, 2016.
- Wang et al. [2023] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. *arXiv preprint arXiv:2305.16291*, 2023.
- Wang et al. [2024] X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, et al. OpenHands: An open platform for AI software developers as generalist agents. In *ICLR*, 2025.
- Wang et al. [2025] Z. Wang, K. Wang, Q. Wang, P. Zhang, L. Li, Z. Yang, et al. RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. *arXiv preprint arXiv:2504.20073*, 2025.
- Wei et al. [2022] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In *NeurIPS*, 2022.
- Yang et al. [2024] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In *NeurIPS*, 2024.
- Yao et al. [2023] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In *ICLR*, 2023.
- Yao et al. [2024] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In *NeurIPS*, 2023.
- Zhou et al. [2024] A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In *ICML*, 2024.
- Yang et al. [2024b] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. *arXiv preprint arXiv:2405.15793*, 2024.
- Jimenez et al. [2024] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In *ICLR*, 2024.
## Appendix A Base Classifier Tool Categories
Table 19: Tool\-to\-base classification categories\.
## Appendix B Governor Prompt Injection Examples
Table 20: Representative prompt injection templates for Governor rules\.
## Appendix C Reproducibility
All analysis in this paper is reproducible via five Python scripts operating on the raw JSONL trace files:
- reanalysis\.py: Core base sequence analysis \(Figures 2–7\)
- reanalysis\_supplement\.py: Intervention effects and skill analysis \(Figures 9–10\)
- reanalysis\_memory\.py: Memory system analysis \(Figure 10, memory subfigure\)
- reanalysis\_governor\.py: Governor effect analysis \(Figure 8\)
- reanalysis\_future\.py: System architecture diagram and future directions analysis \(Figures 1, 11–13\)
The Governor implementation \(baseSequenceGovernor\.ts\) is approximately 920 lines of TypeScript with no external dependencies beyond the base classifier\.