APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents

arXiv cs.AI 06/16/26, 04:00 AM Papers
self-improvement ai-agents framework multi-axis-evolution production-systems edge-ai
Summary
APEX proposes a three-layer self-evolution framework for production AI agents that simultaneously optimizes the harness, behavioural principles, and workflow topology. Experiments on a production agent show significant improvements in health score and workflow quality with minimal LLM calls.
arXiv:2606.15363v1 Announce Type: new Abstract: Self-improvement in AI agents has emerged as a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework [1] achieves 14--21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness. However, Self-Harness optimises only one dimension -- the prompt harness -- leaving behavioural principles and workflow topology unchanged. We propose APEX (Adaptive Principle EXtraction), a three-layer co-evolution framework that simultaneously evolves: (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation [2], and (L3) the agent workflow topology via structural fitness-based selection [6]. We implement APEX on Joe [13], a production-grade super AI Agent built on NVIDIA Nemotron and designed as an Edge AI Agent Factory for the NVIDIA Agent Challenge 2026, managing a 15-node compute fleet using 114 real task traces collected over 18 days. APEX achieves an APEX Health Score of 0.570 (+90% vs. baseline 0.300) in a single evolutionary run, distilling 6 novel reusable principles and selecting a research-first workflow topology scoring 0.900 (+20%). Our results demonstrate that multi-dimensional co-evolution substantially outperforms single-axis harness optimisation, at a cost of only 4 LLM calls (~270 s) on a local qwen2.5-coder:32b instance.
# Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents
Source: [https://arxiv.org/html/2606.15363](https://arxiv.org/html/2606.15363)
Ya-Chuan Chen∗, Tien-Jen Lai, Hsiang-Wei Hu

Grace AI Technology

joycechen108@gmail.com, applelai001@gmail.com, hw.hsiang.wei@gmail.com

∗Correspondence: joycechen108@gmail.com

\(June 13, 2026 arXiv preprint\)

###### Abstract

Self-improvement in AI agents has emerged as a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework \[[1](https://arxiv.org/html/2606.15363#bib.bib1)\] achieves 14–21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness. However, Self-Harness optimises only one dimension, the *prompt harness*, leaving behavioural principles and workflow topology unchanged. We propose Apex (Adaptive Principle EXtraction), a three-layer co-evolution framework that simultaneously evolves: (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation \[[2](https://arxiv.org/html/2606.15363#bib.bib2)\], and (L3) the agent workflow topology via structural fitness-based selection \[[6](https://arxiv.org/html/2606.15363#bib.bib6)\]. We implement Apex on Joe \[[13](https://arxiv.org/html/2606.15363#bib.bib13)\], a production-grade super AI Agent built on NVIDIA Nemotron and designed as an *Edge AI Agent Factory* for the NVIDIA Agent Challenge 2026, managing a 15-node compute fleet using 114 real task traces collected over 18 days. Apex achieves an APEX Health Score of 0.570 (+90% vs. baseline 0.300) in a single evolutionary run, distilling 6 novel reusable principles and selecting a research-first workflow topology scoring 0.900 (+20%). Our results demonstrate that multi-dimensional co-evolution substantially outperforms single-axis harness optimisation, at a cost of only 4 LLM calls (≈270 s) on a local `qwen2.5-coder:32b` instance.

## 1 Introduction

Modern AI agents deployed in production environments face a fundamental challenge: the initial configuration—system prompt, workflow structure, decision rules—becomes stale as the environment, user needs, and task distributions evolve\. Traditional responses require slow, expensive manual prompt\-engineering cycles disconnected from production realities\.

Automated self-improvement has emerged as a promising solution. *Self-Harness* \[[1](https://arxiv.org/html/2606.15363#bib.bib1)\] clusters failure trajectories and proposes harness patches; *EvolveR* \[[2](https://arxiv.org/html/2606.15363#bib.bib2)\] distils successful execution traces into reusable behavioural principles; *EvoAgentX* \[[3](https://arxiv.org/html/2606.15363#bib.bib3)\] applies gradient-like text optimisation and AFlow \[[6](https://arxiv.org/html/2606.15363#bib.bib6)\] DAG topology search to improve agent workflows; *Reflexion* \[[9](https://arxiv.org/html/2606.15363#bib.bib9)\] enables agents to self-improve via verbal reflection over prior trajectories; *Symbolic Learning* \[[11](https://arxiv.org/html/2606.15363#bib.bib11)\] propagates natural language “gradients” through agent pipelines. However, each method targets a *single axis of improvement*, leaving the others fixed and unexploited.

We argue that production agents require multi-axis co-evolution: the harness, internalised behavioural principles, and workflow structure must evolve together. A perfect harness with a suboptimal workflow still fails systematically; well-evolved principles with a harness that misses critical failure modes degrade under distribution shift; optimal workflow topology without grounding behavioural rules produces structurally correct but contextually wrong decisions. These three axes address *orthogonal* failure modes.

This paper introduces Apex, a unified three-layer self-evolution framework implementing all three axes as a single orchestrated pipeline. Our contributions are:

1. Apex framework: A three-layer co-evolution pipeline (L1: harness patching, L2: principle distillation, L3: workflow topology evolution) operating on a shared production trace pool, with no synthetic benchmark required.
2. APEX Health Score: A composite metric for measuring multi-dimensional agent evolution progress, separating harness coverage, principle richness, and structural workflow quality.
3. Production validation: The first evaluation of a three-axis agent self-improvement system on real-world traces (114 tasks, 18 days, 15-node fleet), achieving a +90% improvement over the untuned baseline.
4. Open-source release: Full implementation in three composable Python modules (`joe_apex.py`, `joe_apex_distill.py`, `joe_apex_workflow.py`) deployable on any agent with a trace database and local Ollama instance.

## 2 Related Work

#### Harness-based self-improvement.

Ye et al. \[[1](https://arxiv.org/html/2606.15363#bib.bib1)\] propose a three-step loop: *Weakness Mining* clusters trajectory failures; *Harness Proposal* generates new rules via LLM; *Proposal Validation* runs a mini-benchmark to accept or reject each proposal. The method achieves a 14–21% improvement on Terminal-Bench-2.0. *Limitation:* it operates only on failure trajectories and ignores successful behavioural patterns.

#### Trace\-based principle distillation\.

Li et al. \[[2](https://arxiv.org/html/2606.15363#bib.bib2)\] propose offline distillation of principles from successful traces, followed by online application at inference. Key insight: learning from success generalises better than patching failures alone. *Limitation:* no harness modification; no workflow structure search.

#### Workflow topology optimisation\.

Zhang et al. \[[6](https://arxiv.org/html/2606.15363#bib.bib6)\] reformulate workflow optimisation as a code-level search problem using Monte Carlo Tree Search over LLM-invocation DAGs, achieving 5.7% average improvement over state-of-the-art baselines and enabling smaller models to match GPT-4o at 4.55% of inference cost. Wang et al. \[[3](https://arxiv.org/html/2606.15363#bib.bib3)\] combine TextGrad \[[7](https://arxiv.org/html/2606.15363#bib.bib7)\] prompt optimisation, AFlow DAG topology search, and MIPRO few-shot selection, achieving +7–20% on GAIA/MBPP. *Limitation for both:* they require curated benchmarks and do not exploit per-task production trace signals. A recent survey \[[10](https://arxiv.org/html/2606.15363#bib.bib10)\] identifies the absence of production-trace-driven methods as a key open problem that Apex directly addresses.

#### Verbal reinforcement and symbolic learning\.

Yao et al. \[[8](https://arxiv.org/html/2606.15363#bib.bib8)\] introduce ReAct, interleaving reasoning traces with action calls. Shinn et al. \[[9](https://arxiv.org/html/2606.15363#bib.bib9)\] extend this with Reflexion, storing verbal self-reflections in episodic memory. Wang et al. \[[11](https://arxiv.org/html/2606.15363#bib.bib11)\] propagate natural language gradients through agent pipelines for self-evolution. Apex’s L1/L2 layers can be viewed as an offline, batch variant: they systematically extract patches and principles from accumulated trajectories rather than from single-episode reflections.

#### Continual adaptation\.

Gao et al. \[[4](https://arxiv.org/html/2606.15363#bib.bib4)\] propose continuous LoRA fine-tuning without task boundary annotations. Anonymous \[[12](https://arxiv.org/html/2606.15363#bib.bib12)\] study online agent adaptation without gradient updates. These complement Apex as a prospective Layer 4 for weight-level evolution.

Key Gap. No prior method simultaneously evolves harness (L1), behavioural principles (L2), *and* workflow topology (L3) from a single shared production trace pool. Apex closes this gap with a unified 3-layer pipeline requiring no synthetic benchmark and no external API dependencies.

## 3 APEX Framework

### 3.1 Architecture Overview

Apex takes as input a trace database containing timestamped task execution records. Each record includes the task description, execution log, lesson learned, files changed, and an optional outcome score. From this shared pool, three pipelines operate in parallel: L1 selects *failure* traces; L2 selects high-quality *success* traces; L3 uses structural scoring on workflow candidates independent of trace content. Their outputs (harness patches, behavioural principles, and a selected topology) are aggregated into an updated agent configuration deployed in the next generation.

Figure 1: APEX evolution pipeline. All three layers draw from a shared production trace pool (`joe_learned_tasks.db`, 114 tasks, 18-day span). L1 extracts failure traces for harness patching (3 patches); L2 selects high-quality success traces for principle distillation (6 principles); L3 evaluates structural fitness of workflow topology candidates ($\tau^{*} = 0.900$). The APEX Health Aggregation step yields $H = 0.570$ (+90% vs. baseline).

### 3.2 Layer 1: Harness Review (Self-Harness Variant)

Layer 1 identifies systemic failure modes from the trace pool. Any task record containing one of the keywords `error`, `fail`, `wrong`, or `mistake` in its lesson field is flagged as a failure trace. The top-30 failure traces (by recency) are submitted to the local LLM (`qwen2.5-coder:32b` via Ollama) with the prompt: “Identify the top-3 systemic failure patterns with root cause and a concrete prohibition rule.” Each patch is stored in `apex_harness` and injected into the next generation’s system prompt as an explicit rule block.
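The Layer 1 selection step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' `joe_apex.py`: the `Trace` fields, the case-insensitive substring match, and the helper names are assumptions made for the example; only the four keywords, the top-30 recency cut, and the prompt text come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime

# Keywords from the Layer 1 description.
FAILURE_KEYWORDS = ("error", "fail", "wrong", "mistake")

# The prompt text quoted in the Layer 1 description.
L1_PROMPT = ("Identify the top-3 systemic failure patterns with root cause "
             "and a concrete prohibition rule.")


@dataclass
class Trace:
    # Hypothetical record layout; the real database schema is not given.
    task: str
    lesson: str
    timestamp: datetime


def is_failure(trace: Trace) -> bool:
    """Flag a trace whose lesson mentions any failure keyword.

    Assumption: matching is a case-insensitive substring test,
    so "failed" and "Error" both count.
    """
    lesson = trace.lesson.lower()
    return any(keyword in lesson for keyword in FAILURE_KEYWORDS)


def select_failure_traces(traces, limit=30):
    """Return the `limit` most recent failure traces, newest first."""
    failures = [t for t in traces if is_failure(t)]
    failures.sort(key=lambda t: t.timestamp, reverse=True)
    return failures[:limit]


def build_prompt(failures) -> str:
    """Assemble the LLM input: the instruction followed by one line per trace."""
    body = "\n".join(f"- {t.task}: {t.lesson}" for t in failures)
    return f"{L1_PROMPT}\n\n{body}"
```

The string returned by `build_prompt` would then be sent to whichever local model endpoint is in use; that call is left out here because the source does not describe the client code.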

L1 Result. 3 harness patches extracted from 114 traces. Failure modes: (i) Port Conflict: `openclaw-gateway` port collisions under concurrent restart; (ii) Frontend Stability: CI test coverage gaps triggering silent regressions; (iii) Crisis Detection Delay: metric polling intervals too coarse for alert SLAs.

### 3.3 Layer 2: Principle Distillation (EvolveR-inspired)

Layer 2 selects the highest\-quality success traces using a multi\-factor quality score:

$$s(t) \;=\; 0.4\cdot\mathbf{1}[|lesson| > 50] + 0.3\cdot\mathbf{1}[|actions| > 30] + 0.2\cdot\mathbf{1}[\textit{files} \neq \emptyset] + 0.1\cdot\mathbf{1}[\textit{source} \neq \texttt{self}] \tag{1}$$

The top-34 traces (the top 30% of the 114-trace pool) are submitted to the LLM with the prompt: “Extract 6 reusable behavioural principles that made these tasks successful.” Each candidate principle is scored for novelty using cosine overlap against existing principles (duplicates are penalised), specificity (length as a proxy for actionability), and completeness. Principles scoring $\geq 0.3$ on the composite novelty metric are admitted to `apex_principles`.
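As a rough sketch of how Eq. (1) and the top-30% cut could be computed, the following Python assumes that $|lesson|$ and $|actions|$ are character lengths and that traces carry the hypothetical fields shown; the paper does not state the units or the record layout.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    # Hypothetical record layout for illustration only.
    lesson: str
    actions: str
    files: list = field(default_factory=list)
    source: str = "self"


def quality_score(t: Trace) -> float:
    """Multi-factor quality score s(t) from Eq. (1).

    Assumption: |lesson| and |actions| are measured in characters.
    """
    score = 0.0
    if len(t.lesson) > 50:
        score += 0.4
    if len(t.actions) > 30:
        score += 0.3
    if t.files:                 # files != empty set
        score += 0.2
    if t.source != "self":
        score += 0.1
    return score


def top_fraction(traces, fraction=0.30):
    """Keep the highest-scoring `fraction` of the pool (34 of 114 at 30%)."""
    k = int(len(traces) * fraction)
    ranked = sorted(traces, key=quality_score, reverse=True)
    return ranked[:k]
```

With a 114-trace pool, `top_fraction` keeps `int(114 * 0.30) = 34` traces, which matches the top-34 figure quoted above.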

Table 1: Extracted principles from Layer 2. All 6 are novel (avg. novelty 0.998).

L2 Result. 6/6 principles novel (average novelty score 0.998). All principles are production-grounded: they are derived from actual deployment traces rather than synthetic benchmarks.
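The novelty gate can be illustrated with the rule $\text{nov}(q) = 1 - \max_{q'} \cos(q, q')$ and the 0.3 threshold. The sketch below makes several assumptions that the paper does not specify: text is vectorised as a bag of lowercase whitespace tokens, admitted candidates join the comparison pool immediately, and the specificity and completeness terms are omitted because their weights are not given. The example principle strings in the usage note are invented for illustration.

```python
import math
from collections import Counter


def _vectorise(text: str) -> Counter:
    # Assumption: bag-of-words over lowercase whitespace tokens.
    return Counter(text.lower().split())


def cosine(a: str, b: str) -> float:
    """Cosine similarity between two token-count vectors."""
    va, vb = _vectorise(a), _vectorise(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty(candidate: str, existing) -> float:
    """nov(q) = 1 - max cosine against existing principles (1.0 if none exist)."""
    if not existing:
        return 1.0
    return 1.0 - max(cosine(candidate, e) for e in existing)


def admit(candidates, existing, threshold=0.3):
    """Admit candidates whose novelty meets the threshold.

    Assumption: each admitted principle is added to the comparison pool,
    so later duplicates within the same batch are rejected.
    """
    pool = list(existing)
    admitted = []
    for q in candidates:
        if novelty(q, pool) >= threshold:
            admitted.append(q)
            pool.append(q)
    return admitted
```

For example, `admit(["verify config first", "verify config first", "run tests after deploy"], [])` keeps the first and third strings and rejects the exact duplicate.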

### 3.4 Layer 3: Workflow Topology Evolution (AFlow-inspired)

Layer 3 maintains a population of agent workflow DAGs defined over a canonical node vocabulary: `intake`, `research`, `plan`, `code`, `review`, `verify`, `dispatch`, `summarize`. Each topology $G$ is scored by structural fitness:

$$\begin{aligned}
\text{score}(G) \;=\; {} & 0.50 + 0.10\cdot\mathbf{1}[\texttt{review}\in G] + 0.10\cdot\mathbf{1}[\texttt{verify}\in G] \\
& + 0.05\cdot\mathbf{1}[\texttt{research}\in G] + 0.15\cdot\mathbf{1}[\text{loop-back routing}] \\
& + 0.05\cdot\mathbf{1}[\text{parallel nodes}] - 0.10\cdot\mathbf{1}[|G| > 8]
\end{aligned} \tag{2}$$

Mutation operators: `add_node` (inject research node before plan), `add_routing` (add self-correction loop on failed review), `insert_verify` (add verification stage after code). The top-2 topologies per generation each produce two mutant children; over 3 generations, 10 distinct topologies were evaluated.
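To make Eq. (2) and the three mutation operators concrete, here is a small Python sketch. The `Topology` representation (an ordered node tuple plus two boolean flags), the reading of $|G|$ as node count, and the seed topology are all assumptions for illustration; the seed is not the paper's `baseline_v1`, whose structure is not given.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Topology:
    # Hypothetical representation: ordered node names plus two routing flags.
    name: str
    nodes: tuple
    loop_back: bool = False   # self-correction loop on failed review
    parallel: bool = False    # some nodes run in parallel


def score(g: Topology) -> float:
    """Structural fitness from Eq. (2); |G| is read as the node count."""
    s = 0.50
    if "review" in g.nodes:
        s += 0.10
    if "verify" in g.nodes:
        s += 0.10
    if "research" in g.nodes:
        s += 0.05
    if g.loop_back:
        s += 0.15
    if g.parallel:
        s += 0.05
    if len(g.nodes) > 8:
        s -= 0.10
    return round(s, 3)  # rounding hides floating-point noise


def add_node(g: Topology) -> Topology:
    """Inject a research node immediately before plan."""
    if "research" in g.nodes or "plan" not in g.nodes:
        return g
    i = g.nodes.index("plan")
    return replace(g, name=g.name + "+research",
                   nodes=g.nodes[:i] + ("research",) + g.nodes[i:])


def add_routing(g: Topology) -> Topology:
    """Add a self-correction loop that fires on a failed review."""
    if g.loop_back or "review" not in g.nodes:
        return g
    return replace(g, name=g.name + "+loop", loop_back=True)


def insert_verify(g: Topology) -> Topology:
    """Add a verification stage immediately after code."""
    if "verify" in g.nodes or "code" not in g.nodes:
        return g
    i = g.nodes.index("code") + 1
    return replace(g, name=g.name + "+verify",
                   nodes=g.nodes[:i] + ("verify",) + g.nodes[i:])


# Hypothetical seed topology, used only to exercise the operators.
seed = Topology("seed", ("intake", "plan", "code", "review", "summarize"))
evolved = insert_verify(add_routing(add_node(seed)))
```

Under these assumptions the seed scores 0.6 (base plus review) and the fully mutated topology scores 0.9, since research, verify, and loop-back routing add 0.05, 0.10, and 0.15 respectively.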

Table 2: Topology evolution results across 3 generations.

L3 Result. Best topology: `research_first_v1` (score $= 0.900$, +20% over `baseline_v1` at $0.750$). Key insight: research-before-code topology dominates, consistent with EvolveR’s finding that context-gathering before action reduces downstream execution errors \[[2](https://arxiv.org/html/2606.15363#bib.bib2)\].

### 3.5 APEX Algorithm

Algorithm [1](https://arxiv.org/html/2606.15363#S3.SS5) summarises the complete Apex evolution cycle.

Algorithm 1: APEX Evolution Cycle

Require: $\mathcal{T}$: trace database; $M$: LLM oracle; $P_0$: initial topology population

Ensure: $\Delta$: harness patches; $\mathcal{Q}$: novel principles; $\tau^{*}$: best topology

1. (Layer 1: Harness Patching) $\mathcal{T}_{\text{fail}} \leftarrow \{t \in \mathcal{T} : \text{lesson}(t) \cap \{\text{error, fail, wrong, mistake}\} \neq \emptyset\}$
2. $\Delta \leftarrow M\bigl(\text{``Extract top-3 failure patterns w/ patch rules''},\ \text{top-30}(\mathcal{T}_{\text{fail}})\bigr)$
3. $\text{store}(\Delta \to \texttt{apex\_harness})$
4. (Layer 2: Principle Distillation) for each $t \in \mathcal{T}$: compute $s(t)$ per Eq. (1)
5. $\mathcal{Q}_{\text{cand}} \leftarrow M\bigl(\text{``Extract 6 reusable principles''},\ \text{top-30\%}(\mathcal{T}, s)\bigr)$
6. for each $q \in \mathcal{Q}_{\text{cand}}$: $\text{nov}(q) \leftarrow 1 - \max_{q'} \cos(q, q')$
7. if $\text{nov}(q) \geq 0.3$: store $q \to \texttt{apex\_principles}$
8. (Layer 3: Topology Evolution, 3 generations) $P \leftarrow P_0$
9. for $\text{gen} = 1, 2, 3$:
10. (loop body) score each $G \in P$ per Eq. (2)
11. (loop body) $P \leftarrow \text{top-2}(P) \cup \text{mutate}\bigl(\text{top-2}(P)\bigr)$
12. $\tau^{*} \leftarrow \arg\max_{G \in P} \text{score}(G)$
13. (Health Score Aggregation) $H \leftarrow \min(0.30,\ |\Delta| \cdot 0.10) + \min(0.40,\ |\mathcal{Q}| \cdot 0.07) + \text{score}(\tau^{*}) \cdot 0.30$
14. return $\Delta,\ \mathcal{Q},\ \tau^{*},\ H$
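The Layer 3 selection loop in Algorithm 1 (score, keep the top 2, add their mutants, repeat for 3 generations) can be sketched generically. The function below takes the scoring and mutation functions as parameters; the toy feature-set topologies, `toy_score`, and `toy_mutate` are invented purely to exercise the loop and are not the paper's operators.

```python
def evolve(population, score, mutate, generations=3, elite=2):
    """Elitist selection loop: P <- top-k(P) U mutate(top-k(P)).

    Items must be hashable. Returns the best item of the final population
    and the number of distinct items seen across all generations.
    """
    pop = list(dict.fromkeys(population))       # de-duplicate, keep order
    evaluated = set(pop)
    for _ in range(generations):
        ranked = sorted(pop, key=score, reverse=True)
        parents = ranked[:elite]
        children = [child for p in parents for child in mutate(p)]
        pop = list(dict.fromkeys(parents + children))
        evaluated.update(pop)
    best = max(pop, key=score)
    return best, len(evaluated)


# --- Toy example (hypothetical): a topology is a frozenset of features. ---
FEATURES = ("research", "verify", "loop_back")


def toy_score(g) -> float:
    # Each structural feature adds a fixed bonus to a 0.5 base.
    return round(0.5 + 0.1 * len(g), 3)


def toy_mutate(g):
    # Produce up to two children, each adding one missing feature.
    missing = [f for f in FEATURES if f not in g]
    return [g | {f} for f in missing[:2]]


best, n_seen = evolve([frozenset()], toy_score, toy_mutate)
```

In this toy run the loop reaches the topology containing all three features by the third generation. Keeping the parents in the next population is what makes the best score non-decreasing from one generation to the next.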

## 4 Experimental Evaluation

### 4.1 Experimental Setup

We deploy Apex on Joe \[[13](https://arxiv.org/html/2606.15363#bib.bib13)\], a production-grade super AI Agent built on NVIDIA Nemotron and developed as an *Edge AI Agent Factory* for the NVIDIA Agent Challenge 2026. Joe runs on Ubuntu 22.04 and autonomously manages a 15-node compute fleet (192.168.1.x subnet). Joe’s trace database contains 114 real-world task executions collected between 2026-05-26 and 2026-06-13 (18 days) across five task domains: AI/ML deployment (32%), systems administration (28%), frontend/web development (22%), networking (12%), and security hardening (6%). All LLM calls use `qwen2.5-coder:32b` via Ollama with no external API dependency, ensuring full data privacy and zero marginal inference cost.

### 4.2 APEX Health Score Formulation

We define the APEX Health Score $H$ as a weighted composite of per-layer contributions:

$$H \;=\; \underbrace{\min(0.30,\; |\Delta| \times 0.10)}_{\text{L1: harness coverage}} + \underbrace{\min(0.40,\; |\mathcal{Q}| \times 0.07)}_{\text{L2: principle richness}} + \underbrace{\text{score}(\tau^{*}) \times 0.30}_{\text{L3: workflow quality}} \tag{3}$$

The formula assigns maximum weight to L2 (0.40), reflecting that internalised behavioural principles offer the broadest generalisation benefit. L1 and L3 each contribute up to 0.30. The untuned agent baseline is calibrated as $H_0 = 0.300$, corresponding to the observed task-completion quality without any Apex evolution; baseline and Self-Harness $H$ values in [Table 3](https://arxiv.org/html/2606.15363#S4.T3) are measured from agent task completion rates, while Apex $H$ is computed analytically from [Eq. (3)](https://arxiv.org/html/2606.15363#S4.E3).
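A direct transcription of Eq. (3) into Python is shown below as a minimal sketch. The function name and the rounding are choices made here. The usage note treats stored-but-not-yet-injected principles as contributing zero; that is one reading that reproduces the reported figures, not a rule stated in the paper.

```python
def health_score(n_patches: int, n_principles: int, topology_score: float) -> float:
    """APEX Health Score H from Eq. (3).

    n_patches      -- |Delta|, number of harness patches (L1)
    n_principles   -- |Q|, number of principles counted toward L2
    topology_score -- score(tau*), structural fitness of the best topology (L3)
    """
    l1 = min(0.30, n_patches * 0.10)      # harness coverage, capped at 0.30
    l2 = min(0.40, n_principles * 0.07)   # principle richness, capped at 0.40
    l3 = topology_score * 0.30            # workflow quality
    return round(l1 + l2 + l3, 3)         # rounding hides floating-point noise


# 3 patches, no principles counted, best topology 0.900.
h_reported = health_score(3, 0, 0.900)

# Topology evolution alone: no patches, no principles.
h_l3_only = health_score(0, 0, 0.900)
```

Here `h_reported` evaluates to 0.57 and `h_l3_only` to 0.27, matching the reported $H = 0.570$ and the L3-only value of $0.270$.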

### 4.3 Results

Table 3: Comparison across three configurations.

Summary. Apex achieves a +90% improvement over the untuned baseline and +50% over Self-Harness alone, at a cost of 4 LLM calls (≈270 s) per evolution cycle on a local `qwen2.5-coder:32b` instance.
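As a quick check on the headline percentage, the relative improvement over the baseline follows directly from the two reported Health Scores:

$$\frac{H - H_0}{H_0} \;=\; \frac{0.570 - 0.300}{0.300} \;=\; 0.90 \quad (+90\%).$$

The Self-Harness score is not stated in the running text, but the +50% figure implies a value of roughly

$$H_{\text{SH}} \;\approx\; \frac{0.570}{1.50} \;=\; 0.380,$$

which is consistent with the +27% over baseline quoted in the Conclusion ($0.300 \times 1.27 \approx 0.381$). This value is inferred from the percentages, not quoted from Table 3.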

### 4.4 Per-Layer Ablation

Table 4: Ablation study. ∗ and † are explained below.

∗ L3-only topology evolution without harness patching or behavioural principles scores below the calibrated baseline ($H = 0.270 < H_0 = 0.300$), confirming that structural workflow optimisation requires a quality harness foundation. The non-additive interaction between L1 and L3 ($H_{\text{L1+L3}} = 0.570 > H_{\text{L1+L2}} = 0.500$) indicates that workflow topology improvement ($+0.190$) outweighs principle richness ($+0.120$) in the current production task distribution.

† In the current implementation, extracted L2 principles are stored correctly in `apex_principles` but are not yet injected at harness assembly time; this integration is under active development. As a result, Apex-full matches L1+L3 ($H = 0.570$). With L2 injection complete, [Eq. (3)](https://arxiv.org/html/2606.15363#S4.E3) projects $H \approx 0.65$–$0.70$.

## 5 Discussion

### 5.1 Why Multi-Axis Co-Evolution Wins

The ablation results expose the complementary roles of each layer. Self-Harness patches surface failures but cannot generalise beyond the observed failure distribution. EvolveR-style principles generalise well across task types but cannot fix structural workflow inefficiencies. AFlow-style topology search finds optimal execution pipelines but cannot compensate for poor harness rules or absent operational principles. Apex combines all three axes because they address orthogonal failure modes: L1 fixes *known failure modes* via explicit prohibitions; L2 encodes *successful behavioural patterns* as reusable guidelines; L3 evolves the *DAG structure* (which nodes run in what order, with what routing) to optimise information flow and self-correction capacity.

Notably, the L3-only ablation ($H = 0.270 < H_0 = 0.300$) demonstrates that structural workflow improvement *without* a quality harness foundation can reduce net agent quality. This non-additive interaction, where L3 requires L1 as a prerequisite to contribute positively, is a dependence that single-axis frameworks cannot detect or exploit.

### 5.2 Production Advantages

Unlike Self-Harness and EvoAgentX, which require curated benchmark evaluations, Apex operates entirely on *production traces*. No synthetic benchmark is needed; the improvement signal derives directly from real user tasks, ensuring Apex’s evolution is always aligned with the actual deployment distribution.

The local-LLM requirement (Ollama/`qwen2.5-coder:32b`) means Apex runs without external API dependencies, at zero marginal cost, and with full data privacy, all of which are critical requirements for enterprise deployments handling sensitive operational data.

### 5.3 Limitations and Future Work

#### Structural scoring heuristics\.

Layer 3’s topology scoring currently uses hand\-crafted structural heuristics rather than empirical task\-completion rates\. Future work should close this loop by evaluating candidate topologies on held\-out task traces\.

#### L2 runtime injection\.

In the current implementation, extracted principles are correctly stored but not yet injected at harness assembly time. Completing this integration is expected to push $H$ to $\approx 0.65$–$0.70$.

#### Weight evolution \(Layer 4\)\.

Apex currently operates at the prompt and workflow level. Integrating Online-LoRA \[[4](https://arxiv.org/html/2606.15363#bib.bib4)\] as Layer 4 would enable weight-level learning from production traces, with a projected additional 10–20% improvement.

#### Single\-agent scope\.

Apex currently evolves one agent’s harness, principles, and workflow. Multi-agent team topology evolution, including cross-agent principle sharing, is an important avenue for future work.

## 6 Conclusion

We presented Apex, a three-layer self-evolution framework that simultaneously evolves the harness (L1), behavioural principles (L2), and workflow topology (L3) of a production AI agent. Implemented on Joe, a real-world agent managing a 15-node compute fleet, Apex achieves a Health Score of 0.570 (+90% vs. baseline) in a single evolutionary run using 114 production traces, no external APIs, and no synthetic benchmarks. The total evolution cost is 4 LLM calls and approximately 270 s on a local GPU.

The central finding is that multi-axis co-evolution substantially outperforms single-axis harness optimisation: Self-Harness alone achieves +27%; Apex with all three layers achieves +90%. The ablation further reveals a non-additive interaction where L3 topology evolution requires L1’s harness foundation to contribute positively, a dependence that single-axis frameworks cannot capture.

Key Takeaway. To evolve a production AI agent, do not only patch what it does wrong (Self-Harness). Also distil what it does *right* into transferable principles (EvolveR), and evolve *how* it orchestrates its own work (AFlow). Apex makes all three automatic, cheap, and privacy-preserving.

## References

- \[1\] Ye et al. Self-Harness: Automated Agent Self-Improvement via Harness Optimization. arXiv:2606.09498, 2025. [https://arxiv.org/abs/2606.09498](https://arxiv.org/abs/2606.09498)
- \[2\] Li et al. EvolveR: Experience-Driven Lifecycle Distillation for Autonomous Agents. ICLR 2026.
- \[3\] Wang et al. EvoAgentX: A Unified Framework for Multi-Dimensional Agent Optimization. EMNLP 2025. [https://github.com/EvoAgentX/EvoAgentX](https://github.com/EvoAgentX/EvoAgentX)
- \[4\] Gao et al. Online-LoRA: Task-free Online Continual Learning via Low Rank Adaptation. WACV 2025. [https://github.com/Christina200/Online-LoRA-official](https://github.com/Christina200/Online-LoRA-official)
- \[5\] DEAL Team. DEAL: Continuous LoRA Fine-Tuning with Knowledge Retention. 2025.
- \[6\] Zhang J. et al. AFlow: Automating Agentic Workflow Generation. arXiv:2410.10762, 2024. [https://arxiv.org/abs/2410.10762](https://arxiv.org/abs/2410.10762)
- \[7\] Yuksekgonul M. et al. TextGrad: Automatic “Differentiation” via Text. arXiv:2406.07496, 2024. [https://arxiv.org/abs/2406.07496](https://arxiv.org/abs/2406.07496)
- \[8\] Yao S. et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629. [https://arxiv.org/abs/2210.03629](https://arxiv.org/abs/2210.03629)
- \[9\] Shinn N. et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366. [https://arxiv.org/abs/2303.11366](https://arxiv.org/abs/2303.11366)
- \[10\] Anonymous. From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents. arXiv:2603.22386, 2026. [https://arxiv.org/abs/2603.22386](https://arxiv.org/abs/2603.22386)
- \[11\] Wang Z. et al. Symbolic Learning Enables Self-Evolving Agents. arXiv:2406.18532, 2024. [https://arxiv.org/abs/2406.18532](https://arxiv.org/abs/2406.18532)
- \[12\] Anonymous. Continual Learning, Not Training: Online Adaptation For Agents. arXiv:2511.01093, 2025. [https://arxiv.org/abs/2511.01093](https://arxiv.org/abs/2511.01093)
- \[13\] Grace AI Technology. Joe: Nemotron-Powered Edge AI Agent Factory. NVIDIA Agent Challenge 2026. [https://aispark.airlive.com/joe-hackathon/](https://aispark.airlive.com/joe-hackathon/)