QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems

arXiv cs.AI Papers

Summary

QUIVER introduces a formal framework for quantifying how perturbations propagate through compound AI systems structured as computation graphs, defining sensitivity matrices, trajectory divergence, bifurcation thresholds, and distribution faithfulness, with validation on production and public pipelines.

arXiv:2605.23956v1 Announce Type: new Abstract: Compound AI systems that chain multiple LLM calls into directed computation graphs are now the dominant architecture for production AI. Although these architectures leverage heterogeneous nodes with mixed-mode outputs, no existing framework quantifies how perturbations propagate through such pipelines, where nodes are stochastic and execution paths can diverge structurally. We introduce QUIVER, a formal framework for measuring perturbation propagation in graph-structured LLM pipelines. The framework defines: (1) a sensitivity matrix with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive, complemented by occurrence-lift; (2) trajectory divergence decomposing variation into value drift, structural path divergence, and iteration count divergence; (3) bifurcation thresholds identifying the smallest perturbation that causes structural execution path changes; and (4) distribution faithfulness, quantifying when per node evaluation datasets diverge from production distributions. We validate on two production enterprise pipelines and a public DSPy multihop QA pipeline, three structurally distinct architectures. Across 8,200+ instrumented traces (32,000+ pair comparisons), we demonstrate that QUIVER reveals distinct sensitivity profiles across architectures, distinguishes mechanistically different cascade patterns producing identical divergence rates, predicts nodes prone to trajectory bifurcation from observational data alone, and localizes stale evaluation artifacts to specific node-field categories that aggregate metrics cannot surface.

Cached at: 05/26/26, 09:03 AM

# QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems
Source: [https://arxiv.org/html/2605.23956](https://arxiv.org/html/2605.23956)
###### Abstract

Compound AI systems that chain multiple LLM calls into directed computation graphs with parallel branches, sequential stages, and conditional loops are now the dominant architecture for production AI applications (Khattab and others, [2024](https://arxiv.org/html/2605.23956#bib.bib7); Yao et al., [2022](https://arxiv.org/html/2605.23956#bib.bib23); Shao et al., [2024](https://arxiv.org/html/2605.23956#bib.bib26)). Although these architectures leverage heterogeneous nodes with mixed-mode outputs, from typed schemas to natural language, no existing framework quantifies how perturbations propagate through such pipelines, where nodes are stochastic, outputs are heterogeneously typed, and execution paths can diverge structurally under perturbation. We introduce QUIVER, a formal framework for measuring perturbation propagation in computation-graph-structured LLM pipelines. The framework defines: (1) a sensitivity matrix with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive and composes multiplicatively along paths, complemented by occurrence-lift capturing the probability of downstream drift independent of magnitude; (2) three-component trajectory divergence decomposing variation into value drift, structural path divergence, and iteration count divergence; (3) bifurcation thresholds identifying the smallest perturbation that causes structural execution path changes; and (4) distribution faithfulness, quantifying when per-node evaluation datasets diverge from production distributions.
We validate on two production enterprise pipelines with different architectures (Systems P and Q; Figures [1(a)](https://arxiv.org/html/2605.23956#S0.F1.sf1), [1(b)](https://arxiv.org/html/2605.23956#S0.F1.sf2)) and cross-validate on a public DSPy (Khattab and others, [2024](https://arxiv.org/html/2605.23956#bib.bib7)) multi-hop QA pipeline (HotpotQA, Yang et al. [2018](https://arxiv.org/html/2605.23956#bib.bib28); ColBERTv2, Santhanam et al. [2022](https://arxiv.org/html/2605.23956#bib.bib27)), a third and structurally distinct topology. Across 8,200+ instrumented traces yielding 32,000+ pair comparisons, we demonstrate that: (a) the framework reveals distinct sensitivity profiles across architectures, from global resilience to deep cascade amplification; (b) identical divergence rates arise from mechanistically distinct patterns, distinguishable only via nodal decomposition; (c) observational sensitivity profiles predict nodes prone to trajectory bifurcation; and (d) distribution faithfulness quantifies the per-node evaluation gap, localizing stale-evaluation artifacts to specific node-field categories that aggregate metrics cannot surface (Appendix F case study).

![Refer to caption](https://arxiv.org/html/2605.23956v1/x1.png) (a) System P architecture.
![Refer to caption](https://arxiv.org/html/2605.23956v1/x2.png) (b) System Q architecture.

Figure 1: Comparative architectures of Systems P and Q. System P uses a parallel-intake first wave (rewriter, signal analysis, context sufficiency) feeding a planner loop with conditional replanning and tool execution. System Q uses a dual-planner entry point with a winner-selection mechanism that routes to either a fast path or a slow path through retrieval, reranking, and generation.

## 1 Introduction

Compound AI systems that chain multiple LLM calls into directed computation graphs with parallel branches, sequential stages, and conditional loops are now the dominant architecture for production AI applications (Khattab and others, [2024](https://arxiv.org/html/2605.23956#bib.bib7); Yao et al., [2022](https://arxiv.org/html/2605.23956#bib.bib23); Shao et al., [2024](https://arxiv.org/html/2605.23956#bib.bib26)). Although these architectures span heterogeneous nodes with mixed-mode outputs (from typed schemas to natural language), no existing framework quantifies how perturbations propagate through them, where nodes are stochastic, outputs are heterogeneously typed, and execution paths can diverge structurally under perturbation.

When a production pipeline returns a low-quality response, practitioners cannot determine which node is responsible, whether quality degraded continuously along a chain of small perturbations, or whether a single upstream change crossed a threshold that flipped the execution path. End-to-end evaluation detects symptoms but cannot localize causes; per-node evaluation in isolation tests each component against curated inputs that may not reflect what it receives from upstream, yielding quality estimates that do not predict production behavior. The problem compounds in pipelines with conditional loops (Yao et al., [2022](https://arxiv.org/html/2605.23956#bib.bib23)), where small upstream perturbations can activate different nodes, execute different loop iterations, or retrieve entirely different knowledge. Existing optimization and evaluation tooling (Khattab and others, [2024](https://arxiv.org/html/2605.23956#bib.bib7); Yuksekgonul and others, [2024](https://arxiv.org/html/2605.23956#bib.bib8); Cheng et al., [2024](https://arxiv.org/html/2605.23956#bib.bib9)) treats nodes independently, without measuring how perturbations propagate across edges or where structural bifurcation occurs (Section 3).

We introduce QUIVER, a formal framework for measuring perturbation propagation in computation-graph-structured LLM pipelines. The framework comprises four contributions: (1) a *sensitivity matrix* with type-dispatched distance metrics that classifies edges as amplifiers, absorbers, or threshold-sensitive, complemented by *occurrence-lift* capturing the probability of propagation independent of magnitude; (2) *three-component trajectory divergence* decomposing variation into value drift, structural path divergence, and iteration count divergence; (3) *bifurcation thresholds* identifying the smallest perturbation that causes structural execution path changes; and (4) *distribution faithfulness*, quantifying when evaluation datasets diverge from production distributions. We also release a programmatic trace interface for automatic computation of all measurements on any pipeline.

We validate on two production enterprise pipelines with different architectures (Figure [1](https://arxiv.org/html/2605.23956#S0.F1)): System P, a complex graph with parallel intake, retrieval tool selection, and a conditional replanning loop ($k_{\max}=5$); and System Q, a dual-planner pipeline with parallel reranking and fast/slow path routing. We additionally cross-validate on a public DSPy multi-hop QA pipeline over HotpotQA with ColBERTv2 retrieval, a strict sequential chain that supplies a third, structurally distinct topology. Across 8,200+ instrumented traces yielding 32,000+ pair comparisons (observational and interventional), we demonstrate: (a) the framework reveals distinct sensitivity profiles across architectures, from global resilience to deep cascade amplification; (b) identical divergence rates arise from mechanistically distinct cascade patterns, distinguishable only via per-node decomposition; (c) observational profiles predict which nodes will bifurcate under perturbation; and (d) distribution faithfulness detects previously undetectable evaluation invalidation from configuration changes.

## 2 Framework

We define the formal objects for measuring perturbation propagation in compound LLM pipelines\. Extended discussion, worked examples, and estimation details are provided in Appendices A and B\.

### 2.1 Pipeline Graph and Typed Output Spaces

###### Definition 1 (Typed Pipeline Graph).

A compound LLM pipeline is a tuple $\mathcal{G}=(V,E,\mathcal{T},\mathcal{F})$ where $V=\{v_1,\ldots,v_n\}$ is a finite set of nodes, $E\subseteq V\times V$ is a set of directed edges representing data flow, $\mathcal{T}=\{T_1,\ldots,T_n\}$ assigns each node $v_i$ a typed output space $T_i$, and $\mathcal{F}=\{f_1,\ldots,f_n\}$ assigns each node a stochastic function:

$$f_i:\prod_{j\in\mathrm{pa}(i)}T_j\rightarrow\Delta(T_i) \tag{1}$$

where $\mathrm{pa}(i)=\{j:(v_j,v_i)\in E\}$ and $\Delta(T_i)$ is the set of probability distributions over $T_i$.

Typed output spaces may include schema\-typed objects, ordered lists, categorical values, or unstructured text\. External inputs are modeled as source nodes with no parents\.
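As a rough illustration of Definition 1, the tuple $(V,E,\mathcal{T},\mathcal{F})$ can be held in a small container. The class name `PipelineGraph`, its field names, and the three-node example are assumptions made for this sketch; they are not the trace interface released with the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PipelineGraph:
    """Hypothetical container for G = (V, E, T, F) from Definition 1."""
    nodes: List[str]                      # V: node identifiers
    edges: List[Tuple[str, str]]          # E: (parent, child) data-flow edges
    types: Dict[str, str]                 # T: output-type label per node
    funcs: Dict[str, Callable] = field(default_factory=dict)  # F: node functions

    def parents(self, node: str) -> List[str]:
        """pa(i): all nodes with an edge into `node`."""
        return [u for (u, v) in self.edges if v == node]

    def sources(self) -> List[str]:
        """External inputs are modeled as nodes with no parents."""
        return [v for v in self.nodes if not self.parents(v)]


# Illustrative three-node chain (node names are made up for the example).
g = PipelineGraph(
    nodes=["query", "rewriter", "planner"],
    edges=[("query", "rewriter"), ("rewriter", "planner")],
    types={"query": "text", "rewriter": "text", "planner": "schema"},
)
print(g.parents("planner"), g.sources())
```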

### 2.2 Type-Dispatched Distance Metrics

###### Definition 2 (Type-Dispatched Distance).

For each typed output space $T_i$, define $d_i:T_i\times T_i\rightarrow\mathbb{R}_{\geq 0}$ satisfying non-negativity and identity of indiscernibles. For schema-typed spaces $T_i=T_i^{(1)}\times\cdots\times T_i^{(m)}$:

$$d_i(x,y)=\sum_{k=1}^{m}w_k\cdot d_i^{(k)}(x^{(k)},y^{(k)}) \tag{2}$$

where $w_k\geq 0$, $\sum_k w_k=1$, and each $d_i^{(k)}$ is appropriate to the field type: $\mathbb{1}[a\neq b]$ for categorical, $1-|A\cap B|/|A\cup B|$ for set-valued, normalized edit distance for ordered lists, normalized absolute difference for numeric, and $1-\cos(\phi(s),\phi(t))$ for text fields.

Per-field weights $w_k$ may encode application priors (e.g., higher weight for routing-decision fields than for descriptive context fields); the choice of weighting and per-type kernel is application-dependent and orthogonal to the framework's edge-classification definitions below.
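The kernels in Definition 2 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the list kernel normalizes Levenshtein distance by the longer length, the numeric kernel takes a caller-supplied `scale`, and the text kernel expects embedding vectors that were already computed elsewhere (the paper uses a sentence-embedding model for $\phi$). The paper does not spell out these normalization choices in this section.

```python
import math


def d_categorical(a, b):
    """Indicator kernel: 1 if the values differ, else 0."""
    return 0.0 if a == b else 1.0


def d_set(a, b):
    """Jaccard distance; two empty sets count as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def d_list(a, b):
    """Levenshtein distance normalized by the longer length (assumption)."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1] / max(len(a), len(b))


def d_numeric(a, b, scale):
    """Absolute difference divided by an assumed field range, capped at 1."""
    return min(abs(a - b) / scale, 1.0)


def d_text(u, v):
    """1 - cosine similarity on precomputed embedding vectors."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0:
        return 0.0  # assumption: degenerate zero vectors count as identical
    return max(0.0, 1.0 - sum(x * y for x, y in zip(u, v)) / norm)


def d_schema(x, y, kernels, weights=None):
    """Equation (2): weighted sum of per-field kernels (uniform by default)."""
    weights = weights or {k: 1.0 for k in kernels}
    total = sum(weights.values())
    return sum(weights[k] / total * kernels[k](x[k], y[k]) for k in kernels)


# Illustrative schema-typed outputs (field names and values are made up).
x = {"route": "fast", "tools": {"a", "b"}, "steps": ["s1", "s2"]}
y = {"route": "slow", "tools": {"a"}, "steps": ["s1", "s2"]}
kernels = {"route": d_categorical, "tools": d_set, "steps": d_list}
dist = d_schema(x, y, kernels)
print(dist)
```

With uniform weights, the example combines a categorical flip (1), a half-overlapping set (0.5), and an unchanged list (0) into a single composite distance.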

### 2.3 Sensitivity Matrix

###### Definition 3 (Edge Sensitivity).

For each edge $(v_i,v_j)\in E$:

$$\sigma_{ij}=\mathbb{E}\left[\frac{d_j\big(f_j(\mathbf{x}),\;f_j(\mathbf{x}')\big)}{d_i(x_i,x_i')}\right] \tag{3}$$

where $\mathbf{x},\mathbf{x}'$ differ only in the component from $v_i$, and the expectation is restricted to pairs with $d_i>\epsilon$.

This classifies edges as *amplifiers* ($\sigma>1$), *absorbers* ($\sigma<1$), or *insensitive* ($\sigma\approx 0$). Edges with $\hat{\sigma}$ close to 1 are also called *near-unity*: they sit at the amplifier/absorber boundary, where the binary classification is most sensitive to per-field kernel and weight choices. The band $|\hat{\sigma}-1|<\delta$ for an application-chosen $\delta$ captures these edges; we report $\delta$ inline where used. The scalar $\sigma_{ij}$ may summarize a multimodal distribution; we recommend inspecting the full ratio distribution (Section 5).
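A naive plug-in estimate of Equation (3) averages the ratio $d_j/d_i$ over trace pairs with $d_i>\epsilon$. The sketch below shows that estimator together with a classifier; it is not the partial-regression estimator described in Section 2.5, and the function names, `delta=0.1`, `zero_tol`, and the sample distances are assumptions chosen for illustration.

```python
from statistics import mean, median


def edge_sensitivity(pairs, eps=1e-6):
    """pairs: (d_i, d_j) distances per trace pair; keeps only d_i > eps."""
    ratios = [dj / di for di, dj in pairs if di > eps]
    if not ratios:
        return None  # no usable pairs for this edge
    return {"n": len(ratios), "mean": mean(ratios), "median": median(ratios)}


def classify(sigma, delta=0.1, zero_tol=1e-3):
    """Label an edge from its estimated sensitivity (thresholds are assumed)."""
    if sigma <= zero_tol:
        return "insensitive"
    if abs(sigma - 1.0) < delta:
        return "near-unity"
    return "amplifier" if sigma > 1.0 else "absorber"


# Made-up distances: two absorbing pairs, one amplifying pair, one d_i = 0 pair.
sample = [(0.2, 0.1), (0.1, 0.3), (0.0, 0.5), (0.4, 0.2)]
est = edge_sensitivity(sample)
print(est, classify(est["mean"]), classify(est["median"]))
```

In this toy sample the mean ratio labels the edge an amplifier while the median labels it an absorber, which is the kind of disagreement that motivates inspecting the full ratio distribution rather than the scalar alone.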

###### Definition 4 (Occurrence-Lift).

For each edge $(v_i,v_j)\in E$:

$$\lambda_{ij}=P\big(d_j>0\mid d_i>0\big)-P\big(d_j>0\mid d_i=0\big) \tag{4}$$

Occurrence-lift captures the *probability* of downstream drift, complementing $\sigma_{ij}$, which captures *magnitude*. The two measures are decoupled, carrying distinct information about an edge (Section 5).
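Equation (4) can be estimated by splitting trace pairs on whether the upstream node moved and comparing the downstream drift rates of the two partitions. The helper below is a minimal sketch; the name `occurrence_lift`, the tolerance `eps`, and the sample pairs are assumptions.

```python
def occurrence_lift(pairs, eps=0.0):
    """pairs: (d_i, d_j) per trace pair.

    Returns P(d_j > eps | d_i > eps) - P(d_j > eps | d_i <= eps),
    or None when either partition is empty.
    """
    moved = [dj > eps for di, dj in pairs if di > eps]
    still = [dj > eps for di, dj in pairs if di <= eps]
    if not moved or not still:
        return None
    return sum(moved) / len(moved) - sum(still) / len(still)


# Made-up pairs: upstream moved in three, stayed fixed in three.
sample = [(0.3, 0.2), (0.1, 0.0), (0.0, 0.0), (0.0, 0.4), (0.5, 0.1), (0.0, 0.0)]
lift = occurrence_lift(sample)
print(lift)
```

The estimate depends on both partitions being populated, so an edge with very few $d_i=0$ pairs gives a noisy or undefined value.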

###### Definition 5 (Sensitivity Matrix).

$\Sigma\in\mathbb{R}_{\geq 0}^{n\times n}$ where $\Sigma_{ij}=\sigma_{ij}$ if $(v_i,v_j)\in E$, else $0$.

###### Definition 6 (Path Sensitivity).

For a directed path $p=(v_{i_1},\ldots,v_{i_k})$:

$$\sigma(p)=\prod_{(v_{i_l},v_{i_{l+1}})\in p}\sigma_{i_l,i_{l+1}} \tag{5}$$

A path is a *cascade amplifier* if $\sigma(p)>1$. The path attaining the maximum over all source-to-sink paths is the *critical amplification path*. When parallel branches reconverge, interaction terms may be estimated via multivariate regression (Appendix B).
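To make Definition 6 concrete, the sketch below multiplies edge sensitivities along a path and searches all source-to-sink paths for the largest product. It assumes the graph is acyclic (loops already unrolled) and small enough to enumerate paths exhaustively; the function names, node labels, and $\sigma$ values are illustrative assumptions, not measurements from the paper.

```python
from math import prod


def path_sensitivity(sigma, path):
    """Equation (5): product of edge sensitivities along `path`."""
    return prod(sigma[(u, v)] for u, v in zip(path, path[1:]))


def source_to_sink_paths(sigma):
    """Enumerate every source-to-sink path of an acyclic edge map."""
    children = {}
    for u, v in sigma:
        children.setdefault(u, []).append(v)
    nodes = {n for edge in sigma for n in edge}
    sources = nodes - {v for _, v in sigma}

    def walk(path):
        nxt = children.get(path[-1], [])
        if not nxt:
            yield path
        for child in nxt:
            yield from walk(path + [child])

    for s in sorted(sources):
        yield from walk([s])


def critical_path(sigma):
    """Path with the largest sensitivity product."""
    return max(source_to_sink_paths(sigma), key=lambda p: path_sensitivity(sigma, p))


# Hypothetical diamond graph: a -> {b, c} -> d.
sigma = {("a", "b"): 2.0, ("b", "d"): 0.5, ("a", "c"): 1.5, ("c", "d"): 1.2}
best = critical_path(sigma)
print(best, path_sensitivity(sigma, best))
```

In the example, the branch through `b` has the single largest edge but is damped back to unity, while the branch through `c` is the cascade amplifier.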

### 2.4 Loop-Augmented Pipelines and Trajectory Bifurcation

We extend the framework to pipelines with conditional loops (Yao et al., [2022](https://arxiv.org/html/2605.23956#bib.bib23)). The loop body $L\subseteq V$ is unrolled into indexed copies $L^{(1)},\ldots,L^{(k^*)}$ where $k^*\leq k_{\max}$.

###### Definition 7 (Iteration Action).

At each iteration $t$, the loop controller produces $a^{(t)}\in\mathcal{A}$ from a finite action set (e.g., `EXECUTE`, `RETRY`, `GENERATE`, `COMPOSE`) with auxiliary parameters $Q^{(t)}$. The iteration topology is $g^{(t)}=(a^{(t)},Q^{(t)})$; the trajectory topology is $G^*=(k^*,g^{(1)},\ldots,g^{(k^*)})$.

###### Definition 8 (Trajectory).

$\tau=(o_1,\ldots,o_n,o_L^{(1)},\ldots,o_L^{(k^*)})$ where $o_i\in T_i$ and $o_L^{(t)}$ is the output tuple at iteration $t$.

###### Definition 9 (Trajectory Divergence).

For trajectories $\tau,\tau'$:

$$D(\tau,\tau')=(D_{\mathrm{iter}},\;D_{\mathrm{shape}},\;D_{\mathrm{output}}) \tag{6}$$

where $D_{\mathrm{iter}}=\sum_i|c_i-c_i'|$ with $c_i$ the number of invocations of node $v_i$ in trajectory $\tau$ (this generalizes loop-iteration counting: for a loop body $L$, $\sum_{i\in L}|c_i-c_i'|$ recovers $|L|\cdot|k^*-k^{*\prime}|$ when iteration shapes match, and otherwise also captures sub-stage call-count divergence in non-loop pipelines such as reranker or tool-execution invocations), $D_{\mathrm{shape}}=\sum_{t=1}^{\min(k^*,k^{*\prime})}\mathbb{1}[g^{(t)}\neq g'^{(t)}]$, and $D_{\mathrm{output}}=\sum_i w_i\cdot d_i(o_i,o_i')$.

The three components capture per-node *count*, *order*, and *value* divergence respectively. We additionally derive a node-presence indicator $D_{\mathrm{struct}}(\tau,\tau'):=\mathbb{1}[\{v:v\in\tau\}\neq\{v:v\in\tau'\}]$ from the trajectories as a summary statistic; it is not a fourth component of $D$ but is reported alongside it because $D_{\mathrm{iter}}$ and $D_{\mathrm{shape}}$ alone do not flag whether the activated node sets differ. The indicators may co-occur. For loop-free pipelines this construction extends naturally: we take $k^*=1$ and define $g^{(1)}=(a^{(1)},Q^{(1)})$ as the trace's *conditional-branch activation vector*, that is, the realized values of all branching/routing decisions in $\tau$ (which arms of conditional gates were taken, which optional nodes activated). $D_{\mathrm{shape}}=\mathbb{1}[g^{(1)}\neq g'^{(1)}]$ then compares branch activations across $\tau,\tau'$. We treat this as a natural extension of Definition 7 rather than an additional formal construct; the loop and non-loop cases use the same indicator on $g^{(t)}$.
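The structural parts of Definition 9 can be computed directly from logged call sequences and iteration topologies. The sketch below returns $D_{\mathrm{iter}}$, $D_{\mathrm{shape}}$, and the $D_{\mathrm{struct}}$ indicator; $D_{\mathrm{output}}$ is left out because it needs the per-node distance kernels. The trace representation (a list of invoked node names plus a list of $(a^{(t)}, Q^{(t)})$ tuples) and the example traces are assumptions for illustration.

```python
from collections import Counter


def trajectory_divergence(calls_a, calls_b, shape_a, shape_b):
    """Return (D_iter, D_shape, D_struct) for two trajectories.

    calls_*: node names in invocation order.
    shape_*: iteration topologies g^(t) = (action, params), one per iteration.
    """
    count_a, count_b = Counter(calls_a), Counter(calls_b)
    # D_iter: total absolute difference in per-node invocation counts.
    d_iter = sum(abs(count_a[v] - count_b[v]) for v in set(count_a) | set(count_b))
    # D_shape: mismatches over the first min(k*, k*') iterations (zip truncates).
    d_shape = sum(ga != gb for ga, gb in zip(shape_a, shape_b))
    # D_struct: 1 when the sets of activated nodes differ.
    d_struct = int(set(count_a) != set(count_b))
    return d_iter, d_shape, d_struct


# Hypothetical traces: one executes a tool before composing, the other composes directly.
calls_a = ["rewriter", "planner", "tool", "planner", "composer"]
calls_b = ["rewriter", "planner", "composer"]
shape_a = [("EXECUTE", "search"), ("COMPOSE", None)]
shape_b = [("COMPOSE", None)]
result = trajectory_divergence(calls_a, calls_b, shape_a, shape_b)
print(result)
```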

###### Definition 10 (Bifurcation Threshold).

For node $v_i$ upstream of a loop or conditional branch:

$$\beta_{\mathrm{shape}}(v_i)=\inf\{d_i(f_i(x),f_i(x')):D_{\mathrm{shape}}>0\} \tag{7}$$

$\beta_{\mathrm{iter}}(v_i)$ is defined analogously for $D_{\mathrm{iter}}>0$.

Nodes with small $\beta_{\mathrm{shape}}$ are *bifurcation-sensitive*: minor output variations cause structural path divergence that $\sigma_{ij}$ alone cannot detect.
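On a finite set of baseline-perturbed pairs, the infimum in Equation (7) can be approximated by the smallest upstream distance among pairs whose trajectories diverged in shape. This is only a sketch: the sample minimum is an upper bound on the true infimum, and the function name and the sample values are assumptions.

```python
def bifurcation_threshold(pairs):
    """pairs: (d_i, d_shape) per baseline-perturbed pair.

    Returns the smallest observed d_i with d_shape > 0, or None if no pair
    bifurcated. The sample minimum only upper-bounds the infimum in Eq. (7).
    """
    hits = [di for di, d_shape in pairs if d_shape > 0]
    return min(hits) if hits else None


# Made-up interventional pairs for one upstream node.
sample = [(0.05, 0), (0.12, 1), (0.30, 1), (0.08, 0)]
beta_hat = bifurcation_threshold(sample)
print(beta_hat)
```

The same helper applies to $\beta_{\mathrm{iter}}$ by passing $D_{\mathrm{iter}}$ in place of $D_{\mathrm{shape}}$.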

#### Noise origins.

Node $v_i$ is a *noise origin* if $P(d_i>\epsilon\mid d_j=0\;\forall\,j\in\mathrm{pa}(i))>0$; otherwise it is a *propagator*. This partitions nodes into intrinsic variance sources and inherited variance carriers.
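The noise-origin condition can be checked empirically by restricting to trace pairs whose parent outputs are all identical and asking whether the node still moved. The helper below is a hedged sketch with an assumed input layout (a tuple of parent distances plus the node's own distance per pair); it returns `None` when no pair has fully fixed parents, in which case the classification cannot be made from the data.

```python
def is_noise_origin(pairs, eps=1e-6):
    """pairs: (parent_distances, d_i) per trace pair.

    True if the node varied (d_i > eps) in at least one pair where every
    parent distance is exactly 0; None if no such pair exists.
    """
    fixed = [d_i for parents, d_i in pairs if all(d == 0 for d in parents)]
    if not fixed:
        return None
    return any(d_i > eps for d_i in fixed)


# Made-up pairs for a node with two parents.
sample = [((0.0, 0.0), 0.2), ((0.1, 0.0), 0.4), ((0.0, 0.0), 0.0)]
print(is_noise_origin(sample))
```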

### 2.5 Evaluation Principles and Estimation

#### Distribution Faithfulness.

A per-node evaluation is *distribution-faithful* w.r.t. edge $(v_i,v_j)$ if $D_{\mathrm{KL}}(P_{\mathrm{prod}}(T_i)\,\|\,P_{\mathrm{eval}}(T_i))<\delta$. Unfaithful evaluation at high-$\sigma$ edges amplifies quality estimation errors downstream.
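For a categorical field, the faithfulness check reduces to a KL divergence between two empirical distributions. The sketch below uses additive smoothing so that values absent from the evaluation set do not produce an infinite divergence; the smoothing constant `alpha`, the function name, and the sample routing labels are assumptions, and the paper does not specify its estimator in this section.

```python
import math
from collections import Counter


def kl_divergence(prod_samples, eval_samples, alpha=1.0):
    """Smoothed D_KL(P_prod || P_eval) for categorical samples, in nats."""
    support = set(prod_samples) | set(eval_samples)
    p_counts, q_counts = Counter(prod_samples), Counter(eval_samples)
    p_total = len(prod_samples) + alpha * len(support)
    q_total = len(eval_samples) + alpha * len(support)
    kl = 0.0
    for value in support:
        p = (p_counts[value] + alpha) / p_total
        q = (q_counts[value] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


# Made-up routing labels: production skews to "fast", the eval set is balanced.
prod = ["fast"] * 8 + ["slow"] * 2
evals = ["fast"] * 5 + ["slow"] * 5
gap = kl_divergence(prod, evals)
print(gap)
```

A node's evaluation set would be flagged as unfaithful when `gap` exceeds the application-chosen threshold $\delta$.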

#### Cross-Stage Regression Detection.

The *impact set* $\mathcal{I}(v_i,\alpha)=\{v_j:\sigma(p)>\alpha\text{ for some }p\colon v_i\to v_j\}$ identifies downstream nodes requiring re-evaluation when $v_i$ changes; per-edge *drift budgets* $\tau_{ij}$ specify the maximum upstream drift below which $v_j$ remains within its noise floor.
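The impact set can be computed by walking every path out of the changed node while accumulating the sensitivity product. The sketch below assumes an acyclic edge map and exhaustive path exploration, which is fine for small pipelines; the function name, node labels, and $\sigma$ values are hypothetical.

```python
def impact_set(sigma, start, alpha):
    """Nodes reachable from `start` via some path with sigma(p) > alpha.

    sigma: {(parent, child): edge sensitivity}; assumed acyclic.
    """
    children = {}
    for (u, v), s in sigma.items():
        children.setdefault(u, []).append((v, s))
    impacted = set()
    stack = [(start, 1.0)]  # (node, accumulated path product)
    while stack:
        node, acc = stack.pop()
        for child, s in children.get(node, []):
            if acc * s > alpha:
                impacted.add(child)
            stack.append((child, acc * s))
    return impacted


# Hypothetical graph: a -> b -> {c, d}.
sigma = {("a", "b"): 2.0, ("b", "c"): 0.2, ("b", "d"): 0.6}
print(sorted(impact_set(sigma, "a", 0.5)))
```

Here `c` drops out because the absorber edge into it pulls the path product below the threshold, while `d` stays in the re-evaluation set.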

#### Estimation.

$\sigma_{ij}$, $\lambda_{ij}$, $\tau_{ij}$, and noise origin classification are estimable from *observational* traces via partial regression; bifurcation thresholds $\beta_{\mathrm{shape}},\beta_{\mathrm{iter}}$ require *interventional* experiments because they describe decision boundaries rather than continuous relationships. Definition 3 is stated interventionally; the observational estimator approximates it by partialing out parent co-variation, which is unbiased only when parents have no shared unmeasured causes. When intake nodes share an upstream input (e.g., a common user query in System P), residual confounding can bias $\hat{\sigma}_{ij}$; we use the interventional corpus (Section 5.2) to validate the observational estimates at perturbation-targetable edges, and treat agreement between the two as evidence that the approximation holds for the pipelines studied. Observational analysis identifies proximate bifurcation sources; interventional analysis reveals dormant risks beyond the natural operating envelope. The matrix is locally valid at the current operating point and should be re-estimated after large configuration changes (§[6](https://arxiv.org/html/2605.23956#S6)). Full details: Appendix B.

## 3 Related Work

#### Optimization of compound AI systems.

DSPy (Khattab and others, [2024](https://arxiv.org/html/2605.23956#bib.bib7)) introduces typed signatures and optimizes end-to-end through prompt tuning; TextGrad (Yuksekgonul and others, [2024](https://arxiv.org/html/2605.23956#bib.bib8)) backpropagates textual feedback as gradients; Trace/OPTO (Cheng et al., [2024](https://arxiv.org/html/2605.23956#bib.bib9)) and LLM-AutoDiff (Yin and Wang, [2025](https://arxiv.org/html/2605.23956#bib.bib10)) extend this with richer execution traces. Chen et al. ([2026](https://arxiv.org/html/2605.23956#bib.bib11)) show that textual gradients degrade exponentially with depth, and TextResNet (Huang et al., [2026](https://arxiv.org/html/2605.23956#bib.bib12)) identifies semantic entanglement as a deep-pipeline challenge; both respond with better optimizers. The prior step is missing: practitioners need to understand propagation dynamics before optimizing. The EMNLP 2025 survey (Lee et al., [2025](https://arxiv.org/html/2605.23956#bib.bib13)) reviews 26 optimization methods without identifying sensitivity analysis as a direction.

#### Failure attribution in agentic systems.

AgenTracer (Zhang et al., [2025](https://arxiv.org/html/2605.23956#bib.bib2)) replaces each agent action with an oracle-corrected alternative to identify the earliest fault-flipping correction; RAFFLES (Zhu et al., [2026](https://arxiv.org/html/2605.23956#bib.bib3)) uses LLM-based causal-hierarchy reasoning. Both produce binary attribution rather than continuous sensitivity measurement, and both require auxiliary models that introduce uncharacterized error. Our framework requires no auxiliary model: measurements are computed directly from production traces via type-dispatched distance and standard regression. The text-distance kernel takes any sentence-embedding $\phi$ (we use `all-MiniLM-L6-v2`); $\phi$ is non-generative, not an LLM judge or oracle, so the “no auxiliary model” claim specifically excludes generative/judge models, the distinction that matters for evaluation cost and reliability. Watershed (Parikh and Dumit, [2025](https://arxiv.org/html/2605.23956#bib.bib24)) distinguishes property from correctness evaluations at component and task levels; our work formalizes this with the sensitivity matrix and distribution faithfulness (Es et al., [2024](https://arxiv.org/html/2605.23956#bib.bib21)).

#### Formal analysis of LLM system dynamics.

UProp (Duan et al., [2025](https://arxiv.org/html/2605.23956#bib.bib1)) decomposes per-step uncertainty into intrinsic and extrinsic components via pointwise mutual information on sequential chains, requiring model logprobs; we handle arbitrary computation graphs at the typed-output level using only logged intermediate outputs. “From Spark to Fire” (Xie et al., [2026](https://arxiv.org/html/2605.23956#bib.bib4)) formalizes cascade amplification via spectral-radius analysis but assumes homogeneous interaction matrices; our typed representations require heterogeneous, type-dispatched per-edge metrics (Shen et al., [2025](https://arxiv.org/html/2605.23956#bib.bib16)). He et al. ([2025](https://arxiv.org/html/2605.23956#bib.bib5)) apply information-bottleneck analysis, showing stage-wise information loss is unrecoverable. “Geometric Dynamics” (Tacheny, [2025](https://arxiv.org/html/2605.23956#bib.bib6)) classifies loop behavior as contractive/oscillatory/exploratory but does not address upstream-perturbation-driven transitions, which our bifurcation thresholds do. Kim et al. ([2025](https://arxiv.org/html/2605.23956#bib.bib14)) measure a $17.2\times$ error amplification in unstructured multi-agent networks, giving empirical context for our per-edge cascade quantification (Liu et al., [2026](https://arxiv.org/html/2605.23956#bib.bib15); Eslami and Yu, [2026](https://arxiv.org/html/2605.23956#bib.bib18)).

No existing approach combines computation-graph-level sensitivity analysis, type-dispatched distance, bifurcation analysis for conditional paths, and the distribution-faithfulness / cross-stage-regression evaluation principles (capability comparison in Appendix [C](https://arxiv.org/html/2605.23956#A3)).

## 4 Experimental Setup

We validate the framework on two production enterprise conversational AI pipelines with different architectures, referred to as System P and System Q (Figure [1](https://arxiv.org/html/2605.23956#S0.F1)).

#### System P

is a three-phase computation graph: a *First Wave* of parallel intake nodes (rewriter, signal analysis, context sufficiency); a *Discovery* phase where the planner selects and merges retrieval tools (company search, knowledge graph, web search); and a *Planner Loop* ($k_{\max}=5$) that iterates over subtasks, choosing per iteration to execute a tool, compose, or retry discovery. Signal analysis gates conditional paths (small-talk short-circuit, web-search-hint).

#### System Q

is a dual-planner pipeline: a fast and slow LLM process the query in parallel, and a winner-selection mechanism routes to a fast path (simple queries, history cache hits) or a slow path through company search, reranking, and generation. It has no replanning loop but branches conditionally at the planner and fast/slow selection.

#### Trace collection.

LLM calls use a fixed determinism contract (temperature $=0$, seed $=42$) via Azure OpenAI; text distances use `all-MiniLM-L6-v2`. System P: an *observational corpus* of 1,500 traces (500 seeds $\times$ 3 repeats) and an *interventional corpus* of 3,183 baseline-perturbed pairs across five classes (rewriter swap, signal turn-type/web-search-hint/small-talk flips, context-sufficiency override). System Q: 1,497 observational traces drawn from production replay across 209 distinct utterance payloads with *heterogeneous* per-utterance group sizes (production traffic is not uniform; some utterances appear far more often than others). Pairs are enumerated as all unordered same-input pairs within each group, $\sum_g\binom{n_g}{2}=24{,}693$. The pair count substantially exceeds the $500\cdot\binom{3}{2}=1{,}500$ implied by a balanced 500 $\times$ 3 design because of the skewed group-size distribution.
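The pair-count arithmetic is easy to check: each group of $n_g$ same-input traces contributes $\binom{n_g}{2}$ unordered pairs. The snippet below reproduces the balanced 500 $\times$ 3 figure and uses two made-up group-size lists (not the System Q data) to show how skew inflates the pair count for the same number of traces.

```python
from math import comb


def same_input_pairs(group_sizes):
    """Sum of C(n_g, 2) over groups of same-input traces."""
    return sum(comb(n, 2) for n in group_sizes)


# Balanced design from the text: 500 inputs with 3 repeats each.
balanced = same_input_pairs([3] * 500)

# Hypothetical 12-trace corpora: uniform groups vs. one dominant utterance.
uniform_small = same_input_pairs([3, 3, 3, 3])
skewed_small = same_input_pairs([9, 1, 1, 1])
print(balanced, uniform_small, skewed_small)
```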

#### Perturbation design.

System P perturbations target upstream nodes at controlled magnitudes: rewriter swaps four prompt variants (mean $d_{\mathrm{rewriter}}\in[0.696,0.754]$); signal analysis flips categorical fields (`turn_type`, `web_search_hint`, `is_small_talk`); context sufficiency forces `sufficient=True`, removing discovery from the graph. Interventional results are stratified by effective vs. no-op pairs (the latter, where baseline already matches the perturbed value, serve as built-in negative controls).

#### Distance metrics.

We use uniform per-field aggregation under the type-dispatched kernels of Definition 2. Definition 2 admits any per-field weights $w_k\geq 0$ with $\sum_k w_k=1$ and any per-type kernel; the specific instantiation is application-dependent and the framework's measurements ($\hat{\sigma}$, $\lambda$, $\beta$) are well-defined for any admissible choice.

## 5 Results

We validate each formal object from Section 2 on production traces from Systems P and Q. Our goal is not to characterize these specific pipelines but to demonstrate that the framework's measurements are computable, reveal non-trivial structure, and generalize across architectures.

### 5.1 Sensitivity Coefficients Reveal Threshold-Sensitive Structure

Table [1](https://arxiv.org/html/2605.23956#S5.T1) reports edge sensitivity $\hat{\sigma}_{ij}$ and occurrence-lift $\lambda_{ij}$ for System P: two edges amplify ($\hat{\sigma}>1$; rewriter $\to$ discovery, planner $\to$ composer), four absorb. The full ratio distribution reveals that mean-classified amplifiers are bimodal: a majority absorber regime (52–65% of pairs, ratio $<1$) and a minority amplification tail (12–23%, ratio $>1.5$, max $\sim 7\times$); the mean lands near unity because the two regimes roughly balance. Tail pairs cluster around two mechanisms: retrieval-boundary flips at single-word query changes, and categorical field flips at composition (the latter contributes minimally to composite output distance but gates execution routing).

$\hat{\sigma}$ and $\lambda$ are decoupled, carrying distinct information (Table [1](https://arxiv.org/html/2605.23956#S5.T1)): rewriter $\to$ discovery has modest $\hat{\sigma}=1.128$ but near-total coupling $\lambda=+0.994$; signal $\to$ planner has meaningful $\hat{\sigma}=0.857$ but negligible $\lambda=+0.041$.

System Q exhibits a qualitatively different sensitivity profile (Table [2](https://arxiv.org/html/2605.23956#S5.T2)). Multiple edges show $\hat{\sigma}\gg 1$, with amplification factors up to $9\times$ on well-sampled edges. Fast LLM and Slow LLM (the parallel intake planners) exhibit moderate $D_{\mathrm{noise}}$ (0.180, 0.172) consistent with their first-stage position as likely intrinsic variance sources; the formal noise-origin classification (§2.4, Appendix B.4) is not reported for System Q because the production-replay corpus lacks pairs with byte-identical upstream payloads at every intake node simultaneously. The framework correctly identifies the post-Reranker absorber edge ($\hat{\sigma}=0.34$) and the deterministic tool-execution terminals (insensitive, $\hat{\sigma}=0$ from all upstream nodes).

Table 1: Edge sensitivity and occurrence-lift, System P (1,500 observational traces). $n$ is the count of pairs with $d_i>\epsilon$. *Class:* amplifier ($\hat{\sigma}>1$), absorber ($\hat{\sigma}<1$). The median ratio is reported alongside the mean to expose multimodal distributions.

Table 2: Edge sensitivity, System Q (24,693 same-input pairs). Edges named per the architecture in Figure [1(b)](https://arxiv.org/html/2605.23956#S0.F1.sf2). $n$ is the count of pairs with $d_i>\epsilon$. $D_{\mathrm{noise}}$ columns report the within-input noise floor at each node. Tool-execution terminals (insensitive, $\hat{\sigma}=0$ from every upstream) are omitted as they carry no per-edge signal. We omit $\lambda_{ij}$ for System Q because the heterogeneous group-size pairing (production replay, see §4) yields highly variable $|d_i\leq\epsilon|$ partition sizes per edge, making per-edge $\hat{\lambda}$ estimates noisy and not directly comparable to System P's balanced-design $\lambda$ values in Table [1](https://arxiv.org/html/2605.23956#S5.T1).

### 5.2 Path Sensitivity Approximation Holds Empirically

We computed transitiveσ^​\(i,j\)\\hat\{\\sigma\}\(i,j\)for reachable pairs in System P and compared against the path\-product approximation \(Definition 6\)\. The multiplicative approximation agrees within±4%\\pm 4\\%\(rewriter→\\tocomposer: empirical 0\.366 vs\. product 0\.356; signal\_analysis→\\tocomposer: 0\.879 vs\. 0\.916\), indicating interaction terms are negligible for this pipeline\. Transitive measurements revealed risk profiles invisible in the direct\-edge view: one source node with no direct edge to the output exhibited transitiveσ^=0\.879\\hat\{\\sigma\}=0\.879\(nearly unity\), while another with a direct amplifier edge to retrieval showed transitiveσ^=0\.366\\hat\{\\sigma\}=0\.366to the output, dampened by intervening absorbers\.

### 5.3 Three-Component Divergence Captures Structural Information

Table [3](https://arxiv.org/html/2605.23956#S5.T3) reports the trajectory divergence distribution across both corpora and both systems.

Table 3: Trajectory divergence rates. System P observational from 1,500 traces; System P interventional under rewriter variant swap (515 baseline-perturbed pairs). System Q from 24,693 same-input pairs (1,497 traces). Rates are population aggregates over the indicator $\mathbb{1}[D_{\bullet} > 0]$.

In System P, the majority of observational divergences (76.3%) are pure value drift; 12.9% exhibit structural divergence; 3.3% exhibit iteration count divergence. Under interventional perturbation, structural divergence approximately triples (12.9% $\to$ 37.5% for $D_{\mathrm{shape}}$), confirming that upstream perturbation causally drives trajectory bifurcation. System Q shows higher baseline structural divergence (25.1% $D_{\mathrm{shape}}$) and a non-trivial 23.7% $D_{\mathrm{iter}}$ that comes from sub-stage call-count variability (e.g., reranker invocations, tool-execution counts) rather than architectural loops: the metric captures any per-node count divergence, not only loop iteration counts. The fact that $D_{\mathrm{struct}}$ (20.6%) is lower than $D_{\mathrm{shape}}$ (25.1%) for System Q reflects that some branch-decision differences ($D_{\mathrm{shape}} > 0$) are within-path: routing parameters or sub-stage decisions vary while the activated node set is unchanged, so $D_{\mathrm{struct}}$ does not fire.

Critically, different perturbation types on System P produce nearly identical $D_{\mathrm{shape}}$ rates through mechanistically distinct cascades (Table [4](https://arxiv.org/html/2605.23956#S5.T4)). Rewriter variant swap and signal turn-type flip both yield ${\sim}37\%$ $D_{\mathrm{shape}}$, but the cascade signatures differ: the rewriter cascade amplifies through retrieval and then triggers a categorical flip at the planner ($d_{\mathrm{planner}} \approx 0.01$), while the turn-type cascade amplifies directly through the planner ($d_{\mathrm{planner}} \approx 0.27$) into the composer ($d_{\mathrm{composer}} \approx 0.53$). Web search hint perturbation produces the highest $D_{\mathrm{shape}}$ (75.9%) by inserting a new node into the graph topology. Context sufficiency override produces 63.4% $D_{\mathrm{shape}}$ in the effective stratum by *removing* a node. These cascade patterns are distinguishable only through per-node sensitivity decomposition.

Table 4: Cascade patterns across interventional perturbations on System P. Same near-boundary seed set (115 utterances), four upstream perturbation classes, four mechanistically distinct cascades. *Cascade pattern* is the qualitative shape of $d$ along the downstream chain conditional on $D_{\mathrm{shape}} > 0$. CS-override rates are reported on the *effective* stratum (262 pairs where the override changed the natural value).

### 5.4 Bifurcation Thresholds Are Measurable

We estimated bifurcation thresholds \(Definition 10\) under both observational and interventional conditions for System P\.

Under observational estimation, the planner is the sole bifurcation source ($\hat{\beta}_{\mathrm{shape}} = 0.102$); all other nodes show $\hat{\beta} = 0$, meaning structural divergence occurs even when those nodes' outputs are identical across traces. Under interventional perturbation, the rewriter exhibits $\hat{\beta}_{\mathrm{shape}} \approx 0.25$ (an upper bound under sparse coverage below 0.4; see §[6](https://arxiv.org/html/2605.23956#S6)): below this threshold, no structural divergence is observed; above it, 37.5% of traces bifurcate.

The two estimates are complementary: observational $\beta$ identifies the proximate intrinsic source (planner) for monitoring; interventional $\beta$ identifies the externally induced threshold (rewriter) for change management. At the boundary, the cascade exhibits amplify-then-flip dynamics: moderate upstream drift ($d_{\mathrm{rewriter}} \approx 0.225$) amplifies at retrieval ($d_{\mathrm{discovery}} \approx 0.367$), collapses to a near-zero planner output change ($d_{\mathrm{planner}} \approx 0.01$) with a categorical field flip, and triggers structural divergence.
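One way to picture the estimation step is the following minimal sketch. Definition 10 is not reproduced in this section, so the code assumes (following the paper's abstract) that the threshold is the smallest upstream distance at which a structural path change is observed. The function name `estimate_beta_shape` and the sample pairs are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def estimate_beta_shape(pairs: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Smallest upstream distance d_i among structurally divergent pairs.

    Each pair is (d_i, diverged), where `diverged` means D_shape > 0 for
    that trace pair. Returns None when no pair diverged, i.e. no threshold
    was observed in the sample.
    """
    diverged = [d for d, flag in pairs if flag]
    return min(diverged) if diverged else None


# Hypothetical trace pairs, not data from the paper.
observed = [(0.00, False), (0.05, False), (0.12, True), (0.30, True), (0.08, False)]
beta_hat = estimate_beta_shape(observed)
```

Under this reading, an estimate of 0 arises whenever a structurally divergent pair has identical outputs at the node in question. When small distances are sparsely sampled, the minimum is only an upper bound on the true threshold.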

### 5.5 Cross-Architecture Validation: DSPy Multi-Hop QA

To validate generalization, we instrument a DSPy (Khattab et al., [2024](https://arxiv.org/html/2605.23956#bib.bib7)) multi-hop QA pipeline on HotpotQA (Yang et al., [2018](https://arxiv.org/html/2605.23956#bib.bib28)), a strict sequential chain (gen\_query\_1 $\to$ retrieve\_1 $\to$ gen\_query\_2 $\to$ retrieve\_2 $\to$ gen\_answer) with no parallel intake, no loops, and no conditional branching. We collect 1,500 traces (500 dev questions $\times$ 3 repeats) under the same determinism contract, using ColBERTv2 (Santhanam et al., [2022](https://arxiv.org/html/2605.23956#bib.bib27)) retrieval over per-question distractor passages (the canonical DSPy setup).

Across all 1,500 same-input pairs, $D_{\mathrm{iter}} = D_{\mathrm{shape}} = D_{\mathrm{struct}} = 0$, as predicted for a chain without conditional branches, and a clean separation from Systems P (12.9% $D_{\mathrm{shape}}$) and Q (25.1%). Value drift remains non-zero ($D_{\mathrm{noise}}$ system mean 0.255), confining variation to the $D_{\mathrm{output}}$-only regime of Definition 9. Edge sensitivities (Table [5](https://arxiv.org/html/2605.23956#S5.T5)) split cleanly: the upstream rewriter gen\_query\_1 amplifies downstream ($\hat{\sigma} \approx 3.2$, bimodal); the final retrieve $\to$ gen\_answer edges absorb sharply ($\hat{\sigma} \approx 0.28$–$0.31$). A matched-protocol BM25 replication (Appendix [E](https://arxiv.org/html/2605.23956#A5)) preserves the sign of dominant amplifiers and absorbers. Three near-unity edges ($|\hat{\sigma} - 1| < 0.4$) cross $\hat{\sigma} = 1$ between retrievers, which is exactly the regime where Definition 3's binary classification is most kernel-sensitive, yet the framework's decomposition still isolates the mechanism (e.g., gen\_query\_1 $\to$ retrieve\_1: intrinsic-noise term $0.000$ under both retrievers; median ratio $d_j/d_i$ shifts from $0.26$ to $0.86$, indicating upstream-coupling sensitivity rather than per-node noise; full numbers in Appendix [E](https://arxiv.org/html/2605.23956#A5)). Two production interventions (Appendix [F](https://arxiv.org/html/2605.23956#A6)) reinforce this: framework outputs surfaced pipeline-level couplings invisible to component-level evaluation, driving prompt-level and architectural changes that improved quality and reduced variance.

Table 5: Edge sensitivity, DSPy multi-hop QA (1,500 same-input pairs, 500 HotpotQA dev questions $\times$ 3 repeats; ColBERTv2 retrieval). Selected edges; full 10-edge matrix in Appendix [E](https://arxiv.org/html/2605.23956#A5).

### 5.6 Distribution Faithfulness Quantifies Per-Node Evaluation Gap

We computed distribution faithfulness \(Principle 1\) on 1,500 System P traces, comparing per\-node golden vs\. actual outputs\.

Table 6: Distribution faithfulness gap per node, System P. Per-field $d$ between actual run outputs and golden expected outputs, averaged across 1,500 traces ($\times$ field count). Type-dispatched kernels per Section 2.2; fragment-reference fields use recall-based distance to handle the broad-recall vs. relevant-subset asymmetry. *Min/max field* report the per-field range within the node. † Discovery's gap of 0.900 reflects a structural retrieval/golden-set asymmetry (broad retrieval set vs. smaller golden-relevant subset), not model misalignment.

Gaps span an order of magnitude: 0.047 (signal analysis, well-characterized) to 0.696 (composer); discovery's recall-based gap of 0.900 reflects retrieval/golden-set asymmetry rather than model misalignment. Per-field variation within nodes is also substantial (planner routing fields 0.03–0.16; context fields 0.48–0.93).

## 6 Conclusion, Limitations, and Future Work

QUIVER provides a set of primitives ($\sigma$, $\lambda$, three-component divergence, $\beta$, and distribution faithfulness) that capture propagation dynamics invisible to component-level and end-to-end evaluation; validation across Systems P, Q, and DSPy shows architecture-appropriate measurements with no modification. *Limitations:* $\hat{\sigma}$ is locally valid; re-estimation is needed after large prompt/model/index changes. Rewriter drift covered $[0.696, 0.754]$; $\hat{\beta}_{\mathrm{shape}} \approx 0.25$ is an upper bound under sparse coverage below $0.4$. Definitions 2–4 are kernel-/weight-agnostic; reported values are one instantiation. *Future work:* data-driven weight estimation, broader pipeline classes, non-conversational domains; broader impact in Appendix [D](https://arxiv.org/html/2605.23956#A4).

## References

- M\. Chen, W\. Deng, J\. Zou, H\. Yu, and X\. Li \(2026\)Textual equilibrium propagation for deep compound ai systems\.arXiv preprint arXiv:2601\.21064\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px1.p1.1)\.
- C\. Cheng, A\. Nie, and A\. Swaminathan \(2024\)Trace is the next autodiff: generative optimization with rich feedback, execution traces, and llms\.Advances in Neural Information Processing Systems37,pp\. 71596–71642\.Cited by:[§1](https://arxiv.org/html/2605.23956#S1.p2.1),[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px1.p1.1)\.
- J\. Duan, J\. Diffenderfer, S\. Madireddy, T\. Chen, B\. Kailkhura, and K\. Xu \(2025\)Uprop: investigating the uncertainty propagation of llms in multi\-step agentic decision\-making\.arXiv preprint arXiv:2506\.17419\.Cited by:[§A\.9](https://arxiv.org/html/2605.23956#A1.SS9.p5.1),[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px3.p1.1)\.
- S\. Es, J\. James, L\. E\. Anke, and S\. Schockaert \(2024\)Ragas: automated evaluation of retrieval augmented generation\.InProceedings of the 18th conference of the european chapter of the association for computational linguistics: system demonstrations,pp\. 150–158\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px2.p1.2)\.
- A\. Eslami and J\. Yu \(2026\)A control\-theoretic foundation for agentic systems\.arXiv preprint arXiv:2603\.10779\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px3.p1.1)\.
- S\. He, A\. Narayan, I\. S\. Khare, S\. W\. Linderman, C\. Ré, and D\. Biderman \(2025\)An information theoretic perspective on agentic system design\.arXiv preprint arXiv:2512\.21720\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px3.p1.1)\.
- S\. Huang, M\. Li, H\. Yu, and X\. Li \(2026\)TextResNet: decoupling and routing optimization signals in compound ai systems via deep residual tuning\.arXiv preprint arXiv:2602\.08306\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px1.p1.1)\.
- O\. Khattab et al\. \(2024\) DSPy: compiling declarative language model calls into self\-improving pipelines\. In ICLR\. Cited by: [§1](https://arxiv.org/html/2605.23956#S1.p1.1),[§1](https://arxiv.org/html/2605.23956#S1.p2.1),[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px1.p1.1),[§5\.5](https://arxiv.org/html/2605.23956#S5.SS5.p1.5)\.
- Y\. Kim, K\. Gu, C\. Park, C\. Park, S\. Schmidgall, A\. A\. Heydari, Y\. Yan, Z\. Zhang, Y\. Zhuang, Y\. Liu,et al\.\(2025\)Towards a science of scaling agent systems\.arXiv preprint arXiv:2512\.08296\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px3.p1.1)\.
- Y\. Lee, G\. Yi, M\. Liu, J\. Lu, G\. Yang, and Y\. Chen \(2025\)Compound ai systems optimization: a survey of methods, challenges, and future directions\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 28748–28763\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px1.p1.1)\.
- J\. Liu, C\. Liu, and H\. Shen \(2026\)ValueFlow: measuring the propagation of value perturbations in multi\-agent llm systems\.arXiv preprint arXiv:2602\.08567\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px3.p1.1)\.
- I\. Parikh and A\. Dumit \(2025\)A practical framework for LLM system evaluations for multi\-step processes\.Note:[https://watershed\.com/blog/a\-practical\-framework\-for\-llm\-system\-evaluations\-for\-multi\-step\-processes](https://watershed.com/blog/a-practical-framework-for-llm-system-evaluations-for-multi-step-processes)Accessed: 2026\-04\-28Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px2.p1.2)\.
- K\. Santhanam, O\. Khattab, J\. Saad\-Falcon, C\. Potts, and M\. Zaharia \(2022\)ColBERTv2: effective and efficient retrieval via lightweight late interaction\.InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 3715–3734\.Cited by:[§5\.5](https://arxiv.org/html/2605.23956#S5.SS5.p1.5)\.
- Y\. Shao, Y\. Jiang, T\. Kanell, P\. Xu, O\. Khattab, and M\. Lam \(2024\)Assisting in writing wikipedia\-like articles from scratch with large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 6252–6278\.Cited by:[§1](https://arxiv.org/html/2605.23956#S1.p1.1)\.
- X\. Shen, Y\. Liu, Y\. Dai, Y\. Wang, R\. Miao, Y\. Tan, S\. Pan, and X\. Wang \(2025\)Understanding the information propagation effects of communication topologies in llm\-based multi\-agent systems\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 12358–12372\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px3.p1.1)\.
- N\. Tacheny \(2025\)Geometric dynamics of agentic loops in large language models\.arXiv preprint arXiv:2512\.10350\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px3.p1.1)\.
- Y\. Xie, C\. Zhu, X\. Zhang, T\. Zhu, D\. Ye, M\. Qi, H\. Chen, and W\. Zhou \(2026\)From spark to fire: modeling and mitigating error cascades in llm\-based multi\-agent collaboration\.arXiv preprint arXiv:2603\.04474\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px3.p1.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,pp\. 2369–2380\.Cited by:[§5\.5](https://arxiv.org/html/2605.23956#S5.SS5.p1.5)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022\)React: synergizing reasoning and acting in language models\.arXiv preprint arXiv:2210\.03629\.Cited by:[§1](https://arxiv.org/html/2605.23956#S1.p1.1),[§1](https://arxiv.org/html/2605.23956#S1.p2.1),[§2\.4](https://arxiv.org/html/2605.23956#S2.SS4.p1.3)\.
- L\. Yin and Z\. Wang \(2025\)Llm\-autodiff: auto\-differentiate any llm workflow\.arXiv preprint arXiv:2501\.16673\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px1.p1.1)\.
- M\. Yuksekgonul et al\. \(2024\) TextGrad: automatic "differentiation" via text\. arXiv preprint arXiv:2406\.07496\. Cited by: [§1](https://arxiv.org/html/2605.23956#S1.p2.1),[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px1.p1.1)\.
- G\. Zhang, J\. Wang, J\. Chen, W\. Zhou, K\. Wang, and S\. Yan \(2025\)AgenTracer: who is inducing failure in the llm agentic systems?\.arXiv preprint arXiv:2509\.03312\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px2.p1.2)\.
- C\. Zhu, S\. Hong, J\. Wu, K\. Chawla, Y\. Tang, Y\. Yin, N\. Wolfe, E\. Babinsky, and D\. Liu \(2026\)Raffles: reasoning\-based attribution of faults for llm systems\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 7659–7688\.Cited by:[§3](https://arxiv.org/html/2605.23956#S3.SS0.SSS0.Px2.p1.2)\.

## Appendix A Extended Discussion of Definitions

This appendix provides extended discussion, interpretation, and worked examples for each formal definition in Section 2\.

### A.1 Typed Pipeline Graph (Definition 1)

The typed pipeline graph $\mathcal{G} = (V, E, \mathcal{T}, \mathcal{F})$ captures the essential structure of any compound LLM pipeline. Each node $v_i$ represents a computational unit, typically an LLM call, a retrieval operation, a deterministic transformation, or a routing decision. The stochastic function $f_i : \prod_{j \in \mathrm{pa}(i)} T_j \rightarrow \Delta(T_i)$ maps parent outputs to a *distribution* over outputs rather than a single output, reflecting the inherent non-determinism of LLM calls (temperature, sampling, backend non-determinism) and stochastic retrieval operations.

The typed output spacesTiT\_\{i\}are determined by each node’s implementation:

- Schema-typed objects: JSON conforming to a defined schema (e.g., a planner output with fields for task, intent type, remaining subtasks, and reasoning trace). Formally, $T_i = T_i^{(1)} \times \cdots \times T_i^{(m)}$ is a product of $m$ typed fields.
- Ordered lists: Ranked sequences with typed elements (e.g., a retrieval node returning a ranked list of document fragments).
- Categorical values: Elements from a finite set (e.g., an intent classifier outputting one of $\{\texttt{FAQ}, \texttt{troubleshoot}, \texttt{request}\}$).
- Unstructured text: Free-form natural language (e.g., a composer node generating a response, or a rewriter node reformulating a query).

A single pipeline may combine all of these types across different nodes. External inputs to the pipeline (user queries, session state, conversation history, configuration parameters) are modeled as source nodes $v_s$ with $\mathrm{pa}(s) = \emptyset$, whose output distributions $\Delta(T_s)$ are determined by the deployment environment rather than by a learned function.

#### Worked example\.

Consider a pipeline with five nodes: Rewriter ($v_1$), Signal Analysis ($v_2$), Discovery ($v_3$), Planner ($v_4$), Composer ($v_5$). The typed output spaces are:

$$
\begin{aligned}
T_1 &= \texttt{RewriterOutput} = \{\texttt{intents}: \texttt{Map[str, List[str]]}\} \\
T_2 &= \texttt{SignalResult} = \{\texttt{turn\_type}: \texttt{Cat},\; \texttt{sentiment}: \texttt{Cat},\; \texttt{web\_hint}: \texttt{Cat},\; \texttt{is\_small\_talk}: \texttt{Bool}\} \\
T_3 &= \texttt{DiscoveryOutput} = \{\texttt{fragment\_ids}: \texttt{List[str]}\} \\
T_4 &= \texttt{PlannerOutput} = \{\texttt{intent\_type}: \texttt{Cat},\; \texttt{task}: \texttt{str},\; \texttt{remaining}: \texttt{List[str]},\; \texttt{gaps}: \texttt{List[str]},\; \texttt{thought}: \texttt{str}\} \\
T_5 &= \texttt{ComposerOutput} = \{\texttt{response}: \texttt{str},\; \texttt{selected\_refs}: \texttt{List[str]}\}
\end{aligned}
$$
The edge set $E$ encodes which nodes consume which outputs: $E = \{(v_1, v_3), (v_1, v_4), (v_2, v_4), (v_3, v_4), (v_3, v_5), (v_4, v_5)\}$, indicating that the Planner ($v_4$) has three parents ($v_1$, $v_2$, $v_3$) and the Composer ($v_5$) has two ($v_3$, $v_4$).
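The parent sets implied by this edge list can be derived mechanically. The following is a small illustrative sketch; the helper `parent_sets` and the variable names are hypothetical, and only the node labels and edges come from the worked example.

```python
from typing import Dict, List, Tuple

# v1 Rewriter, v2 Signal Analysis, v3 Discovery, v4 Planner, v5 Composer
NODES: List[str] = ["v1", "v2", "v3", "v4", "v5"]
EDGES: List[Tuple[str, str]] = [
    ("v1", "v3"), ("v1", "v4"), ("v2", "v4"),
    ("v3", "v4"), ("v3", "v5"), ("v4", "v5"),
]


def parent_sets(nodes: List[str], edges: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each node to the list of nodes whose output it consumes."""
    pa: Dict[str, List[str]] = {node: [] for node in nodes}
    for src, dst in edges:
        pa[dst].append(src)
    return pa


PA = parent_sets(NODES, EDGES)
# Nodes with no parent inside E (their inputs come from outside this edge list).
ROOTS = [node for node in NODES if not PA[node]]

for node in NODES:
    print(node, "parents:", PA[node])
```

Counting `len(PA[node])` for each node gives the in-degree, which is a quick consistency check between a pipeline diagram and its edge list.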

### A.2 Type-Dispatched Distance Metrics (Definition 2)

The key design principle is that distance metrics are*dispatched by type*: different output types require fundamentally different notions of distance\. Embedding cosine distance between two JSON objects with one categorical field flipped tells you almost nothing; a type\-aware metric that checks “did the enum value change?” gives a precise binary signal\.

For schema\-typed output spaces, the distance decomposes as a weighted sum of per\-field distances:

$$
d_i(x, y) = \sum_{k=1}^{m} w_k \cdot d_i^{(k)}\big(x^{(k)}, y^{(k)}\big) \tag{8}
$$
The per\-field distance functions are:

- Categorical fields (e.g., intent classification, action type): discrete metric $d(a,b) = \mathbb{1}[a \neq b]$. A classification flip from FAQ to troubleshoot has distance 1; an unchanged classification has distance 0.
- Boolean fields (e.g., is\_small\_talk): discrete metric $d(a,b) = \mathbb{1}[a \neq b]$.
- Set-valued fields (e.g., list of fragment IDs, set of retrieved document identifiers): normalized Jaccard distance $d(A,B) = 1 - |A \cap B| / |A \cup B|$. Two fragment sets of 20 elements each that share 15 elements have $|A \cup B| = 25$ and distance $1 - 15/25 = 0.4$.
- Ordered list fields (e.g., ranked results, remaining task list): normalized edit distance or rank correlation distance, depending on whether order carries semantic meaning.
- Numeric fields (e.g., confidence scores, response length): normalized absolute difference $d(a,b) = |a - b| / \max(|a|, |b|, \epsilon)$.
- Text fields (e.g., reasoning traces, generated responses): embedding cosine distance $d(s,t) = 1 - \cos(\phi(s), \phi(t))$ for an embedding function $\phi$. We use all-MiniLM-L6-v2 throughout our experiments.
- Mapping fields (e.g., Map\[category, List\[query\]\]): combined key-set Jaccard plus per-key text distance, averaged.
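The dispatch idea can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: it covers only the categorical, boolean, set-valued, and numeric kernels (the embedding, edit-distance, and mapping kernels are omitted), and the names `KERNELS`, `schema_distance`, and the two-field example schema are hypothetical.

```python
from typing import Any, Callable, Dict, Mapping


def discrete(a: Any, b: Any) -> float:
    """Discrete metric for categorical and boolean fields: 1 if values differ."""
    return float(a != b)


def jaccard(a, b) -> float:
    """Jaccard distance 1 - |A & B| / |A | B| for set-valued fields."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return 1.0 - len(set_a & set_b) / len(union)


def numeric(a: float, b: float, eps: float = 1e-9) -> float:
    """Normalized absolute difference |a - b| / max(|a|, |b|, eps)."""
    return abs(a - b) / max(abs(a), abs(b), eps)


# Field type tag -> distance kernel (tags are hypothetical labels).
KERNELS: Dict[str, Callable[[Any, Any], float]] = {
    "cat": discrete, "bool": discrete, "set": jaccard, "num": numeric,
}


def schema_distance(x: Mapping, y: Mapping,
                    field_types: Mapping[str, str],
                    weights: Mapping[str, float]) -> float:
    """Weighted sum of per-field distances, as in Eq. (8)."""
    return sum(weights[name] * KERNELS[kind](x[name], y[name])
               for name, kind in field_types.items())


# Hypothetical two-field schema with equal weights.
field_types = {"intent_type": "cat", "fragment_ids": "set"}
weights = {"intent_type": 0.5, "fragment_ids": 0.5}
x = {"intent_type": "FAQ", "fragment_ids": range(0, 20)}
y = {"intent_type": "troubleshoot", "fragment_ids": range(5, 25)}
total = schema_distance(x, y, field_types, weights)
```

Keeping the kernels in a lookup table means a new field type only needs one more entry, and the weighted sum itself never changes.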

#### Field weights\.

The weights $w_k$ reflect each field's causal influence on downstream nodes rather than its semantic importance to a human reader. We distinguish three categories:

- Routing fields: Fields consumed by downstream nodes in programmatic routing logic or conditional branching (e.g., intent\_type determining which agent executes, remaining determining loop continuation). Assigned weight $2w_0$.
- Context fields: Fields consumed by downstream nodes as prompt context without affecting control flow (e.g., task description, fragment\_refs). Assigned weight $w_0$.
- Observability fields: Fields used only for logging, debugging, or human inspection, not consumed by any downstream node (e.g., thought reasoning traces). Assigned weight $0$.

Here $w_0$ is a normalizing constant ensuring $\sum_k w_k = 1$. This assignment can be determined by static analysis of the pipeline code: trace which fields appear in downstream prompt templates or routing conditions. Specific weight ratios are application-dependent; the framework's measurements are defined for any admissible choice.
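The normalization can be made concrete with a short sketch. The 2:1:0 ratio follows the three categories above; the helper `field_weights` and the role labels are hypothetical, and the example covers only the planner fields whose roles the text names (the `gaps` field is left out because its role is not stated).

```python
from typing import Dict

# Relative weight per field role: routing 2*w0, context w0, observability 0.
ROLE_MULTIPLIER: Dict[str, float] = {"routing": 2.0, "context": 1.0, "observability": 0.0}


def field_weights(roles: Dict[str, str]) -> Dict[str, float]:
    """Normalize role multipliers so that the weights sum to 1."""
    raw = {field: ROLE_MULTIPLIER[role] for field, role in roles.items()}
    total = sum(raw.values())
    if total == 0:
        raise ValueError("at least one routing or context field is required")
    return {field: value / total for field, value in raw.items()}


planner_roles = {
    "intent_type": "routing",
    "remaining": "routing",
    "task": "context",
    "thought": "observability",
}
planner_weights = field_weights(planner_roles)
```

With two routing fields and one context field, the normalizing constant is $w_0 = 1/5$, so each routing field carries twice the weight of the context field and the observability field contributes nothing.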

### A.3 Edge Sensitivity (Definition 3)

The edge sensitivity coefficient $\sigma_{ij}$ quantifies the expected ratio of downstream output distance to upstream output distance:

$$
\sigma_{ij} = \mathbb{E}\left[\frac{d_j\big(f_j(\mathbf{x}),\; f_j(\mathbf{x}')\big)}{d_i(x_i, x_i')}\right] \tag{9}
$$

The expectation is taken over the distribution of all other parent inputs and the stochasticity of $f_j$, restricted to pairs where $d_i(x_i, x_i') > \epsilon$ for a small threshold $\epsilon > 0$, to avoid degenerate ratios when the upstream outputs are nearly identical.
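An empirical counterpart of this expectation is a sample mean over admissible pairs. The sketch below is one plausible estimator under that assumption, not the paper's reference code; the function name `edge_sensitivity`, the default `eps`, and the sample pairs are hypothetical. It also returns the median ratio, which Table 1 reports alongside the mean.

```python
from statistics import mean, median
from typing import Dict, Iterable, Optional, Tuple


def edge_sensitivity(pairs: Iterable[Tuple[float, float]],
                     eps: float = 1e-6) -> Optional[Dict[str, float]]:
    """Mean and median of d_j / d_i over pairs with d_i > eps.

    Each pair is (d_i, d_j): upstream and downstream distance for one
    trace pair. Returns None when no pair clears the eps threshold.
    """
    ratios = [d_j / d_i for d_i, d_j in pairs if d_i > eps]
    if not ratios:
        return None
    return {"n": len(ratios), "mean": mean(ratios), "median": median(ratios)}


# Hypothetical distances; the last pair is dropped because d_i <= eps.
pairs = [(0.10, 0.05), (0.20, 0.10), (0.10, 0.70), (0.00, 0.30)]
estimate = edge_sensitivity(pairs)
```

In this toy sample the mean ratio exceeds 1 while the median stays below 1, which is the kind of gap that motivates reporting both statistics.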

#### Classification\.

Edge sensitivity classifies each connection in the pipeline graph:

- $\sigma_{ij} > 1$: edge $(v_i, v_j)$ is an amplifier; node $v_j$ magnifies upstream variation from $v_i$.
- $\sigma_{ij} < 1$: edge $(v_i, v_j)$ is an absorber; node $v_j$ dampens upstream variation from $v_i$.
- $\sigma_{ij} \approx 0$: node $v_j$ is insensitive to $v_i$; the edge carries data but does not influence behavior.

#### Threshold\-sensitive edges\.

The scalar $\sigma_{ij}$ is a summary statistic that may obscure richer structure. Our empirical results (Section 5) reveal that edges classified as "amplifiers" by their mean $\hat{\sigma}$ can exhibit bimodal ratio distributions: a majority regime where the edge absorbs (ratio $< 1$) and a minority tail regime where the edge amplifies by factors of $3$–$7\times$. The mean lands near unity because the two regimes roughly balance. This bimodality reflects *threshold-sensitive coupling*: below a domain-specific boundary (e.g., an embedding similarity threshold in a retrieval index), the edge absorbs perturbations; above it, the edge amplifies. We recommend inspecting the full ratio distribution alongside the scalar summary to identify threshold-sensitive edges.

### A.4 Occurrence-Lift (Definition 4)

Occurrence-lift $\lambda_{ij}$ captures a complementary aspect of perturbation propagation that $\sigma_{ij}$ misses: the *probability* that downstream drift occurs at all, independent of its magnitude.

$$
\lambda_{ij} = P\big(d_j > 0 \mid d_i > 0\big) - P\big(d_j > 0 \mid d_i = 0\big) \tag{10}
$$

The thresholds are written as $d > 0$ for clarity; under continuous-valued LLM outputs the estimator partitions on a small $\epsilon > 0$ matched to the per-type kernel's numerical floor (Appendix [B](https://arxiv.org/html/2605.23956#A2), B.2), and we treat $d > \epsilon$ as the operative event throughout.
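Equation (10) translates directly into a difference of two conditional frequencies. The sketch below assumes a single shared `eps` for upstream and downstream distances, which is a simplification of the per-type floors mentioned above; the function name `occurrence_lift` and the sample pairs are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def occurrence_lift(pairs: Iterable[Tuple[float, float]],
                    eps: float = 1e-6) -> Optional[float]:
    """P(d_j > eps | d_i > eps) - P(d_j > eps | d_i <= eps).

    Each pair is (d_i, d_j). Returns None when either partition is
    empty, since the corresponding conditional frequency is undefined.
    """
    moved, still = [], []
    for d_i, d_j in pairs:
        (moved if d_i > eps else still).append(d_j > eps)
    if not moved or not still:
        return None
    return sum(moved) / len(moved) - sum(still) / len(still)


# Hypothetical pairs: 3 of 4 drift downstream when upstream drifts,
# 1 of 2 drift downstream when upstream is unchanged.
pairs = [(0.2, 0.1), (0.3, 0.0), (0.0, 0.0), (0.0, 0.4), (0.5, 0.2), (0.4, 0.3)]
lift = occurrence_lift(pairs)
```

Returning `None` for an empty partition makes the sparse-partition problem explicit instead of reporting a misleading number, which is the same concern that leads Table 2 to omit $\lambda_{ij}$ for System Q.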

The two measures are decoupled, carrying distinct information about an edge:

- High $\sigma$, low $\lambda$: When upstream drift does propagate, it propagates strongly, but it rarely does. Example: a signal analysis node with very low intrinsic variance ($D_{\mathrm{noise}} = 0.008$) but high sensitivity to the planner when it does drift ($\sigma = 0.86$).
- Low $\sigma$, high $\lambda$: Upstream drift almost always triggers some downstream drift, but the magnitude is small. Example: a rewriter-to-discovery edge where any rewrite change affects which fragments are retrieved ($\lambda = +0.994$) but the number of changed fragments is moderate ($\sigma = 1.13$).
- High $\sigma$, high $\lambda$: Strong coupling in both probability and magnitude; these are the most dangerous edges.
- Low $\sigma$, low $\lambda$: Effectively decoupled; the edge carries data but does not influence downstream behavior.

A framework measuring only $\sigma$ (magnitude) or only $\lambda$ (probability) provides an incomplete picture of edge behavior. The sensitivity matrix $\Sigma$ and the occurrence-lift matrix $\Lambda$ together characterize the full propagation dynamics.

### A.5 Path Sensitivity (Definition 6)

The path sensitivity $\sigma(p) = \prod_{l} \sigma_{i_l, i_{l+1}}$ along a directed path follows from the chain rule: if node A's output changes by 1 unit, the A $\to$ B edge has $\sigma = 2$, and the B $\to$ C edge has $\sigma = 0.5$, then the net effect from A to C is $2 \times 0.5 = 1$.

The product formulation makes two assumptions:

1. Local linearity: The sensitivity of each edge is independent of perturbation magnitude. This is valid for small perturbations around the current operating point. For large perturbations, the bifurcation analysis (Definitions 9–10) provides the appropriate tool.
2. Path independence: The path-product gives per-path sensitivity, not total sensitivity when multiple paths connect two nodes. For node pairs connected by $k$ paths, we report both $\max_{p} \sigma(p)$ and the empirically measured transitive sensitivity $\hat{\sigma}(i,j)$. Agreement between the two validates the multiplicative approximation; divergence indicates significant interaction effects.
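The path product and the maximum over paths can be sketched as follows. The A, B, C edges reuse the numbers from the example above; the direct A $\to$ C edge with $\sigma = 0.8$ and the helper names are hypothetical additions for illustration.

```python
from math import prod
from typing import Dict, Iterable, Sequence, Tuple

Edge = Tuple[str, str]


def path_sensitivity(path: Sequence[str], sigma: Dict[Edge, float]) -> float:
    """Product of edge sensitivities along consecutive nodes of a path."""
    return prod(sigma[(a, b)] for a, b in zip(path, path[1:]))


def max_path_sensitivity(paths: Iterable[Sequence[str]],
                         sigma: Dict[Edge, float]) -> float:
    """Largest per-path sensitivity among the paths joining two nodes."""
    return max(path_sensitivity(p, sigma) for p in paths)


sigma = {("A", "B"): 2.0, ("B", "C"): 0.5, ("A", "C"): 0.8}
via_b = path_sensitivity(["A", "B", "C"], sigma)
best = max_path_sensitivity([["A", "B", "C"], ["A", "C"]], sigma)
```

Comparing `best` with a measured transitive sensitivity for the same node pair is the agreement check described in the second assumption.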

#### Parallel reconvergence\.

When multiple parallel nodes feed a common downstream node, their joint effect depends on whether perturbations are independent or correlated\. The full model includes interaction terms:

$$
d_j = \sum_{i \in \mathrm{pa}(j)} \alpha_i\, d_i + \sum_{i<k \in \mathrm{pa}(j)} \gamma_{ik}\, d_i\, d_k + \varepsilon \tag{11}
$$

where $\alpha_i$ are partial sensitivity contributions, $\gamma_{ik}$ capture pairwise interactions, and $\varepsilon$ captures intrinsic stochasticity and unmeasured influences. Under independence, the joint sensitivity is approximately:

$$
\sigma_{\mathrm{joint}}(v_j) = \sqrt{\sum_{i \in \mathrm{pa}(j)} \sigma_{ij}^{2}} \tag{12}
$$

If the interaction coefficients $\gamma_{ik}$ are significant, co-occurring perturbations have reinforcing (positive $\gamma$) or compensating (negative $\gamma$) effects. In our experiments, the path-product approximation matched empirical transitive sensitivity within $\pm 4\%$, suggesting interaction terms are negligible along sequential paths in the pipelines studied.

The closed-form $\sigma_{\mathrm{joint}}(v_j) = \sqrt{\sum_{i} \sigma_{ij}^{2}}$ above is the *independence baseline*: it holds when the parents' perturbations are uncorrelated. In practice, parallel parents downstream of a shared upstream node (as in System P, where multiple intake nodes are conditioned on the same user query) have correlated perturbations, and the independence form is an approximation. The $\pm 4\%$ path-product validation covers *sequential* multiplicativity, not reconvergent independence; we therefore flag the closed-form $\sigma_{\mathrm{joint}}$ as a limitation and recommend the full multivariate regression with interaction terms $\gamma_{ik}$ for reconvergent nodes, treating the closed form as a reference value to compare against the empirical transitive sensitivity $\hat{\sigma}(i,j)$.
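For reference, the independence baseline is a root-sum-of-squares over the incoming edge sensitivities. The sketch below computes only that baseline, under the uncorrelated-parents assumption stated above; the function name and the input values are hypothetical.

```python
from math import sqrt
from typing import Iterable


def joint_sensitivity(parent_sigmas: Iterable[float]) -> float:
    """Independence baseline: sqrt of the sum of squared edge sensitivities."""
    return sqrt(sum(s * s for s in parent_sigmas))


# Hypothetical incoming edge sensitivities for one reconvergent node.
baseline = joint_sensitivity([3.0, 4.0])
```

A measured joint effect well above or below this baseline is a hint that the parents' perturbations are correlated or that the interaction terms matter.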

### A.6 Loop-Augmented Pipelines (Definitions 7–8)

The loop body $L \subseteq V$ contains the nodes that execute at each iteration. At each iteration $t$, the loop controller selects an action $a^{(t)} \in \mathcal{A}$ that determines which subgraph is realized. The action set $\mathcal{A}$ is determined by the pipeline implementation:

- EXECUTE: invoke a tool or external agent and incorporate the result.
- RETRY: re-invoke discovery or retrieval nodes with targeted, gap-filling queries.
- GENERATE: produce intermediate artifacts (summaries, sub-answers).
- COMPOSE: synthesize a response from accumulated context.

Different actions realize different subgraphs within the iteration. A RETRY action adds retrieval nodes to the iteration's realized graph; an EXECUTE action adds tool execution nodes; a COMPOSE action routes directly to the response composer. The iteration topology $g^{(t)} = (a^{(t)}, Q^{(t)})$ captures both the action and its parameters (e.g., which queries were used for retry, which tool was invoked).

The loop terminates when a termination condition is met (e.g., the remaining task list is empty) or the maximum iteration count $k_{\max}$ is reached. The realized iteration count $k^{*}$ is itself a random variable that depends on all upstream outputs.

### A.7 Trajectory Divergence (Definition 9)

The three-component trajectory divergence $D(\tau,\tau')=(D_{\mathrm{iter}},D_{\mathrm{shape}},D_{\mathrm{output}})$ captures qualitatively different kinds of pipeline variation:

#### Value divergence ($D_{\mathrm{output}}>0$, $D_{\mathrm{shape}}=D_{\mathrm{iter}}=0$).

The pipeline took the same structural path but produced different output values. This is the most common form of variation and the only kind visible to end-to-end evaluation. Causes include LLM sampling variation, retrieval index updates, and upstream drift within the absorber regime of threshold-sensitive edges.

#### Structural divergence ($D_{\mathrm{shape}}>0$).

The pipeline executed different internal logic at one or more iterations: different actions chosen, different nodes activated, different tools invoked. The end-to-end output may be similar or dramatically different, but the *path* was different. This is invisible to evaluation methods that examine only inputs and outputs.

#### Extent divergence ($D_{\mathrm{iter}}>0$).

At least one node was invoked a different number of times across the two traces ($\sum_i |c_i-c_i'|>0$). The motivating case is loop-iteration divergence: the loop body $L$ runs $k^{*}$ times in one trace and $k^{*\prime}$ in the other, so every node $v_i\in L$ contributes $|k^{*}-k^{*\prime}|$ to the sum and entire subgraphs exist in one trace but not the other. The same indicator also fires in non-loop pipelines whenever sub-stage call counts differ (e.g., a reranker invoked twice vs. three times), without any subgraph being absent. This is the strongest form of structural divergence in the loop case and a per-node call-count divergence indicator in general.

#### Applicability to non-loop pipelines.

$D_{\mathrm{iter}}$ counts per-node call-count divergence, not only loop iteration counts; for pipelines without explicit loops it is zero whenever each node is invoked the same number of times across $\tau,\tau'$, but takes positive values when sub-stage call counts (e.g., reranker invocations, tool-execution counts) vary, as observed for System Q (23.7%) despite the absence of an architectural loop. $D_{\mathrm{shape}}$ similarly remains meaningful in the loop-free case: it captures differences in conditional branching decisions (e.g., whether a node was activated or skipped, whether a fast or slow path was taken); System Q exhibits 25.1% $D_{\mathrm{shape}}$ from such conditional routing.
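
The three components can be sketched for a single pair of traces as follows. Only the call-count sum for $D_{\mathrm{iter}}$ is taken from the text; the trace layout, the set-difference stand-in for $D_{\mathrm{shape}}$, and the exact-match output distance are simplifying assumptions made for illustration, not the paper's Definition 9.

```python
from collections import Counter

def exact_match_distance(a, b):
    """Placeholder output distance (assumption): 0.0 if equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0

def trajectory_divergence(trace_a, trace_b, output_distance=exact_match_distance):
    """Return (D_iter, D_shape, D_output) for two traces.

    Each trace is a dict with 'calls' (ordered list of invoked node names)
    and 'output' (the final pipeline output).
    """
    counts_a, counts_b = Counter(trace_a["calls"]), Counter(trace_b["calls"])
    nodes = set(counts_a) | set(counts_b)
    # D_iter: per-node call-count divergence, sum_i |c_i - c_i'|.
    d_iter = sum(abs(counts_a[n] - counts_b[n]) for n in nodes)
    # D_shape (assumed stand-in): 1 if the sets of activated nodes differ.
    d_shape = int(set(counts_a) != set(counts_b))
    d_output = output_distance(trace_a["output"], trace_b["output"])
    return d_iter, d_shape, d_output

# Hypothetical loop-free traces: the reranker runs twice in one, three times in the other.
trace_a = {"calls": ["planner", "retriever", "reranker", "reranker", "composer"],
           "output": "answer A"}
trace_b = {"calls": ["planner", "retriever", "reranker", "reranker", "reranker", "composer"],
           "output": "answer B"}
divergence = trajectory_divergence(trace_a, trace_b)
print(divergence)
```

In this example the call counts differ while the set of activated nodes is the same, which is the non-loop case described above.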

### A.8 Bifurcation Threshold (Definition 10)

Bifurcation thresholds identify the perturbation magnitudes at which the pipeline’s behavior transitions from continuous degradation to structural path change.

$\beta_{\mathrm{shape}}(v_i)$ is the smallest perturbation at node $v_i$ that causes any downstream structural divergence (different node activations, different branching decisions). $\beta_{\mathrm{iter}}(v_i)$ is the smallest perturbation that causes a different number of loop iterations.

#### Observational vs. interventional estimation.

These thresholds can be estimated in two complementary modes:

- Observational: From production traces under natural operating conditions. Identifies the proximate source of structural instability, that is, which node’s intrinsic variance is responsible for the observed structural divergence. In our experiments, observational estimation identified the planner as the sole bifurcation source ($\hat{\beta}_{\mathrm{shape}}=0.102$) while all other nodes showed $\hat{\beta}=0$ (structural divergence occurred even with zero drift at those nodes).
- Interventional: From controlled perturbation experiments. Identifies the causal threshold for externally-induced bifurcation. In our experiments, interventional estimation revealed a rewriter bifurcation threshold of $\hat{\beta}_{\mathrm{shape}}\approx 0.25$, dormant under natural conditions but activated when perturbation exceeds this magnitude.

The two modes answer different questions. Observational $\beta$ answers: “Under current operating conditions, which node is causing structural instability?” Interventional $\beta$ answers: “If I change this node’s prompt, how large a change can I make before the pipeline takes a structurally different path?” Practitioners need both: observational for monitoring, interventional for change management.
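
One plausible reading of the observational mode is sketched below: take the smallest drift at a node among pairs that diverged structurally. This estimator, the pair layout, and the numbers are assumptions for illustration; the paper's exact procedure may differ.

```python
def beta_shape_observational(pairs, node):
    """Smallest drift at `node` among structurally divergent pairs, or None if none diverged."""
    drifts = [pair["drift"][node] for pair in pairs if pair["d_shape"] > 0]
    return min(drifts) if drifts else None

# Hypothetical same-input pairs with per-node drift and a structural-divergence flag.
pairs = [
    {"drift": {"planner": 0.00, "rewriter": 0.00}, "d_shape": 0},
    {"drift": {"planner": 0.15, "rewriter": 0.00}, "d_shape": 1},
    {"drift": {"planner": 0.30, "rewriter": 0.20}, "d_shape": 1},
]
planner_beta = beta_shape_observational(pairs, "planner")
rewriter_beta = beta_shape_observational(pairs, "rewriter")
print(planner_beta, rewriter_beta)
```

Under this reading, a value of zero at a node means structural divergence was observed even when that node did not drift, so the node is not the proximate source.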

### A.9 Noise Origins

A node $v_i$ is a *noise origin* if it produces different outputs on identical inputs:

$$P(d_i>\epsilon \mid d_j=0\;\forall\,j\in\mathrm{pa}(i))>0 \tag{13}$$

Otherwise it is a *propagator*: its output varies only because its inputs varied. This partitioning is determined by examining trace pairs where all upstream outputs are byte-identical ($d_j\leq\epsilon$ for all parents $j$) and checking whether the node’s own output differs.

Noise origins are the root causes of system-level non-determinism. Even with temperature $=0$ and fixed random seeds, LLM API calls may exhibit backend non-determinism (batching, hardware floating-point variations, load balancing across replicas). The noise origin analysis distinguishes this intrinsic variance from propagated variance inherited from upstream.

This provides a complementary diagnostic to sensitivity analysis. The sensitivity matrix tells you how perturbations propagate; the noise origin analysis tells you where perturbations originate. Together, they provide a complete picture: noise originates at origin nodes, propagates through the graph along high-$\sigma$ edges, and potentially triggers structural bifurcation at nodes with low $\beta_{\mathrm{shape}}$.

This analysis is conceptually related to UProp’s [Duan et al., [2025](https://arxiv.org/html/2605.23956#bib.bib1)] decomposition of uncertainty into intrinsic and extrinsic components, but achieved through a simpler, more practical measurement: a binary partition based on upstream cleanliness, rather than mutual information estimation requiring token-level logprobs.

### A.10 Distribution Faithfulness (Principle 1)

Distribution faithfulness formalizes a necessary condition for valid per-node evaluation. Let $P_{\mathrm{prod}}(T_i)$ be the marginal distribution over node $v_i$’s output space induced by running the full pipeline on production inputs. Let $P_{\mathrm{eval}}(T_i)$ be the distribution of inputs used when evaluating any downstream node $v_j$ where $(v_i,v_j)\in E$.

A per-node evaluation is *distribution-faithful* with respect to edge $(v_i,v_j)$ if:

$$D_{\mathrm{KL}}(P_{\mathrm{prod}}(T_i)\,\|\,P_{\mathrm{eval}}(T_i))<\delta \tag{14}$$

The practical failure mode: if a golden evaluation dataset assumes clean, well-formed upstream outputs but the upstream node actually produces noisy, partially correct outputs 30% of the time, the evaluation overestimates the downstream node’s quality by testing it on inputs it never encounters in production.
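
For a single categorical field, the KL condition can be checked with a short sketch like the one below. The additive smoothing, the sample labels, and the function name are assumptions introduced here; the paper does not specify how the divergence is estimated.

```python
import math
from collections import Counter

def kl_divergence(prod_samples, eval_samples, smoothing=1e-6):
    """Estimate D_KL(P_prod || P_eval) for a categorical field with additive smoothing."""
    support = set(prod_samples) | set(eval_samples)
    prod_counts, eval_counts = Counter(prod_samples), Counter(eval_samples)

    def prob(counts, total, value):
        # Smoothing keeps the estimate finite when a value is absent from one sample.
        return (counts[value] + smoothing) / (total + smoothing * len(support))

    kl = 0.0
    for value in support:
        p = prob(prod_counts, len(prod_samples), value)
        q = prob(eval_counts, len(eval_samples), value)
        kl += p * math.log(p / q)
    return kl

# Hypothetical upstream outputs: production is partially correct 30% of the time,
# while the golden set contains only clean outputs.
production = ["clean"] * 7 + ["partial"] * 3
golden = ["clean"] * 10
gap = kl_divergence(production, golden)
print(f"estimated KL gap: {gap:.3f}")
```

The smoothing constant strongly affects the value when the golden set never contains a category seen in production, so the threshold $\delta$ would need to be chosen with the same smoothing in mind.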

The sensitivity matrix determines where distribution gaps matter most:

- A faithfulness gap $\Delta$ at a high-sensitivity edge ($\sigma_{ij}\gg 1$) produces a correspondingly amplified quality estimation error at downstream nodes.
- The same gap $\Delta$ at a low-sensitivity edge ($\sigma_{ij}\ll 1$) has minimal downstream impact.
- A uniform re-evaluation policy would prioritize nodes by gap magnitude alone; the sensitivity-weighted analysis identifies where gaps matter most.

Distribution faithfulness is measurable per\-field: the gap may vary dramatically across fields of the same node’s output\. Routing fields may be well\-characterized by golden data while context fields diverge substantially, or vice versa\. The per\-field breakdown, combined with the field weight assignments from Definition 2, determines the effective faithfulness gap\.

### A.11 Cross-Stage Regression Detection (Principle 2)

The impact set $\mathcal{I}(v_i,\alpha)=\{v_j:\sigma(p)>\alpha\text{ for any directed path }p\text{ from }v_i\text{ to }v_j\}$ converts the sensitivity matrix into a regression testing policy.
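
A minimal sketch of computing an impact set on a small DAG follows. It assumes that the path sensitivity $\sigma(p)$ is the product of edge sensitivities along $p$ and that a node qualifies if at least one path exceeds $\alpha$; the graph and its values are hypothetical.

```python
def impact_set(edges, source, alpha):
    """Nodes reachable from `source` by some path whose sensitivity product exceeds alpha.

    edges: dict mapping (u, v) -> sigma_uv on a directed acyclic graph.
    """
    children = {}
    for (u, v), sigma in edges.items():
        children.setdefault(u, []).append((v, sigma))

    impacted = set()

    def walk(node, path_sigma):
        for child, sigma in children.get(node, []):
            product = path_sigma * sigma
            if product > alpha:
                impacted.add(child)
            # Keep walking even below alpha: a later amplifier can lift the product again.
            walk(child, product)

    walk(source, 1.0)
    return impacted

# Hypothetical chain: an amplifier, then an absorber, then a strong amplifier.
edges = {("a", "b"): 2.0, ("b", "c"): 0.2, ("c", "d"): 3.0}
impacted = impact_set(edges, "a", alpha=1.0)
print(sorted(impacted))
```

The traversal enumerates every path, which is acceptable for pipelines with a handful of nodes but would need memoization or pruning on large graphs.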

#### Operational protocol\.

When the prompt or configuration of node $v_i$ is modified:

1. Measure $d_i$ on a held-out probe set before and after the change.
2. For each downstream edge $(v_i,v_j)$, compare $d_i$ against the drift budget $\tau_{ij}$ at the team’s chosen significance level $\alpha$.
3. If $d_i>\tau_{ij}$, re-run golden dataset evaluation for node $v_j$.
4. If $d_i\leq\tau_{ij}$, skip re-evaluation of $v_j$: the perturbation is within the safe budget.

This replaces the current industry practice of either re\-evaluating everything \(expensive and slow\) or re\-evaluating nothing downstream \(risky\) with a quantitatively grounded policy derived from the pipeline’s measured sensitivity profile\.
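
The protocol reduces to a per-edge comparison, which might be sketched as below. The node names, the budget values, and the use of `None` for a budget that is never reached are illustrative assumptions.

```python
def nodes_to_reevaluate(measured_drift, drift_budgets):
    """Downstream nodes whose drift budget is exceeded by the measured drift d_i.

    drift_budgets maps downstream node -> tau_ij; None marks an edge whose
    budget was never reached in the data (treated here as always safe).
    """
    return sorted(
        node
        for node, tau in drift_budgets.items()
        if tau is not None and measured_drift > tau
    )

# Hypothetical budgets on the edges leaving a modified rewriter node.
budgets = {"discovery": 0.05, "planner": 0.40, "composer": None}
to_rerun = nodes_to_reevaluate(0.20, budgets)
print(to_rerun)
```

Everything not returned is skipped, so the cost of a prompt change scales with the number of edges whose budget it actually exceeds.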

For pipelines with conditional loops, the impact set should additionally include any loop-body node where $\beta_{\mathrm{shape}}(v_i)$ is below the measured perturbation magnitude $d_i$. These are nodes where the change may trigger structural trajectory divergence, not merely continuous quality degradation.

#### Drift budgets\.

The per-edge drift budget $\tau_{ij}$ is defined as the smallest upstream drift threshold above which the downstream node is likely to exceed its own noise floor:

$$\tau_{ij}@\alpha=\min\{\tau : P(d_j>D_{\mathrm{noise}}(v_j)\mid d_i>\tau)\geq\alpha\} \tag{15}$$

Higher $\tau$ indicates a more robust edge. Drift budgets exhibited a wide range across edges in our experiments: from zero (any upstream drift at all triggers downstream risk) to effectively infinite (the downstream node is insensitive at any realistic drift magnitude). This variation confirms that a uniform regression-testing policy is suboptimal.

## Appendix B Estimation Procedure Details

### B.1 Observational Estimation of $\sigma_{ij}$

Given a corpus of $N$ production traces with logged intermediate outputs at every node, we estimate edge sensitivity via the following procedure:

1. Pair formation. Group traces by input similarity (same seed, same intent class, or similar query embedding). Within each group, form all pairwise combinations. For a corpus of $N$ traces with $S$ seeds and $R$ repeats per seed, this yields $S\binom{R}{2}$ pairs.
2. Distance computation. For each pair and each node $v_i$, compute the type-dispatched distance $d_i$ using Definition 2. Store the per-field distances alongside the aggregate for later analysis.
3. Sensitivity estimation. For each edge $(v_i,v_j)\in E$, collect all pairs where $d_i>\epsilon$ and compute $\hat{\sigma}_{ij}=\mathbb{E}[d_j/d_i]$ over those pairs.
4. Partial regression (for multi-parent nodes). When node $v_j$ has multiple parents, the simple ratio $d_j/d_i$ confounds the contributions of different parents. We estimate partial sensitivities via multivariate regression:

   $$d_j=\sum_{k\in\mathrm{pa}(j)}\alpha_k\,d_k+\sum_{k<l\in\mathrm{pa}(j)}\gamma_{kl}\,d_k\,d_l+\varepsilon \tag{16}$$

   The coefficient $\alpha_i$ is the partial sensitivity of $v_j$ to $v_i$, controlling for co-variation across other parents. The interaction terms $\gamma_{kl}$ capture pairwise reinforcing or compensating effects.
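
Steps 1 and 3 can be sketched in a few lines, assuming the per-pair distances from step 2 are already available as dictionaries keyed by node name. The data layout, node names, and values are hypothetical.

```python
from itertools import combinations
from statistics import mean

def form_pairs(traces_by_seed):
    """All within-group pairs: S groups of R repeats give S * C(R, 2) pairs."""
    return [pair
            for group in traces_by_seed.values()
            for pair in combinations(group, 2)]

def sigma_hat(pair_distances, upstream, downstream, eps=0.01):
    """Mean of d_j / d_i over pairs with d_i > eps, plus the qualifying pair count."""
    ratios = [d[downstream] / d[upstream]
              for d in pair_distances if d[upstream] > eps]
    return (mean(ratios), len(ratios)) if ratios else (None, 0)

# Hypothetical corpus: 2 seeds with 3 repeats each.
traces_by_seed = {"seed-1": ["t1", "t2", "t3"], "seed-2": ["t4", "t5", "t6"]}
trace_pairs = form_pairs(traces_by_seed)

# Hypothetical per-pair distances for one edge; the last pair is filtered out by eps.
pair_distances = [
    {"rewriter": 0.20, "discovery": 0.40},
    {"rewriter": 0.50, "discovery": 0.25},
    {"rewriter": 0.00, "discovery": 0.30},
]
estimate, n_pairs = sigma_hat(pair_distances, "rewriter", "discovery")
print(len(trace_pairs), estimate, n_pairs)
```

Returning the pair count alongside the estimate makes it easy to report sample sizes with every sensitivity value.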

#### Sample size\.

The partial regression requires $N$ sufficiently large relative to the number of parameters, which grows as $|\mathrm{pa}(j)|^2$: $|\mathrm{pa}(j)|$ main effects plus $\binom{|\mathrm{pa}(j)|}{2}$ pairwise interactions. For nodes with 5 parents, this is 15 parameters (5 main effects + 10 interactions), requiring on the order of hundreds of pairs for stable estimates. In our experiments, 1,500 traces (yielding 266–943 qualifying pairs per edge depending on $\epsilon$ filtering) provided stable estimates for System P with 6 edges and a maximum of 4 parents per node.
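
The parameter count can be checked by building one row of the regression design matrix: one column per parent plus one per unordered parent pair. The helper name and the drift values are hypothetical, and the fit itself is left to any least-squares routine.

```python
from itertools import combinations

def design_row(parent_drifts):
    """One regression row: main effects d_k followed by pairwise products d_k * d_l (k < l)."""
    main_effects = list(parent_drifts)
    interactions = [a * b for a, b in combinations(parent_drifts, 2)]
    return main_effects + interactions

# Hypothetical drifts at the five parents of one node for a single trace pair.
row = design_row([0.10, 0.00, 0.25, 0.40, 0.05])
print(len(row))
```

With $p$ parents the row has $p + \binom{p}{2} = p(p+1)/2$ entries, so the pair count needed for a stable fit rises quickly with fan-in.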

#### Epsilon threshold\.

The threshold $\epsilon$ avoids degenerate ratios when upstream outputs are nearly identical. We use a fixed $\epsilon=0.01$ throughout. When the upstream node’s noise floor $D_{\mathrm{noise}}(v_i)$ is very small (e.g., $<0.01$), few pairs qualify and estimates may have wide confidence intervals. We report sample sizes alongside all sensitivity estimates.

### B.2 Observational Estimation of $\lambda_{ij}$, $\tau_{ij}$, and $D_{\mathrm{noise}}$

#### Occurrence-lift.

For each edge, partition pairs into those with $d_i>\epsilon$ and those with $d_i\leq\epsilon$ (using the same per-type kernel floor as in $\hat{\sigma}_{ij}$). Compute the conditional probabilities $P(d_j>\epsilon\mid d_i>\epsilon)$ and $P(d_j>\epsilon\mid d_i\leq\epsilon)$; the difference is $\hat{\lambda}_{ij}$.
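
The occurrence-lift estimate is a difference of two empirical frequencies, as in this sketch. The data layout and node names are hypothetical, and returning `None` when either partition is empty is a choice made here, not one stated in the paper.

```python
def occurrence_lift(pair_distances, upstream, downstream, eps=0.01):
    """P(d_j > eps | d_i > eps) - P(d_j > eps | d_i <= eps), or None if a partition is empty."""
    drifted = [d for d in pair_distances if d[upstream] > eps]
    clean = [d for d in pair_distances if d[upstream] <= eps]
    if not drifted or not clean:
        return None
    p_given_drift = sum(d[downstream] > eps for d in drifted) / len(drifted)
    p_given_clean = sum(d[downstream] > eps for d in clean) / len(clean)
    return p_given_drift - p_given_clean

# Hypothetical per-pair distances for one edge.
pair_distances = [
    {"planner": 0.20, "composer": 0.30},
    {"planner": 0.40, "composer": 0.00},
    {"planner": 0.00, "composer": 0.00},
    {"planner": 0.00, "composer": 0.00},
]
lift = occurrence_lift(pair_distances, "planner", "composer")
print(lift)
```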

#### Drift budgets\.

For each edge and each significance level $\alpha$, sweep upstream drift thresholds $\tau$ and compute $P(d_j>D_{\mathrm{noise}}(v_j)\mid d_i>\tau)$. The smallest $\tau$ at which this probability reaches $\alpha$ is the drift budget $\tau_{ij}@\alpha$. When the probability never reaches $\alpha$ at any observed drift magnitude, the edge is effectively insensitive at that significance level (reported as “never” in our tables).
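
The sweep can be sketched as below. Using zero plus the observed upstream drifts as candidate thresholds is an assumption about the sweep grid, and the node names and values are hypothetical; `None` stands in for the “never” entry.

```python
def drift_budget(pair_distances, upstream, downstream, noise_floor, alpha):
    """Smallest threshold tau with P(d_j > noise_floor | d_i > tau) >= alpha, else None."""
    # Candidate thresholds (assumption): zero plus every observed upstream drift.
    candidates = sorted({0.0} | {d[upstream] for d in pair_distances})
    for tau in candidates:
        above = [d for d in pair_distances if d[upstream] > tau]
        if not above:
            break  # no pairs left beyond this threshold
        exceed_rate = sum(d[downstream] > noise_floor for d in above) / len(above)
        if exceed_rate >= alpha:
            return tau
    return None

# Hypothetical per-pair distances for one edge.
pair_distances = [
    {"rewriter": 0.0, "discovery": 0.0},
    {"rewriter": 0.1, "discovery": 0.0},
    {"rewriter": 0.2, "discovery": 0.5},
    {"rewriter": 0.3, "discovery": 0.6},
]
strict = drift_budget(pair_distances, "rewriter", "discovery", noise_floor=0.1, alpha=0.9)
loose = drift_budget(pair_distances, "rewriter", "discovery", noise_floor=0.1, alpha=0.5)
print(strict, loose)
```

A budget of zero means any upstream drift already pushes the downstream node past its noise floor at that level.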

#### Noise floor\.

$D_{\mathrm{noise}}(v_i)$ is the mean type-dispatched distance across same-input pairs: $D_{\mathrm{noise}}(v_i)=\mathbb{E}[d_i(o_i,o_i')]$ over pairs with identical pipeline inputs. This measures the intrinsic non-determinism of each node under the fixed determinism contract (temperature, seed).

### B.3 Interventional Estimation of $\beta_{\mathrm{shape}}$ and $\beta_{\mathrm{iter}}$

Bifurcation thresholds describe decision boundaries and cannot be reliably estimated from observational data alone, as the natural variation may not probe the boundary region. We therefore estimate them interventionally:

1. Boundary identification. From the observational corpus, identify traces near the bifurcation boundary, that is, traces where the loop controller nearly chose a different action or a conditional branch was marginal. Heuristics include: gap analysis that returned marginal results, confidence scores near decision thresholds, or categorical classifications with low model confidence.
2. Perturbation application. For each near-boundary trace and each target upstream node $v_i$, apply controlled perturbations of known magnitude. Perturbation strategies depend on the node’s output type:
   - Text outputs: paraphrase at varying temperatures, keyword addition/removal, prompt variant substitution.
   - Categorical outputs: flip to next-most-likely class.
   - List outputs: remove elements, reorder, add spurious elements.
   - Boolean outputs: flip the value.
3. Re-execution. Re-run the pipeline from the perturbed node forward, holding all other inputs fixed.
4. Threshold recording. Record the smallest perturbation magnitude $d_i$ at which $D_{\mathrm{shape}}>0$ or $D_{\mathrm{iter}}>0$.
5. Aggregation. Report $\hat{\beta}_{\mathrm{shape}}(v_i)$ as the minimum observed threshold across all near-boundary traces, with sample size and interquartile range.

#### Stratification\.

For perturbations that are binary \(categorical flips, boolean overrides\), the perturbation is either effective \(the baseline had a different value\) or a no\-op \(the baseline already matched the perturbed value\)\. We report these strata separately: the effective stratum gives the bifurcation rate when the perturbation actually changes something; the no\-op stratum serves as a built\-in negative control, where all divergence metrics should return to the noise floor\.

#### Limitations\.

The precision of $\hat{\beta}_{\mathrm{shape}}$ depends on the resolution of perturbation magnitudes available. Discrete perturbations (categorical flips, prompt variant swaps) provide binary probe points; continuous perturbations (paraphrasing at varying temperatures) provide finer resolution but are harder to control precisely. Our rewriter perturbation experiments used prompt variant swaps, yielding mean $d_{\mathrm{rewriter}}\in[0.696,0.754]$ with sparse coverage below 0.4. The reported threshold $\hat{\beta}_{\mathrm{shape}}\approx 0.25$ is therefore an upper bound; the true threshold may be lower.

### B.4 Noise Origin Classification

For each pair $(a,b)$ and each node $v_i$, we partition by whether every recorded upstream output is byte-identical ($d_j\leq\epsilon$ for all $j\in\mathrm{pa}(i)$):

- If $d_i>\epsilon$ when all upstream outputs are identical, $v_i$ is an *origin*: it generates variance intrinsically.
- If $d_i>\epsilon$ only when at least one upstream output differs, $v_i$ is a *propagator*: it inherits variance.
- If $v_i$ never has all-clean upstream (because at least one parent is always dirty), the classification is indeterminate from the available data. We report these as “always upstream-dirty” and note the drift rate when upstream is dirty.
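
The three-way partition above can be sketched as a small classifier over per-pair distances. The data layout, node names, and values are hypothetical, and the function reports only the label, not the drift rate.

```python
def classify_node(pair_distances, node, parents, eps=0.01):
    """Label `node` as 'origin', 'propagator', or 'always upstream-dirty'."""
    # Pairs in which every parent output is within eps (treated as identical).
    clean_upstream = [d for d in pair_distances
                      if all(d[parent] <= eps for parent in parents)]
    if not clean_upstream:
        return "always upstream-dirty"
    if any(d[node] > eps for d in clean_upstream):
        return "origin"
    return "propagator"

# Hypothetical per-pair distances along a three-node chain.
pair_distances = [
    {"rewriter": 0.00, "discovery": 0.20, "planner": 0.30},
    {"rewriter": 0.30, "discovery": 0.40, "planner": 0.10},
    {"rewriter": 0.00, "discovery": 0.00, "planner": 0.00},
]
labels = {
    "discovery": classify_node(pair_distances, "discovery", ["rewriter"]),
    "planner": classify_node(pair_distances, "planner", ["discovery"]),
}
print(labels)
```

A node with no parents has an empty parent list, so every pair counts as upstream-clean and any drift marks it as an origin.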

### B.5 Distribution Faithfulness Estimation

For each node $v_i$ with a corresponding golden evaluation field, we compute the per-field type-dispatched distance between each production trace output and the golden expected output. The per-node distribution gap is the mean across fields; the per-field gap identifies which specific output dimensions are well-characterized versus poorly-characterized by the golden data.

We use the same type\-dispatched kernel family as Definition 2; the specific instantiation \(text\-distance, list\-distance, etc\.\) is application\-dependent and orthogonal to the per\-node faithfulness gap\.

## Appendix C Capability Comparison with Related Work

Table [7](https://arxiv.org/html/2605.23956#A3.T7) compares capability coverage across the most closely related approaches to QUIVER. Symbols: ✓ = supported, $\sim$ = partial, × = not supported.

Table 7: Capability comparison with related work. QUIVER is the only approach providing structural bifurcation analysis and distribution faithfulness without requiring auxiliary models. The partial ($\sim$) for DSPy/TextGrad under *Typed distances* reflects that DSPy signatures *declare* field types and TextGrad supports textual feedback over typed I/O, but neither dispatches a type-appropriate distance kernel per field (categorical, set-valued, ordered-list, numeric, text) nor aggregates them into a per-edge $\sigma$; the type information is used for prompt scaffolding and parsing rather than for measurement.

## Appendix D Broader Impact

The framework provides diagnostic capability for production AI systems, enabling practitioners to identify and remediate fragile components before they cause user\-facing failures\. The sensitivity matrix could theoretically be used to identify vulnerable edges for adversarial exploitation, but this risk is minimal since the measurements require access to internal pipeline traces that external adversaries would not have\.

## Appendix E DSPy Multi-Hop QA: Cross-Retriever Sensitivity

Table [8](https://arxiv.org/html/2605.23956#A5.T8) reports $\hat{\sigma}$ and its decomposition (median ratio, intrinsic $d_j$ when $d_i\approx 0$) for every observed temporally-ordered edge in the DSPy multi-hop QA pipeline (Section [5.5](https://arxiv.org/html/2605.23956#S5.SS5)) under both retrievers: ColBERTv2 and BM25, both at 500 dev questions $\times$ 3 repeats $=$ 1,500 pairs (matched protocol; same pipeline and determinism contract). Figure [2](https://arxiv.org/html/2605.23956#A5.F2) plots the same data as side-by-side heatmaps for visual comparison.

The strong amplifiers ($\hat{\sigma}>2$) and strong absorbers ($\hat{\sigma}<0.5$) retain their sign and magnitude character across retrievers. Three near-unity edges ($|\hat{\sigma}-1|<0.4$) cross $\hat{\sigma}=1$ between retrievers (marked *flip*); these sit precisely in Definition 3’s most decision-sensitive regime. The decomposition isolates the mechanism even where classification agrees: on `gen_query_1` $\to$ `retrieve_1` (an amplifier under both retrievers, $\hat{\sigma}=1.05$ vs $1.37$), the intrinsic-noise term $d_{\text{retrieve\_1}}$ in pairs where $d_{\text{gen\_query\_1}}\approx 0$ is exactly $0.000$ under *both* retrievers, while the median ratio $d_{\text{retrieve\_1}}/d_{\text{gen\_query\_1}}$ jumps from $0.26$ (BM25) to $0.86$ (ColBERTv2): upstream-coupling sensitivity, not per-node noise.

Table 8: Cross-retriever $\hat{\sigma}$ decomposition. For each ordered edge, $\hat{\sigma}$, median ratio $d_j/d_i$ in drifted pairs, and intrinsic $d_j$ (mean $d_j$ in pairs where $d_i\approx 0$) under both retrievers. Bold $\hat{\sigma}$ values are amplifiers ($\hat{\sigma}>1$). Edges marked *flip* cross $\hat{\sigma}=1$ between retrievers; all are near-unity. $^{*}$ transitive (no direct edge).

![Refer to caption](https://arxiv.org/html/2605.23956v1/x3.png)

Figure 2: $\hat{\sigma}$ heatmap, BM25 vs ColBERTv2. Rows: upstream $v_i$; columns: downstream $v_j$. Color encodes $\hat{\sigma}_{ij}$ on a diverging scale centered at $\hat{\sigma}=1$ (white). Greyed cells are temporally infeasible ($v_i$ fires after $v_j$) or have insufficient data ($n(d_i>\epsilon)<30$). Strong amplifier and absorber cells (deep red, deep blue) are stable across retrievers; only near-unity cells shift color, namely the same three edges marked *flip* in Table [8](https://arxiv.org/html/2605.23956#A5.T8).

![Refer to caption](https://arxiv.org/html/2605.23956v1/x4.png)

Figure 3: Per-pair $(d_i,d_j)$ scatter for the most pronounced flipping edge. Each dot is one same-input pair. Dashed diagonal is $d_j=d_i$ ($\hat{\sigma}=1$); pairs above amplify, pairs below absorb. Grey dots on the y-axis ($d_i<\epsilon$) carry the intrinsic-noise term; both retrievers cluster these tightly at $d_j=0$, so the intrinsic noise is exactly $0.000$ in both cases. The cloud of drifted pairs ($d_i>\epsilon$) sits well below the diagonal under BM25 (median ratio $0.26$) but spreads diagonally under ColBERTv2 (median ratio $0.86$); the framework’s decomposition reads this directly as upstream-coupling sensitivity, not increased per-node noise.

## Appendix F Case Studies: From Framework Outputs to Production Changes

This appendix documents four production interventions on the systems studied in this paper, illustrating how QUIVER’s pipeline\-level signals become actionable for practitioners\. Three were directly driven by framework analysis \(Sections[F\.2](https://arxiv.org/html/2605.23956#A6.SS2),[F\.3](https://arxiv.org/html/2605.23956#A6.SS3), and[F\.4](https://arxiv.org/html/2605.23956#A6.SS4)\); one \(Section[F\.1](https://arxiv.org/html/2605.23956#A6.SS1)\) predates the framework’s formalism and is included because the patterns the framework now articulates were the operational signals that motivated its development\. Each case lists the framework finding, the operational response, the observed outcome, and what component\-level evaluation alone would have missed\.

### F.1 System P: Rewriter Variant Diversification

Background. QUIVER’s design was motivated in part by operational debugging on System P, where pipeline-level coupling (small rewriter changes producing disproportionate downstream drift and triggering unnecessary loop iterations) was an observable but unformalized pain point that node-level evaluation of the rewriter could not surface.

Operational response (predates framework). The team’s intervention, implemented before this paper’s $\hat{\sigma}$ formalism existed, was to modify the rewriter to produce multiple semantically distinct rewrites in parallel, with deduplication on the retrieved-fragment set. The intuition was that a portfolio of rewrites would average out the variance any single rewrite contributed.

Retrospective framework analysis. QUIVER’s $\hat{\sigma}$ matrix on the System P observational corpus (collected on the diversified-rewriter system) shows the rewriter retains a load-bearing, threshold-sensitive profile: $\hat{\sigma}(\text{rewriter}\to\text{discovery})=1.128$ with a bimodal ratio distribution (most pairs absorb, a minority tail amplifies). Under interventional rewriter perturbation, $D_{\mathrm{shape}}$ rises from 12.9% (observational) to 37.5% (interventional) and $D_{\mathrm{iter}}$ from 3.3% to 9.5%; single-prompt perturbation still bifurcates the execution path in roughly one in three pairs. The framework provides formal vocabulary for what the team had observed informally: rewriter variance couples upstream of a structural bifurcation, and diversification reduces single-point coupling.

What this case study illustrates. Component-level evaluation of the rewriter would have shown rewrites of acceptable quality. The pipeline-level signal, that rewriter variance bifurcates downstream execution paths, is invisible to isolated benchmarks. The framework’s $\hat{\sigma}$ classifies the edge, the $D_{\mathrm{shape}}$ rates quantify the bifurcation, and the bimodal threshold-sensitive pattern is now a well-defined diagnostic primitive applicable to other pipelines.

### F.2 System P: Resolving a Categorical-Flip Amplifier at Planner $\to$ Composer

Framework finding. On System P’s observational corpus, $\hat{\sigma}(\text{planner}\to\text{composer})=1.069$ with median ratio $0.537$: a bimodal amplifier where the majority of same-input pairs absorb (median well below mean) but a minority tail amplifies sharply. The planner was also the sole bifurcation source in the pipeline: $\hat{\beta}_{\mathrm{shape}}=0.102$ at the planner, $0$ at every other upstream node. Together these signals identified planner $\to$ composer as a categorical-flip amplifier: small planner output drift could push a routing-field categorical across its decision boundary, triggering a structurally different downstream composition path with end-user-visible irrecoverable consequences.

Operational response (driven by framework analysis). The framework’s diagnosis pointed to two complementary interventions. Architecturally, the team introduced a hybrid composer variant for the boundary case, a path that absorbs the categorical variance into a single, robust output rather than propagating it as an irrecoverable structural divergence. Operationally, the planner’s prompt was tuned to favor the boundary-case category more aggressively, reducing the rate at which a single planner-output difference triggered diametrically opposed composer behaviors.

Outcome. The categorical-flip mechanism was dampened: same-input pairs that previously routed through entirely different downstream paths more often share the hybrid path. End-user-visible irrecoverable differences across same-input runs were reduced.

What this case study illustrates. Component-level evaluation of the planner (e.g., intent-classification accuracy on a curated benchmark) would have shown a node performing its assigned task. The pipeline-level signal, that small classification flips at a structural boundary irrecoverably altered user-visible behavior, is exactly what $\hat{\sigma}$’s bimodal-amplifier pattern (mean above 1, median well below) and $\hat{\beta}_{\mathrm{shape}}$’s bifurcation lower bound are designed to surface. The framework flagged the problem; the architectural + prompt-level intervention addressed both its magnitude (the amplifier) and its structural mechanism (the bifurcation).

### F.3 System Q: Reranker Removal

Framework finding. On the System Q observational corpus, the Reranker had the highest $D_{\mathrm{noise}}$ of any node ($\approx 2\times$ the next-noisiest stage) while the downstream Reranker $\to$ Generator edge was a strong absorber ($\hat{\sigma}=0.341$, median ratio $0.303$). Upstream stages amplified heavily into it ($\hat{\sigma}(\text{Fast LLM}\to\text{Reranker})=9.018$, $\hat{\sigma}(\text{Slow LLM}\to\text{Reranker})=5.982$). The framework’s diagnostic profile: a high-noise node feeding a leaky absorber, a pattern in which the node consumes the lion’s share of pipeline variance with no commensurate downstream value evident.

Operational response (driven by framework analysis). The pattern flagged the Reranker as a candidate for ablation. The team ran the ablation, removing the Reranker from the pipeline, and it confirmed the framework’s hypothesis: quality did not regress.

Outcome. The Reranker was fully removed from the production pipeline. Generator output quality improved, with reduced variance and drift across same-input runs; end-to-end latency improved (one fewer model call per query).

What this case study illustrates. Component-level evaluation of the Reranker (rerank-quality benchmarks, MRR, NDCG on curated test sets) would have shown a node that performs its assigned task. The pipeline-level signal, that this node was the dominant variance source whose downstream contribution did not justify its instability, is a *coupling* property no isolated benchmark surfaces. QUIVER’s $D_{\mathrm{noise}}$ + $\hat{\sigma}$ pairing made the variance attribution explicit, the ablation confirmed it, and the production change followed.

### F.4 System P: Stale Golden Data Detected by Distribution Faithfulness

Framework finding. Computing the distribution faithfulness gap (Principle 1) on System P surfaced fragment-related golden fields with anomalously high gaps across three downstream nodes (discovery, planner, composer; see Table [6](https://arxiv.org/html/2605.23956#S5.T6)). The pattern was localized to fragment-reference fields rather than spread across all field types, suggesting a structural mismatch in evaluation-data coverage rather than model misalignment.

Operational diagnosis. Investigation traced the elevated gaps to a configuration change that had increased retrieval depth from 10 to 20 fragments without a corresponding update to the golden evaluation set; the golden expectations now covered at most half the current retrieval set by construction. End-to-end evaluation did not detect this drift, and per-node evaluation would have attributed the resulting score degradation to the nodes themselves rather than to stale evaluation data.

Outcome. The golden dataset was regenerated against the current retrieval depth, restoring per-node evaluation validity for the affected fields.

What this case study illustrates. Distribution faithfulness measurement at the per-node, per-field level surfaces evaluation invalidation that aggregate metrics cannot localize. A single configuration change can silently invalidate evaluation across multiple downstream nodes; the framework localizes the cause to specific field categories, distinguishing stale-evaluation artifacts from genuine model drift.
