Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

arXiv cs.AI Papers

Summary

This paper proposes Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent. It decomposes the global search into bounded sub-searches and shows convergence on a trading-backend workload where unbounded discovery fails.

arXiv:2605.20690v1 Announce Type: new Abstract: Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the paradigm to multi-system data backends surfaces a harder problem: the search space is heterogeneous, the verifier is whether a deployed stack actually runs, and composition knowledge is unevenly captured in pretraining. Unbounded agentic discovery, a coding agent iterating on failure-log feedback, fails to converge consistently on a working stack even when iteration and explicit composition knowledge are added. We propose Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent. The framework owns four typed contracts at successive layers (intent, operator DAG, per-system skills, runtime attribution) that decompose the global search into bounded sub-searches; sub-agents search each typed space, while the framework provides the channels by which knowledge flows forward as inline skill citations and errors route backward as typed signals. As a proof of life on a trading-backend workload, DDS converges where unbounded discovery does not; runtime failures become skill patches that the next deployment cites inline. We position this as an early prototype reporting lessons from real-world data-system composition.

# Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
Source: [https://arxiv.org/html/2605.20690](https://arxiv.org/html/2605.20690)
Shanshan Ye, Northeastern University, ye\.sha@northeastern\.edu; Duo Lu, Brown University, duo\_lu@brown\.edu

###### Abstract

Agentic discovery has shown that LLM\-driven search can find novel algorithms, designs, and code under benchmark conditions\. Translating the paradigm to multi\-system data backends surfaces a harder problem: the search space is heterogeneous, the verifier is whether a deployed stack actually runs, and composition knowledge is unevenly captured in pretraining\. *Unbounded agentic discovery*, a coding agent iterating on failure\-log feedback, fails to converge consistently on a working stack even when iteration and explicit composition knowledge are added\. We propose Declarative Data Services \(DDS\), an architecture for *structured agentic discovery* of data\-system compositions from declarative user intent\. The framework owns four typed contracts at successive layers \(intent, operator DAG, per\-system skills, runtime attribution\) that decompose the global search into bounded sub\-searches; sub\-agents search each typed space, while the framework provides the channels by which knowledge flows forward as inline skill citations and errors route backward as typed signals\. As a proof of life on a trading\-backend workload, DDS converges where unbounded discovery does not; runtime failures become skill patches that the next deployment cites inline\. We position this as an early prototype reporting lessons from real\-world data\-system composition\.

## 1 Introduction

Agentic discovery has made a fast transition from research to product\. AlphaEvolve\[[38](https://arxiv.org/html/2605.20690#bib.bib56)\]discovered novel algorithms by LLM\-driven evolutionary search; EvoX\[[35](https://arxiv.org/html/2605.20690#bib.bib45)\]and AdaEvolve\[[8](https://arxiv.org/html/2605.20690#bib.bib46)\]meta\-evolved the search strategies themselves for systems\-discovery problems; GEPA\[[1](https://arxiv.org/html/2605.20690#bib.bib34)\]optimized declarative LM\-module graphs by reflective natural\-language attribution; Glia\[[25](https://arxiv.org/html/2605.20690#bib.bib57)\]produced expert\-level distributed\-systems designs by multi\-agent reasoning; agentic coding tools such as Claude Code\[[3](https://arxiv.org/html/2605.20690#bib.bib31)\]moved single\-repository code generation into product\-grade workflows\. The common shape across these systems is*structured agentic search*: an LLM\-driven agent explores a typed solution space, a verifier evaluates each candidate, and the loop refines under structured feedback\.

Translating this paradigm to multi\-system data backends, the kind of stack a small team or individual without a dedicated data\-engineering function \(e\.g\., a solo trader, a small product team, a research group\) would compose to run a real workload, surfaces a different set of problems\. The scope is end\-to\-end: composing across queues, OLAP and OLTP stores, caches, search indices, and the connectors and configurations between them, then deploying that composition as a working stack and evolving it as the workload moves\. The search space is heterogeneous; the verifier is whether the deployed stack actually runs and meets declared SLOs, not a benchmark answer key; ground truth is partial; and composition knowledge that distinguishes a working stack from a broken one is unevenly captured in pretraining and changes with every release cycle\. Adjacent automation searches narrower spaces: infrastructure\-as\-code renders a chosen architecture, self\-driving DBs tune inside one product, and modern data\-stack tools declare within one pipeline stage\. The search above these \(which topology, which products, which configurations actually work together\) is the gap DDS targets\. The question is not whether agents can generate code, but how the structure around them should be designed: *how should typed abstractions and coding agents split responsibility to discover, deploy, and evolve data backends from user intent?*

![Refer to caption](https://arxiv.org/html/2605.20690v1/x1.png)

Figure 1: End\-to\-end view of DDS\. The user states intent in natural language with concrete constraints; DDS sits between intent and deployment as a structured\-discovery framework; the output is a multi\-system backend whose components specialize across the workload\.

Figure [1](https://arxiv.org/html/2605.20690#S1.F1) sketches our answer\. The framework owns typed contracts at four layers \(L1–L4\); sub\-agents search each layer’s typed sub\-space\. Each contract is *validatable*, *citable*, and *editable*, properties that let knowledge pass forward through the layers as the search progresses and let runtime errors route backward to the layer owning the violated decision; §[3](https://arxiv.org/html/2605.20690#S3) develops the architecture\.

We make three contributions\. \(i\) An architecture for structured agentic discovery of multi\-system data backends: a framework/agent split with four typed contracts at L1–L4, governed by two ownership rules \(the framework owns each layer’s contract and validation; sub\-agents own the bounded search inside that contract; §[3](https://arxiv.org/html/2605.20690#S3)\)\. \(ii\) Two architectural ideas that make discovery structured rather than unbounded: typed attribution at L4 routes every runtime signal to the layer owning the violated decision, and agent skills are the persistent memory where composition knowledge accumulates \(§[3](https://arxiv.org/html/2605.20690#S3), §[4\.3](https://arxiv.org/html/2605.20690#S4.SS3)\)\. \(iii\) A trading\-backend case study \(§[4](https://arxiv.org/html/2605.20690#S4), proof of life\), with skill\-content ablation and learning\-loop demonstration; a second domain \(chat, Appendix[D](https://arxiv.org/html/2605.20690#A4)\) describes operator\-algebra extensibility\.

## 2 Motivation and Scope

#### A running scenario\.

We anchor the discussion to the trader workload of Fig\. [1](https://arxiv.org/html/2605.20690#S1.F1) \(full intent in §[4](https://arxiv.org/html/2605.20690#S4)\): a real\-time analytics backend combining high\-throughput streaming ingest, multi\-year time\-series history, low\-latency operational lookups, and a small budget\. The user turns to a coding agent such as Claude Code; on a single repository the agent succeeds, but on this multi\-system stack it fails to converge consistently \(empirical evidence in §[4\.2](https://arxiv.org/html/2605.20690#S4.SS2) and Appendix [E](https://arxiv.org/html/2605.20690#A5)\)\. This is one instance of a broader class: agentic discovery for real\-world data\-system composition, where the search space is heterogeneous and the verifier is whether a deployed stack actually runs\.

#### Adjacent work covers slices, not structured discovery\.

Adjacent directions search narrower spaces than DDS targets\. Polystores\[[20](https://arxiv.org/html/2605.20690#bib.bib5)\]search at query time across an already\-composed set of stores, not the composition itself\. Self\-driving DBs\[[51](https://arxiv.org/html/2605.20690#bib.bib14)\]search the configuration space within one product, not across products\. Modern data\-stack tools \(dbt\[[18](https://arxiv.org/html/2605.20690#bib.bib48)\], Airbyte\[[2](https://arxiv.org/html/2605.20690#bib.bib49)\], Fivetran\[[24](https://arxiv.org/html/2605.20690#bib.bib50)\]\) offer declarative surfaces within one pipeline stage but do not compose cross\-stage dataflow\. Vendor data platforms \(Snowflake\[[16](https://arxiv.org/html/2605.20690#bib.bib12)\], Databricks Lakehouse\[[6](https://arxiv.org/html/2605.20690#bib.bib13)\]\) lock the user into one vendor’s product mix and are not neutral over the 400\+ database systems in production\[[17](https://arxiv.org/html/2605.20690#bib.bib53)\]\. Agent\-first data\-system redesign\[[37](https://arxiv.org/html/2605.20690#bib.bib20)\]redesigns databases*for*agents; the complementary direction, agent\-driven discovery of multi\-system backends from intent, is what DDS targets\. Infrastructure\-as\-code with LLMs \(Pulumi\-AI\[[44](https://arxiv.org/html/2605.20690#bib.bib51)\], Terraform\-AI\[[50](https://arxiv.org/html/2605.20690#bib.bib52)\], IaC\-Eval\[[29](https://arxiv.org/html/2605.20690#bib.bib30)\]\) renders a chosen architecture; we sit above that, with the IaC backend as a physical\-layer implementation under our L3–L4 contracts\.

#### Composition demands typed contracts, and composition knowledge demands persistent memory\.

User intent carries constraints an agent cannot reliably infer unaided, along six typed dimensions \(data model, access pattern, scale, latency, consistency, and cost; elaborated in §[3](https://arxiv.org/html/2605.20690#S3)\) that mirror the textbook view of a data\-intensive application\[[28](https://arxiv.org/html/2605.20690#bib.bib4)\]\. “One size fits all” is settled\[[49](https://arxiv.org/html/2605.20690#bib.bib3)\], so every realistic intent forces composition\. What agents need is not more context but typed contracts that bound the search: a validated intent, a type\-checked operator DAG, skill contracts per system, and layer\-attributed runtime signals\. Empirical work on agent reliability supports this\[[9](https://arxiv.org/html/2605.20690#bib.bib39),[40](https://arxiv.org/html/2605.20690#bib.bib40)\]: typed failures should route to the layer owning the violated decision rather than rely on free\-form coordination\. Composition knowledge is not just absent but unstable; connector configurations, recommended images, and version\-specific quirks change with every release\. A framework that hard\-codes composition rules ages out within a release; a framework whose composition knowledge lives in per\-system agent skills \(§[3](https://arxiv.org/html/2605.20690#S3)\), edited from typed runtime signals, keeps pace, as the learning\-loop \(§[4\.3](https://arxiv.org/html/2605.20690#S4.SS3)\) shows\.

## 3 The DDS Framework: Typed Contracts for Structured Agentic Discovery

![Refer to caption](https://arxiv.org/html/2605.20690v1/x2.png)

Figure 2: The four DDS layers \(L1–L4\), each carrying a typed contract\. L0 in the figure is an elicitation sub\-step within L1: it produces the draft intent that the L1 contract validates, without owning its own contract\. Framework\-owned contracts sit above each layer; sub\-agents do the work inside\. L4 routes runtime signals to the layer owning the violated decision\.

Declarative Data Services \(DDS\) is the structured\-discovery framework sketched in §[1](https://arxiv.org/html/2605.20690#S1)\. Four layers \(L1–L4; Fig\. [2](https://arxiv.org/html/2605.20690#S3.F2)\) each carry a typed contract\. The L1 contract begins with an elicitation sub\-step \(depicted as L0 in Fig\. [2](https://arxiv.org/html/2605.20690#S3.F2)\) in which a sub\-agent produces the draft intent that the L1 contract then validates; the sub\-step owns no separate contract because its output is the draft L1 commits to\. Host\-environment policy is a sub\-policy at L4 rather than a separate layer\. The framework owns the contract at each of L1–L4, sub\-agents search inside each typed sub\-space, and an attribution loop at the top routes every runtime signal back to the layer that owns the violated decision\. The four sub\-searches discover four different objects: a validated typed intent at L1, an SLO\-feasible operator topology at L2, a product composition that fits the topology and clears anti\-patterns at L3, and a deployable artifact at L4 \(with skill patches accumulating across deployments as the loop runs\)\. The framework makes these sub\-searches compose into one deployment that meets the original intent\.

The framework/agent split \(the two ownership rules\)\. A pure rule\-based framework cannot keep up with the breadth of product\-specific knowledge a real stack requires, since every system has its own dialect, connector, and operational quirks\. A pure prompt\-to\-code agent cannot enforce cross\-layer contracts or trace a failure to the decision that caused it\. DDS therefore commits to two ownership rules that hold at every layer L1–L4: *\(R1\) the framework owns the contract*, meaning its types, schemas, composition rules, and validation, including attribution of runtime signals to the layer that owns the violated decision; *\(R2\) the sub\-agent owns the bounded search*, meaning the unconstrained, knowledge\-intensive work that fits inside the contract \(intent elicitation, DAG synthesis, product selection, codegen and deployment\)\. The single global problem \(“build me a working multi\-system backend that meets this intent”\) decomposes into four sub\-searches over typed spaces, with the framework guaranteeing that the sub\-searches compose\. Each layer boundary carries a typed artifact \(Fig\. [2](https://arxiv.org/html/2605.20690#S3.F2)\) with three properties: it is *validatable* \(the framework rejects malformed inputs before codegen\), *citable* \(downstream sub\-agents quote specific fields in generated output\), and *editable* \(a small change at one layer propagates downstream without rewriting code\)\. Together these properties let knowledge pass forward through the layers as the search progresses, and let an L4 runtime signal land as a skill edit at L3 with no code change\.

L1: intent contract\. Natural language is ambiguous; code and config commit to a product before the user has stated the requirement\. An intent specification sits in between: a typed declaration over six dimensions \(data model, access pattern, scale, latency, consistency, and cost\)\. The framework checks well\-formedness before any sub\-agent writes code\. An elicitation sub\-agent \(the L0 step within L1; Fig\. [2](https://arxiv.org/html/2605.20690#S3.F2)\) produces the draft intent; the typed declaration is common ground between a non\-expert user and downstream sub\-agents\.
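To make the well-formedness check concrete, the following is a minimal Python sketch of a typed intent over the six dimensions, with hard errors separated from soft warnings. Every field name, the allowed consistency and cost-preference values, and the `validate_intent` function are assumptions made for illustration; the paper names the six dimensions but this section does not give the schema. The 100 events/sec and $300/mo figures are taken from the case study, while the latency bound is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical vocabularies; the paper names the six dimensions, not their values.
CONSISTENCY_LEVELS = {"eventual", "read_committed", "serializable"}
COST_PREFERENCES = {"simplicity", "lowest_cost", "performance"}


@dataclass
class Intent:
    data_model: str
    access_patterns: List[str]
    scale_events_per_sec: float
    latency_p95_ms: float
    consistency: str
    monthly_budget_usd: float
    cost_preference: Optional[str] = None


@dataclass
class Report:
    errors: List[str] = field(default_factory=list)    # hard: block codegen
    warnings: List[str] = field(default_factory=list)  # soft: proceed with a default

    @property
    def ok(self) -> bool:
        return not self.errors


def validate_intent(intent: Intent) -> Report:
    """Check well-formedness of a draft intent before any sub-agent writes code."""
    report = Report()
    if not intent.data_model:
        report.errors.append("data_model: required")
    if not intent.access_patterns:
        report.errors.append("access_patterns: at least one required")
    if intent.scale_events_per_sec <= 0:
        report.errors.append("scale: must be positive")
    if intent.latency_p95_ms <= 0:
        report.errors.append("latency: must be positive")
    if intent.consistency not in CONSISTENCY_LEVELS:
        report.errors.append(f"consistency: unknown level {intent.consistency!r}")
    if intent.monthly_budget_usd <= 0:
        report.errors.append("cost: budget must be positive")
    if intent.cost_preference is None:
        # Under-specified preference is a soft warning with a default, not an error.
        intent.cost_preference = "simplicity"
        report.warnings.append("cost: preference under-specified, defaulted to simplicity")
    elif intent.cost_preference not in COST_PREFERENCES:
        report.errors.append(f"cost: unknown preference {intent.cost_preference!r}")
    return report


draft = Intent("time_series", ["stream_ingest", "point_lookup"], 100, 5000, "eventual", 300)
report = validate_intent(draft)
print(report.ok, report.warnings)
```

Separating errors from warnings mirrors the distinction the case study relies on: a draft with only soft warnings proceeds to L2, while any hard error stops the pipeline before codegen.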

L2: operator DAG\. An agent can in principle jump from intent \(“low\-latency analytics over a stream”\) straight to product selection \(“use ClickHouse and Kafka”\) in one prompt, but doing so collapses two distinct decisions: *what topology* meets the workload, and *which products* fill each role\. A bad topology choice then invalidates every downstream product choice, and there is no clean way later to attribute a runtime failure to the topology versus the chosen system\. The design decision at L2 is to separate these by committing to topology before products: a typed operator graph over an open set \(`INGEST`, `STORE`, `TRANSFORM`, `SERVE`, `CACHE`, `QUEUE`, plus domain\-specific extensions like chat’s `ROUTE`, `NOTIFY`, `INDEX`; Appendix [D](https://arxiv.org/html/2605.20690#A4)\)\. The type system enforces three things the unstructured search cannot\. \(i\) Every declared access pattern has a query path: reachability from each `INGEST` to every declared `SERVE` is a contract, not a hope\. \(ii\) Per\-edge guarantees compose into end\-to\-end SLOs, and a DAG whose aggregates miss the L1 budget is rejected before any codegen happens, so pattern alternatives surface here as ranked candidates rather than as silent agent choices buried in generated code\. \(iii\) Pattern stays product\-agnostic, which is what lets skills at L3 evolve independently and what lets later runtime signals attribute cleanly to L2 \(the topology cannot meet the SLO\) versus L3 \(this product cannot meet the SLO under this configuration\)\. The current SLO algebra is small and conservative \(path latency sums, throughput minimums, consistency degrades to the weakest link\); richer rules are open work \(Appendix [H](https://arxiv.org/html/2605.20690#A8)\)\.
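The reachability contract and the three SLO composition rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the framework's implementation: the `Edge` type, the consistency labels and their ordering, the node names, and the choice to require every path (rather than at least one) to meet the budget are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical consistency labels, ordered weakest to strongest.
CONSISTENCY_ORDER = ["eventual", "read_committed", "serializable"]


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    latency_ms: float
    throughput_eps: float
    consistency: str


def find_paths(edges, start, goal, seen=()):
    """Enumerate simple paths from start to goal as lists of edges."""
    if start == goal:
        return [[]]
    visited = seen + (start,)
    paths = []
    for e in edges:
        if e.src == start and e.dst not in visited:
            for rest in find_paths(edges, e.dst, goal, visited):
                paths.append([e] + rest)
    return paths


def compose_slo(path):
    """Latency sums, throughput takes the minimum, consistency is the weakest link."""
    return {
        "latency_ms": sum(e.latency_ms for e in path),
        "throughput_eps": min(e.throughput_eps for e in path),
        "consistency": min((e.consistency for e in path), key=CONSISTENCY_ORDER.index),
    }


def check_dag(edges, ingests, serves, max_latency_ms, min_throughput_eps):
    """Return a list of contract violations; an empty list means the DAG is accepted."""
    violations = []
    for src in ingests:
        for dst in serves:
            paths = find_paths(edges, src, dst)
            if not paths:
                violations.append(f"unreachable: {src} -> {dst}")
            for path in paths:
                slo = compose_slo(path)
                if slo["latency_ms"] > max_latency_ms:
                    violations.append(f"latency: {src} -> {dst}")
                if slo["throughput_eps"] < min_throughput_eps:
                    violations.append(f"throughput: {src} -> {dst}")
    return violations


edges = [
    Edge("ingest", "queue", 5, 1000, "serializable"),
    Edge("queue", "transform", 20, 500, "read_committed"),
    Edge("transform", "serve", 50, 200, "eventual"),
]
print(compose_slo(edges))
print(check_dag(edges, ["ingest"], ["serve"], max_latency_ms=100, min_throughput_eps=100))
```

Because the check runs on operator roles rather than products, a rejection here can only mean the topology misses the L1 budget, which is what keeps a later L2-versus-L3 attribution clean.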

L3: skill contract\. The composition knowledge that distinguishes a working multi\-system stack from a broken one \(which connector, which configurations actually matter, and when *not* to use a system\) is scarce in pretraining: vendor docs do not document non\-fit, and anti\-patterns live in incident reports and team discussions\. This knowledge also changes with every release cycle, so a pretraining\-only agent ages out fast, and any fix expressed in a prompt evaporates with that prompt\. The design decision at L3 is to make composition knowledge a first\-class, persistent artifact of the framework rather than a transient instruction: an *agent skill* is a structured, reusable, editable artifact, one per system, with four blocks \(today materialized as YAML; Appendix [A](https://arxiv.org/html/2605.20690#A1)\): `capabilities`, `compositions`, `anti_patterns`, and `operational`\. The four\-block split aligns to change rates: `capabilities` ages slowly with the system itself, while the other three change at the rate of release cycles, incidents, and host\-environment churn, and are exactly the targets of runtime\-driven patches\. Two properties make L3 the *learning unit* of DDS\. \(i\) Persistent memory: a runtime signal can land as a skill edit, and the next deployment cites the patched line, so a fix made once carries forward to every future deployment of that system rather than evaporating with the prompt\. \(ii\) Inline traceability: each non\-obvious config decision in the deployed artifact is annotated downstream with the skill field that informed it, so the artifact carries the rule it was made to satisfy and a reviewer can audit each choice without rerunning the agent\.
The L3 planner sub\-agent uses this catalog to filter and rank candidate products against the validated DAG; anti\-pattern entries carry machine\-checkable structured fields \(severity, hard versus soft, plus matchers such as forbidden version ranges, type\-incompatible columns, or known\-bad operator pairings\) that the framework enforces during planning, so a hard anti\-pattern eliminates a candidate before any code is generated\.
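As a rough illustration of how structured anti-pattern fields could be enforced during planning, the sketch below filters candidates on hard anti-patterns and ranks the survivors by soft hits. The `AntiPattern` and `Candidate` types, the matcher encoding, the ranking rule, and the placeholder product names `store_a`, `store_b`, and `store_c` are assumptions; the actual skill schema is in the paper's Appendix A and is not reproduced here. Only two of the matcher kinds mentioned above (version ranges and operator pairings) are modeled.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

Version = Tuple[int, int]


@dataclass(frozen=True)
class AntiPattern:
    id: str
    severity: str  # "hard" eliminates a candidate; "soft" only lowers its rank
    # Half-open forbidden range [low, high) over (major, minor) versions.
    forbidden_versions: Optional[Tuple[Version, Version]] = None
    # Known-bad (upstream operator, this operator) pairings.
    bad_operator_pairs: FrozenSet[Tuple[str, str]] = frozenset()


@dataclass(frozen=True)
class Candidate:
    product: str
    version: Version
    operator: str
    upstream_operator: str


def matches(ap: AntiPattern, cand: Candidate) -> bool:
    """True if any structured matcher on the anti-pattern fires for this candidate."""
    if ap.forbidden_versions is not None:
        low, high = ap.forbidden_versions
        if low <= cand.version < high:
            return True
    return (cand.upstream_operator, cand.operator) in ap.bad_operator_pairs


def plan(candidates, skills):
    """Drop candidates that hit a hard anti-pattern; rank the rest by soft-hit count."""
    kept, eliminated = [], {}
    for cand in candidates:
        hits = [ap for ap in skills.get(cand.product, []) if matches(ap, cand)]
        hard = [ap.id for ap in hits if ap.severity == "hard"]
        if hard:
            eliminated[(cand.product, cand.version)] = hard
        else:
            kept.append((len(hits), cand))
    kept.sort(key=lambda pair: pair[0])  # stable: ties keep input order
    return [cand for _, cand in kept], eliminated


skills = {
    "store_a": [AntiPattern("ap-version", "hard", forbidden_versions=((2, 0), (2, 3)))],
    "store_b": [AntiPattern("ap-pairing", "soft",
                            bad_operator_pairs=frozenset({("QUEUE", "STORE")}))],
}
candidates = [
    Candidate("store_a", (2, 1), "STORE", "TRANSFORM"),
    Candidate("store_b", (1, 0), "STORE", "QUEUE"),
    Candidate("store_c", (5, 4), "STORE", "TRANSFORM"),
]
ranked, eliminated = plan(candidates, skills)
print([c.product for c in ranked], eliminated)
```

Recording the anti-pattern id alongside each eliminated candidate keeps the decision citable: a reviewer can see which skill entry removed a product without rerunning the planner.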

L4: attribution loop\. L4 deployment is the only layer that observes the running stack, so every runtime signal first lands here\. The design decision is to *type* each signal and route it back to the layer that owns the violated decision, rather than treat every failure as a generic agent error to retry\. This is the difference between structured and unbounded discovery: without attribution, a failure reads “the agent messed up,” the next iteration searches the global space again, and there is no traceable connection between the symptom and the decision that caused it; with attribution, a failure becomes a bounded edit at one layer, and the next iteration searches only the affected sub\-space\. The routing \(Table [1](https://arxiv.org/html/2605.20690#S3.T1), Fig\. [3](https://arxiv.org/html/2605.20690#S3.F3)\) pairs each signal class with a correction policy matched to who can sensibly act: the framework auto\-patches what it owns \(transient codegen errors, host\-environment policy entries\); reviewer\-in\-the\-loop edits land at L3 as skill patches that carry forward to every future deployment; L1 intent and L2 pattern edits are surfaced to the user rather than auto\-applied because they cross a contract boundary the framework cannot revise\. The classifier today is rule\-based over deployment outputs \(compose stderr, container exit codes and health\-check states, container logs, smoke\-verifier output\); empirical evaluation is reported in §[4\.3](https://arxiv.org/html/2605.20690#S4.SS3), and out\-of\-class behavior and a learned alternative to the rule\-based classifier are future work \(Appendix [H](https://arxiv.org/html/2605.20690#A8)\)\.

Table 1: Runtime signals are typed and routed to the layer that owns the violated decision \(cf\. Fig\. [3](https://arxiv.org/html/2605.20690#S3.F3)\)\. Concrete signals from the trading case study and the skill patches that close them are in §[4\.3](https://arxiv.org/html/2605.20690#S4.SS3)\. Some signals \(consumer lag, p99 violation\) are genuinely ambiguous between L2 and L3; the harness accepts either label and the open problem of a confidence model is discussed in Appendix [H](https://arxiv.org/html/2605.20690#A8)\.

![Refer to caption](https://arxiv.org/html/2605.20690v1/x3.png)

Figure 3: The L4 attribution loop: deploy, observe, attribute, patch\. Each runtime signal is routed to the layer that owns the violated decision; the edit unit and policy \(framework auto, reviewer\-in\-the\-loop, or user\-decided\) are set by that layer\.

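A rule-based attribution step of the kind described above might look like the following Python sketch. The rule patterns, the log fragments they match, and the specific signal-to-layer assignments are hypothetical, since the body of Table 1 is not reproduced in this text. What is taken from the paper is the shape: each signal maps to an owning layer, the layer fixes the correction policy, and some signals (consumer lag, p99 violation) are allowed to carry both an L2 and an L3 label.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Route:
    layer: str   # "L1" .. "L4"
    policy: str  # who is allowed to apply the edit


# Policy is determined by the owning layer, following the routing described above.
POLICY_BY_LAYER = {
    "L1": "user_decided",
    "L2": "user_decided",
    "L3": "reviewer_in_the_loop",
    "L4": "framework_auto",
}

# Hypothetical rules: (pattern over one line of deployment output, candidate layers).
RULES = [
    (re.compile(r"port is already allocated"), ("L4",)),     # host-environment policy
    (re.compile(r"smoke_query_error|smoke_empty"), ("L4",)),  # smoke-verifier output
    (re.compile(r"container \S+ is unhealthy"), ("L3",)),     # product configuration
    (re.compile(r"consumer lag|p99 violation"), ("L2", "L3")),  # ambiguous signals
]


def attribute(line: str) -> List[Route]:
    """Return the candidate routes for one signal; an empty list means out-of-class."""
    for pattern, layers in RULES:
        if pattern.search(line):
            return [Route(layer, POLICY_BY_LAYER[layer]) for layer in layers]
    return []


for sample in [
    "Bind for 0.0.0.0:5432 failed: port is already allocated",
    "consumer lag above threshold on topic trades",
    "some message no rule recognizes",
]:
    print(sample, "->", attribute(sample))
```

Returning a list rather than a single label leaves room for the ambiguous cases and for an explicit out-of-class result, both of which the paper flags as open work.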
## 4 Case Study: Building a Trading Backend with DDS

Setup\. The case study builds an end\-to\-end data backend for a solo\-trader analytics workload from a single intent specification covering streaming market\-data ingest, time\-series aggregation, low\-latency lookups, and a hobbyist budget; the full intent is in Appendix Fig\. [6](https://arxiv.org/html/2605.20690#A6.F6)\. We evaluate each generated deployment at three tiered levels; a fourth level, T3 \(declared SLOs hold under load\), is deferred to §[5](https://arxiv.org/html/2605.20690#S5):

- T0: all generated artifacts have valid syntax;
- T1: the stack boots to steady state under `docker compose up` and all healthchecks pass;
- T2: a smoke query returns expected rows end\-to\-end\.

We report three things: an end\-to\-end DDS walkthrough \(§[4\.1](https://arxiv.org/html/2605.20690#S4.SS1)\); a headline comparison of unbounded\-discovery baselines against DDS plus a skill\-content ablation that isolates which parts of the skill artifact carry the win \(§[4\.2](https://arxiv.org/html/2605.20690#S4.SS2)\); and the L4 attribution loop on real failures and controlled fault injection \(§[4\.3](https://arxiv.org/html/2605.20690#S4.SS3)\)\. All runs use the same Claude model \(claude\-opus\-4\-6\) and tool access; only the framework surface and iteration regime vary across conditions\. A second case study on a chat platform that stresses operator\-algebra extensibility is in Appendix[D](https://arxiv.org/html/2605.20690#A4)\.

### 4\.1 End\-to\-end walk: intent to live data

Intent through deployment\. The elicitation sub\-agent produces a typed intent that populates all six L1 dimensions; framework validation at L1 emits one soft warning \(cost preference under\-specified, defaulted to *simplicity*\) and no hard errors\. The planner sub\-agent at L2 synthesizes the operator DAG: `INGEST` → `QUEUE` → `TRANSFORM` → \{`STORE` \(analytics\), `STORE` \(operational\), `CACHE`\}, and the framework type\-checks each edge and the SLO composability of each path\. At L3, the planner sub\-agent selects Kafka \(QUEUE\), ClickHouse \(STORE analytics, with Kafka Engine and an OHLCV materialized view\), PostgreSQL \(STORE operational, positions table\), and Redis \(CACHE, hot state\)\. The framework validates each candidate against the skill’s anti\-patterns and each pair against the adjacent skill’s composition rules; the plan fits the declared $300/mo envelope\. At L4, the deployment sub\-agent receives a structured brief \(what to generate, which skill fields to cite, what checks must pass\), emits the artifacts in Table [2](https://arxiv.org/html/2605.20690#S4.T2), and runs `docker compose up -d`\.

Boot, smoke, and live data\. A single representative DDS one\-shot run \(one of the 10 reported in Appendix [B](https://arxiv.org/html/2605.20690#A2)\) passes T0, T1, and T2 \(the smoke query returns end\-to\-end rows\); the generated deployment comprises roughly 1,100 lines across six services\. To show that a T1\-passing stack is not a Potemkin deployment, Fig\. [4](https://arxiv.org/html/2605.20690#S4.F4) reports one 10\-min window in which a live public exchange feed \(Coinbase spot, 20 USD pairs\) is plugged into a DDS\-generated stack at the Kafka ingress\. \(The L4 sub\-agent generated Binance\-targeted producers in Table [2](https://arxiv.org/html/2605.20690#S4.T2); for the proof of life we substitute Coinbase because its public feed needs no authentication\.\) Trades land in the ClickHouse raw table within 1 s of producer connect; the OHLCV materialized view emits one row per \(symbol, minute\); ingest tracks the exchange’s bursty traffic \(mean 15\.9 msg/s\); end\-to\-end latency from exchange timestamp to ClickHouse\-observable row holds at roughly 3 s p50 / 4\.7 s p95 across the window\. Nothing is exercised beyond what the DDS\-planned topology already provides\. We report this run as a proof of life, not a scale test: the 10\-min window runs at a single\-exchange public\-feed rate \(mean 15\.9 msg/s\), below the declared sustained target of 100 events/sec in the intent \(Fig\. [6](https://arxiv.org/html/2605.20690#A6.F6), representing multi\-exchange aggregation\); validating declared SLOs under sustained load \(T3\), together with correctness and resilience probes, is deferred to the L5 evaluator \(§[5](https://arxiv.org/html/2605.20690#S5), Appendix [G](https://arxiv.org/html/2605.20690#A7)\)\.

![Refer to caption](https://arxiv.org/html/2605.20690v1/x4.png)

Figure 4: Live\-data proof of life on a DDS\-generated stack\. A Coinbase public WebSocket feed \(20 USD pairs\) is plugged into the running stack at time T₀\. Top: cumulative raw trades \(blue\) and OHLCV 1\-minute buckets \(red\); the red staircase is the `ohlcv_1m` materialized view emitting one row per \(symbol, minute\)\. Bottom: 10\-second rolling ingest rate \(green\) tracks the exchange’s bursty traffic; end\-to\-end latency p50 \(black\) and p95 \(dashed\) stay bounded across the full window\.

Table 2: Artifacts generated by a representative DDS run \(T0, T1, T2 all pass\)\.

Structured Discovery Yields a Working Backend in One Cycle\. Starting from a user\-declared intent, DDS searches the four typed sub\-spaces in a single deployment cycle, producing ∼1,100 lines across six services that boot to steady state, pass the end\-to\-end smoke query, and sustain a live exchange feed for 10 min\.

### 4\.2 Where the win comes from

DDS converges consistently; iterated raw agents do not\. We compare four conditions on the same trading intent, using the same Claude model and the same tool access \(Table [3](https://arxiv.org/html/2605.20690#S4.T3)\)\. Conditions A, B, and C are versions of *unbounded agentic discovery* with progressively more knowledge in the prompt; DDS is *structured agentic discovery* with typed contracts at all four layers\. All four conditions use the same 5\-iteration outer feedback loop modeled on real\-world Claude\-Code usage: after each codegen attempt, the harness runs T0/T1/T2 acceptance and feeds the failure log back to the agent for editing in the next iteration\. This eliminates the trivially\-fixable single\-shot failure modes \(host\-port conflicts, missing init files\) and tests whether iteration alone is sufficient for a determined user without a framework\. Conditions vary only in iter\-1 prompt content: A is a natural\-language requirements prompt with no system guidance \(free\-form search over systems and topology\); B adds the explicit list of systems to use \(Kafka, ClickHouse, PostgreSQL, Redis\), narrowing the system catalog; C additionally pastes the four DDS skill YAMLs as prose into the prompt, loading composition knowledge into context without typed channels for it to flow forward \(knowledge\-loaded but still unbounded\)\.

Condition A \(n=10\) reached T1 2/10 at median $5\.76, with median 268 total turns across the 5 iterations; six of eight T1 failures hit `iterations_exhausted` after 5 rounds \(per\-run trajectories in Appendix [E](https://arxiv.org/html/2605.20690#A5)\)\.

Condition B \(n=10\) reached T1 3/10 at median $6\.23 with median 209 turns: telling the agent which systems to use marginally improves boot rate but does not close the gap\.

Condition C \(n=10\) reached T1 6/10 and T2 4/10 at median $5\.06 with median 229 turns: pasting the four skill YAMLs as prose closes most of the rate gap relative to A and B, confirming that skill content carries real value, but at ∼5× the median turn cost of DDS\+iter\.

DDS \+ 5\-iter \(n=10, same outer feedback loop as A/B/C\) reached T0, T1, and T2 at 10/10, 10/10, 10/10 at median $1\.94 with median 44 turns; every run terminated with `stop_reason=passed`, all ten by iter 4 with median 2 iterations to T1\. Same model, same tools, same iteration scope; the delta is the contract surface that focuses each iteration’s edits, not iteration itself\. For reference, DDS without the outer loop already reached T1 8/10 at $1\.49 in median 17\.5 turns \(Appendix [B](https://arxiv.org/html/2605.20690#A2)\): contracts alone do most of the work; the outer loop closes the remaining T2 gap\.
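The outer feedback loop shared by all four conditions can be sketched as follows. This is a minimal Python illustration, not the authors' harness: `generate` and `accept` are hypothetical callables standing in for the coding agent and the T0/T1/T2 acceptance checks, and the return structure is invented for the example. The stop reasons `passed` and `iterations_exhausted` and the 5-iteration cap are the ones reported in the text.

```python
from typing import Callable, Dict, Optional, Tuple

TIERS = ("T0", "T1", "T2")  # syntax valid, stack boots healthy, smoke query returns rows


def first_failed_tier(results: Dict[str, bool]) -> Optional[str]:
    """Tiers are checked in order; a missing entry counts as a failure."""
    for tier in TIERS:
        if not results.get(tier, False):
            return tier
    return None


def outer_loop(
    generate: Callable[[Optional[str]], object],
    accept: Callable[[object], Tuple[Dict[str, bool], str]],
    max_iters: int = 5,
) -> Dict[str, object]:
    """Generate, run acceptance, and feed the failure log into the next attempt."""
    feedback = None
    for iteration in range(1, max_iters + 1):
        artifact = generate(feedback)
        results, log = accept(artifact)
        failed = first_failed_tier(results)
        if failed is None:
            return {"stop_reason": "passed", "iterations": iteration}
        feedback = f"{failed} failed: {log}"
    return {"stop_reason": "iterations_exhausted", "iterations": max_iters}


# Stub agent and harness: the second attempt fixes what the first failure log reported.
def stub_generate(feedback):
    return "v1" if feedback is None else "v2"


def stub_accept(artifact):
    if artifact == "v1":
        return {"T0": True, "T1": False}, "healthcheck failed"
    return {"T0": True, "T1": True, "T2": True}, ""


print(outer_loop(stub_generate, stub_accept))
```

The loop itself is identical across conditions; what differs is how much structure the feedback string carries, which is the variable the comparison isolates.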

Why unbounded discovery fails to converge\. Two structural patterns recur across the unbounded\-discovery baselines \(per\-run trajectories in Appendix [E](https://arxiv.org/html/2605.20690#A5)\): without an L2 pin on topology, a downstream failure can trigger an architectural rewrite that regresses an earlier\-passing tier \(the search re\-enters the global space rather than the local one\); and when the failure log surfaces a symptom without identifying a layer, the agent guesses at the wrong sub\-search and iteration loops without progress\. Structured discovery prevents both, because L1/L2 fix the topology before any codegen and L4 attribution routes each typed signal to the layer that owns the violated decision\.

The outer loop closes the smoke\-timing gap\. Without an outer T1/T2 retry, DDS one\-shot reaches T2 in only 3/10 runs, and the five T1\-but\-not\-T2 runs share one root cause: materialized\-view population timing combined with agent non\-determinism in MV column naming\. The 5\-iter outer loop is a separate n=10 batch \(Appendix [C](https://arxiv.org/html/2605.20690#A3)\); in that batch every run reaches T2 at iter 4 or earlier, and the two failure classes that drag T2 down in one\-shot \(`smoke_query_error`, `smoke_empty`\) close after at most one extra iteration\. The first\-iteration T1 rate in the iterated batch \(2/10\) is lower than one\-shot’s 8/10 under nominally the same model and tools, a sampling gap at n=10 that a larger sample would resolve\. This is the L4 attribution loop compressed to a single run: a typed runtime signal at iteration i becomes a focused L4 edit at iteration i\+1, no skill patch required\.

Table 3: Headline comparison on the trading intent\. Same Claude model and tool access throughout\. A, B, and C use a 5\-iteration outer feedback loop \(T0/T1/T2 acceptance, failure\-log fed to the next iter; max 5 iters/run\)\. The two DDS rows isolate that outer loop: *one\-shot L4* uses the internal T0 acceptance loop only; *\+ 5\-iter loop* adds the same outer loop\. Per\-run detail in Appendices [E](https://arxiv.org/html/2605.20690#A5) \(A\), [B](https://arxiv.org/html/2605.20690#A2) \(DDS one\-shot\), and [C](https://arxiv.org/html/2605.20690#A3) \(DDS \+ 5\-iter\)\.

Inside the skill contract: operational and anti\-pattern content does the heavy lifting\. The skill ablation holds the framework and intent fixed and varies the content of the agent skills across three settings, n=3 runs per variant \(Table [4](https://arxiv.org/html/2605.20690#S4.T4)\)\. The full variant \(all four blocks\) reaches T1 2/3 at median $1\.36\. Ops\-stripped \(drop `operational` and `anti_patterns`, keep `capabilities` and `compositions`\) drops to 0/3 at $1\.11\. Minimal \(keep only `capabilities`\) is also 0/3 at $0\.65\. Code\-quality probes localize the damage: dropping the operational block removes the Kafka recommended\-image field \(full: 3/3 use the recommended image; ops\-stripped: 0/3\) and the PostgreSQL host\-port\-conflict remap \(full: 3/3 remap; ops\-stripped: 0/3\)\. These are exactly the fields whose patches close the first\-deployment failures in §[4\.3](https://arxiv.org/html/2605.20690#S4.SS3)\. Composition content helps a little; capabilities content is largely redundant with pretraining\. Cost rises with skill content because a more complete skill drives a more complete artifact: the full variant emits the entire stack and pays for it, while stripped variants give up partway and produce nothing that boots\. Sample size here is small \(n=3\); a larger replication is left to future work\.
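The three ablation settings can be read as projections of a skill onto a subset of its four blocks. The following is a minimal sketch, assuming a skill is held as a plain dictionary keyed by block name as in Appendix A; `VARIANTS`, `strip_skill`, and `example_skill` are hypothetical names introduced for illustration and are not taken from the DDS implementation.

```python
from copy import deepcopy

# Block names follow the ablation description; helper names are illustrative only.
ALL_BLOCKS = ("capabilities", "operational", "anti_patterns", "compositions")

VARIANTS = {
    "full": set(ALL_BLOCKS),
    "ops_stripped": {"capabilities", "compositions"},
    "minimal": {"capabilities"},
}


def strip_skill(skill: dict, variant: str) -> dict:
    """Return a copy of `skill` keeping only the blocks the variant allows.

    Keys that are not one of the four blocks (system, version, ...) are kept.
    """
    keep = VARIANTS[variant]
    return {
        key: deepcopy(value)
        for key, value in skill.items()
        if key not in ALL_BLOCKS or key in keep
    }


# Placeholder skill; the image string is deliberately not a real image name.
example_skill = {
    "system": "kafka",
    "capabilities": {"access_patterns": ["streaming"]},
    "operational": {"recommended_images": ["<image from the skill file>"]},
    "anti_patterns": [],
    "compositions": [],
}
```

Under this reading, the ops-stripped variant differs from the full one only by the absence of the `operational` and `anti_patterns` keys, which is what lets the ablation attribute the T1 drop to those two blocks.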

Table 4: Skill\-content ablation: framework, intent, and tools held constant; only the contents of the agent skills vary\. Single\-shot DDS \(no outer iteration loop\), n=3 runs per variant\. Operational and anti\-pattern content does the heavy lifting; capabilities content is largely redundant with pretraining\.

Structure Carries the Win, Not Iteration\. The same skill content wins inside DDS and loses inside an unbounded prompt: the typed channel, not the knowledge, is what turns iteration into a layer\-bounded edit\.

### 4\.3 Closing the L4 loop

Attribution closes the loop end\-to\-end on the fault classes that arise in practice\. The fault\-injection harness exercises the full L4 pipeline \(signal collection from compose stderr and container logs, rule\-based classification, layer routing\) on five fault classes with four instances each, n=20 total \(Table [5](https://arxiv.org/html/2605.20690#S4.T5)\)\. Four of the classes are exactly those that arose in the learning\-loop experiment below: image\-pull failure, host\-port conflict, library\-missing, and DDL\-type constraint match F1–F4 in Table [6](https://arxiv.org/html/2605.20690#S4.T6)\. The fifth, consumer\-lag, is the canonical L2\-versus\-L3 ambiguous case from Table [1](https://arxiv.org/html/2605.20690#S3.T1), where the harness accepts either label\. Each instance injects a known\-cause fault into a known\-good artifact, deploys, and compares the predicted layer to ground truth\. The pipeline matches ground truth on 20/20, mirroring the outcome on the uncontrolled F1–F4 failures: every signal lands at the layer where its skill patch is applied\. The hand\-authored rule set covers the classes encountered so far; replacing it with a classifier learned from the attribution log itself is the natural next step \(Appendix [H](https://arxiv.org/html/2605.20690#A8)\)\.
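To make the classification and routing steps concrete, the following is a minimal sketch of a rule-based attributor over log lines. It is an illustration under assumptions, not the paper's rule set: the regular expressions are guesses at typical log text, and `Rule`, `RULES`, and `attribute` are hypothetical names. The layer labels follow the attributions reported in this section (three L3 skill gaps, one profile-level host-port fault, and consumer-lag accepted as either L2 or L3).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    fault_class: str
    pattern: re.Pattern
    layers: frozenset  # layers accepted as the owner of this fault class


# Patterns are illustrative assumptions about log text, not the authors' rules.
RULES = [
    Rule("image_pull_failure",
         re.compile(r"manifest unknown|pull access denied", re.I),
         frozenset({"L3"})),
    Rule("host_port_conflict",
         re.compile(r"port is already allocated|address already in use", re.I),
         frozenset({"profile"})),
    Rule("library_missing",
         re.compile(r"ModuleNotFoundError|No module named", re.I),
         frozenset({"L3"})),
    Rule("ddl_type_constraint",
         re.compile(r"TTL .*(DateTime|Date)", re.I),
         frozenset({"L3"})),
    Rule("consumer_lag",
         re.compile(r"consumer lag", re.I),
         frozenset({"L2", "L3"})),  # ambiguous class: either label is accepted
]


def attribute(log_line: str):
    """Return (fault_class, accepted_layers) for the first matching rule, else None."""
    for rule in RULES:
        if rule.pattern.search(log_line):
            return rule.fault_class, rule.layers
    return None
```

A first-match rule list like this returns `None` for signals outside the enumerated classes, which is one simple way to keep out-of-class signals from being routed to a layer by accident.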

Table 5: Attribution accuracy on five controlled fault classes\. Accuracy is scoped to the enumerated classes; consumer\-lag is labeled ambiguous in the harness and accepts L2 or L3 answers\.

Real failures become skill patches, not prompts to retry\. The learning\-loop experiment ran DDS twice on the trading intent\. The first deployment surfaced four distinct runtime failures \(Table [6](https://arxiv.org/html/2605.20690#S4.T6)\), each attributed and patched accordingly\. Three failures were labeled L3 skill gaps: a missing recommended image for Kafka; missing Python extras for the producer; and a missing anti\-pattern for `TTL` on `DateTime64` in ClickHouse\. One was labeled profile\-level \(host\-port\)\. Remediation consisted of skill\-only patches to three skill files plus one profile entry; no code, no prompt, and no architecture changes\. The second deployment fixed 4/4 with 0 regressions \(Table [6](https://arxiv.org/html/2605.20690#S4.T6)\), and the patched fields are cited inline in the generated artifacts\. One new failure of the same class as F2 surfaced during the second deployment in a different host environment and was patched the same way, showing that the loop absorbs fresh signals without code change\.

Table 6: First\-deployment failures on the trading backend and their layer attribution\. All four observed at L4; three attribute to L3 skill gaps\. The second deployment fixed 4/4 with skill patches\.

Runtime Failures Land as Patches\. Four real first\-deployment failures \(F1–F4\) attribute to L3 or profile; skill\-only edits fix 4/4 on the second deployment with no code or prompt change, and the patched fields appear as inline citations in the next deployment’s artifacts\.

## 5 Discussion and Open Problems

Roadmap: an L5 evaluator peer layer\. The most important next step is an *evaluator* peer layer \(L5\) that would ask whether a deployed stack actually meets the declared intent\. The L4 loop today carries only crash\-time signals; L5 would surface SLO\-time and correctness\-time signals through the same layer\-attributed routing, and would make T3 \(declared SLOs holding under load\) the first acceptance tier the framework makes routine\. The same framework/agent split would apply: the framework would own probe dimensions, access, and attribution; the sub\-agent would synthesize intent\-specific semantic probes \(OHLCV bar continuity in trading, message\-ordering preservation in chat\) that no fixed harness ships\. L5 is the natural answer to validation without a ground\-truth oracle: it would convert “is the deployed stack correct?” from an unanswerable question into a structured search over typed probes, with attribution back to the layer that owns the violated decision\. Detail, the broader case for verifier\-driven systems work, and other roadmap axes \(operator\-algebra inference at L2, a cross\-product cost objective, semi\-automatic skill extraction, security/governance\) are in Appendix [G](https://arxiv.org/html/2605.20690#A7)\.

Open challenges this prototype does not yet solve\. The quantitative evaluation is one domain \(trading\), one model \(Claude Opus 4\.6\), and one host environment; the chat case study is descriptive rather than quantitative\. Cross\-domain and cross\-model validation \(e\.g\., open\-weight models, other proprietary frontier models\) is future work\. Attribution rules are hand\-written and measured over five enumerated fault classes \(§[4\.3](https://arxiv.org/html/2605.20690#S4.SS3)\); the false\-positive rate on out\-of\-class signals and the path to learned attribution are open\. The prototype runs on local Docker Compose and defers a catalog of production\-grade concerns; each maps to a specific DDS layer when taken on, and the mapping is in Appendix [G](https://arxiv.org/html/2605.20690#A7)\. The proof\-of\-life run sustains a single\-exchange public feed \(mean 15\.9 msg/s\); the declared 100 events/sec sustained target in Fig\. [6](https://arxiv.org/html/2605.20690#A6.F6) is deferred to L5 evaluation\. Open problems span the four layers: attribution confidence at L4, composition\-rule inference and SLO\-algebra extension at L2, intent drift and elicitation protocol at L1 \(the latter targeting the L0 elicitation sub\-step\), and skill versioning at L3; per\-layer treatment is in Appendix [H](https://arxiv.org/html/2605.20690#A8)\.
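As an illustration of the intent-specific semantic probes named in the L5 roadmap, OHLCV bar continuity could be checked by verifying that consecutive bar timestamps are exactly one interval apart. The sketch below is hypothetical, since L5 is proposed rather than implemented: it assumes bars are identified by their open timestamp in epoch seconds and that the bar interval is fixed, and `find_bar_gaps` and `probe_passes` are names introduced here for illustration.

```python
def find_bar_gaps(bar_open_times, interval_s):
    """Return (prev, next) timestamp pairs whose spacing deviates from the interval.

    bar_open_times: iterable of bar-open timestamps in epoch seconds.
    interval_s: expected spacing between consecutive bars, in seconds.
    Duplicate timestamps are collapsed before checking.
    """
    times = sorted(set(bar_open_times))
    return [
        (prev, nxt)
        for prev, nxt in zip(times, times[1:])
        if nxt - prev != interval_s
    ]


def probe_passes(bar_open_times, interval_s):
    """Probe verdict: True when every consecutive pair is one interval apart."""
    return not find_bar_gaps(bar_open_times, interval_s)
```

For example, one-minute bars opening at 0, 60, 120, and 240 seconds would yield the single gap `(120, 240)`. A typed probe of this kind returns the offending pairs rather than a bare boolean, which is the sort of evidence layer attribution would need.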

## 6 Related Work

Adjacent directions on the data\-systems side are in §[2](https://arxiv.org/html/2605.20690#S2); extended treatment of adjacent work, including the agentic\-discovery sibling line, is in Appendix [I](https://arxiv.org/html/2605.20690#A9)\.

## Appendix A Example Agent Skill: ClickHouse

Figure [5](https://arxiv.org/html/2605.20690#A1.F5.fig1) shows a trimmed excerpt of the `clickhouse.yaml` skill with one representative entry per block\. The dated comments are real attribution\-log entries: each was added after a specific failure during the learning\-loop experiment \(§[4\.3](https://arxiv.org/html/2605.20690#S4.SS3)\), which is the traceability property cited in §[3](https://arxiv.org/html/2605.20690#S3)\.

Figure 5: Trimmed `skills/clickhouse.yaml` showing one entry per block \(capabilities, operational, anti\_patterns, compositions\)\. Dated comments are real attribution\-log entries from the learning\-loop experiment\.

```yaml
skill:
  system: clickhouse
  version: "24.3"
  operator_types: [STORE, TRANSFORM]

  capabilities:
    data_models: [columnar, time_series, event]
    access_patterns: [olap, streaming]
    max_throughput: "500K inserts/sec per node"
    consistency: [eventual]

  operational:
    recommended_images:
      - "clickhouse/clickhouse-server:24.3"
    # CH exposes 9000 (native TCP) and 8123 (HTTP); 9000 is a very
    # common dev-host collision target.
    known_host_port_conflicts:
      - port: 9000
        remap_to: 19000
        reason: "CH native TCP port 9000 commonly occupied on dev hosts"

  anti_patterns:
    - scenario: "TTL expression on a DateTime64 column (CH <=24.x)"
      reason: "TTL requires DateTime or Date; DateTime64 not accepted directly"
      alternative: "Wrap: TTL toDateTime(event_time) + INTERVAL 6 MONTH"
      severity: hard_limit

    - scenario: "OLTP point updates (UPDATE/DELETE by primary key)"
      reason: "MergeTree is append-only; mutations are async background ops"
      alternative: "PostgreSQL for OLTP workloads"
      severity: hard_limit

  compositions:
    - with: kafka
      connector: kafka_engine_materialized_view
      direction: inbound
      semantics: at_least_once
      known_issues:
        - "Kafka Engine creates a virtual table; use MaterializedView to persist"
```
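One way a deployment step could consume the `known_host_port_conflicts` entries in the excerpt above is to rewrite host-side port bindings before a Compose file is emitted. The following is a minimal sketch, assuming the skill has already been parsed into a dictionary; `remap_ports` is a hypothetical helper, and the `HOST:CONTAINER` string format follows Docker Compose short port syntax rather than anything stated in the paper.

```python
def remap_ports(port_bindings, skill):
    """Rewrite host ports listed in the skill's known_host_port_conflicts.

    port_bindings: list of "HOST:CONTAINER" strings (Compose short syntax).
    Only the host side is changed; the container port is left as-is.
    """
    conflicts = skill.get("operational", {}).get("known_host_port_conflicts", [])
    remap = {c["port"]: c["remap_to"] for c in conflicts}
    result = []
    for binding in port_bindings:
        host, container = binding.split(":")
        host_port = remap.get(int(host), int(host))
        result.append(f"{host_port}:{container}")
    return result


# Mirrors the operational block of the excerpt above.
clickhouse_skill = {
    "operational": {
        "known_host_port_conflicts": [
            {
                "port": 9000,
                "remap_to": 19000,
                "reason": "CH native TCP port 9000 commonly occupied on dev hosts",
            },
        ]
    }
}
```

With this skill, the bindings `["9000:9000", "8123:8123"]` become `["19000:9000", "8123:8123"]`: the container still listens on 9000, and only the host side moves.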

## Appendix B Per\-run detail for DDS one\-shot \(no outer loop\)

Table [7](https://arxiv.org/html/2605.20690#A2.T7) reports every run of the n=10 DDS one\-shot configuration *without* the outer 5\-iter feedback loop used in Table [3](https://arxiv.org/html/2605.20690#S4.T3)\. The headline DDS row uses the outer loop \(T0/T1/T2 at 10/10, 10/10, 10/10 at $1\.94, 44 turns\); this appendix shows what DDS reaches with the contract surface alone \(T1 8/10, T2 3/10, median $1\.49, median 17\.5 turns\)\. All runs use the same intent, same model, same tool access, and the canonical skills directory\.

Table 7: Per\-run detail for the *n* = 10 DDS one\-shot configuration \(no outer loop\)\.

The two T1 failures are diagnosable: run 2 hit a boot\-failure signal at L4 that auto\-repair did not clear within the turn budget, and run 10 hit a host\-port conflict \(the same fault class as the first\-deployment failure F2 in §[4\.3](https://arxiv.org/html/2605.20690#S4.SS3)\) on a host whose port policy was not yet captured in the profile\. The five T1\-but\-not\-T2 runs \(1, 3, 5, 7, 8\) break down as 3 `smoke_query_error` and 2 unsignalled \(runs 5 and 7, with 0 typed signals in the table\), all consistent with materialized\-view\-population timing combined with agent non\-determinism in MV column naming: the stacks are running, and the smoke probe window does not account for producer\-priming variance\. Run 5 reports T0 = N with T1 = Y because the T0 syntax check flagged a non\-blocking script \(the `verify.py` smoke verifier in Table [2](https://arxiv.org/html/2605.20690#S4.T2)\) rather than a service artifact; the six container artifacts were syntactically valid, so the stack still booted to steady state and passed T1 \(the smoke probe itself failed at T2, consistent with the same root cause as the other T1\-but\-not\-T2 runs\)\. The DDS\-with\-outer\-loop variant in Appendix [C](https://arxiv.org/html/2605.20690#A3) is a separate *n* = 10 batch in which the same root cause closes after at most one extra iteration, taking T2 from 3/10 \(one\-shot\) to 10/10 \(with outer loop\)\.
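The pass counts quoted for this configuration can be cross-checked from the run numbers listed in the paragraph. The short Python sketch below does only that set arithmetic; it assumes the two run lists in the text are complete, and it derives the T2-passing runs rather than reading them from the table.

```python
runs = set(range(1, 11))        # the ten one-shot runs
t1_failed = {2, 10}             # boot failure, host-port conflict
t1_not_t2 = {1, 3, 5, 7, 8}     # booted, but the smoke probe failed

t1_passed = runs - t1_failed
t2_passed = t1_passed - t1_not_t2

print(len(t1_passed), sorted(t2_passed))  # 8 [4, 6, 9]
```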

## Appendix C Per\-run detail for DDS with 5\-iter outer loop

Table [8](https://arxiv.org/html/2605.20690#A3.T8) reports every run of the *n* = 10 DDS configuration that uses the same 5\-iteration outer feedback loop as Conditions A/B/C in Table [3](https://arxiv.org/html/2605.20690#S4.T3)\. The L1/L2/L3 framework pipeline is computed once per run \(deterministic for the fixed intent\); only the L4 deployment sub\-agent is re\-invoked on T0/T1/T2 failures with the original brief plus the failure log from the previous iteration\. All runs use the same intent, model, and tool access as Appendix [B](https://arxiv.org/html/2605.20690#A2); the only difference is the outer loop\.
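The control flow described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `deploy` and `accept` are hypothetical stand-ins for the L4 sub-agent and the T0/T1/T2 acceptance harness, and the toy functions at the bottom exist only to make the example runnable.

```python
from typing import Callable, Optional, Tuple

MAX_ITERS = 5  # outer-loop budget stated in this appendix


def run_outer_loop(
    brief: str,
    deploy: Callable[[str, Optional[str]], str],
    accept: Callable[[str], Tuple[bool, str]],
) -> Tuple[bool, int]:
    """Re-invoke only the L4 deployment step until acceptance passes.

    The brief (the output of L1-L3) is computed once by the caller and is
    never regenerated inside the loop.
    """
    failure_log: Optional[str] = None
    for iteration in range(1, MAX_ITERS + 1):
        artifact = deploy(brief, failure_log)  # original brief + previous log
        passed, log = accept(artifact)
        if passed:
            return True, iteration
        failure_log = log
    return False, MAX_ITERS


# Toy stand-ins: the first attempt fails, the second passes.
def fake_deploy(brief, failure_log):
    return "v2" if failure_log else "v1"


def fake_accept(artifact):
    return (artifact == "v2", "T2: smoke_query_error")


print(run_outer_loop("brief", fake_deploy, fake_accept))  # (True, 2)
```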

Table 8: Per\-run detail for the *n* = 10 DDS \+ 5\-iter outer\-loop configuration\. “First → T1” is the iteration in which T1 first passed \(smaller is better; iteration 1 = no outer feedback needed\)\.

Two patterns are worth noting\. \(i\) Convergence is fast: 6 of 10 runs reached T1 within 2 iterations \(2 of 10 in iter 1 with no outer feedback needed, plus 4 of 10 in iter 2\); the median is 2 iterations to T1, and all 10 close by iter 4 \(2 runs each at iter 3 and iter 4\)\. \(ii\) The outer loop’s marginal cost over one\-shot DDS is small: median turns rose from 17\.5 \(one\-shot\) to 44 \(with outer loop\) and median cost from $1\.49 to $1\.94, while T1 went 8/10 → 10/10 and T2 went 3/10 → 10/10\. Even the most expensive DDS\+iter run \(run 7 at 141 turns, $4\.10\) is below the median cost of every iterated baseline \(A: 268 turns/$5\.76; B: 209/$6\.23; C: 229/$5\.06\)\.
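The convergence figures in (i) follow directly from the stated counts. The Python sketch below reconstructs the distribution from those counts and recomputes the summary; the per-run ordering is not given in the text, so only the multiset is meaningful here.

```python
from statistics import median

# Iteration at which T1 first passed, rebuilt from the counts in the text:
# 2 runs at iter 1, 4 at iter 2, 2 at iter 3, 2 at iter 4.
first_t1 = [1] * 2 + [2] * 4 + [3] * 2 + [4] * 2

within_two = sum(1 for i in first_t1 if i <= 2)
print(len(first_t1), within_two, median(first_t1), max(first_t1))
# 10 6 2.0 4
```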

## Appendix D Second case study: a chat platform

To stress the operator algebra beyond the trading workload, we ran DDS on a Signal\-like consumer chat platform\. The original operator set \(`INGEST`, `STORE`, `TRANSFORM`, `SERVE`, `CACHE`, `QUEUE`\) was insufficient, and the planner added `ROUTE`, `NOTIFY`, and `INDEX` to handle message delivery, push notifications, and full\-text search\. The physical mapping chosen by the planner: Kafka → NATS → ScyllaDB \+ PostgreSQL \+ Elasticsearch \+ S3\. Two surprises shaped our view of what L2 should represent\. First, security \(end\-to\-end encryption\) acts as an *architectural force* that constrains which component can hold which key, rather than a checkbox attached to any single component; it is a first\-class intent dimension that cross\-cuts L1 and L3\. Second, ephemeral versus persistent data \(presence indicators, typing indicators\) is an *access\-pattern* property rather than a store property, and belongs at L2\. This run is descriptive rather than quantitative: its role is to stress algebra extensibility, which is the open\-operator property at L2, and to surface dimensions the trading workload does not exercise\.
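One way to picture the open-operator property is a registry that reports which operator types a new DAG needs but the current set lacks. The Python sketch below is illustrative only: the `OperatorRegistry` class and the DAG node names are invented for this example, while the operator names come from the paragraph above.

```python
class OperatorRegistry:
    """Hypothetical open set of L2 operator types."""

    def __init__(self, base):
        self._ops = set(base)

    def register(self, name):
        self._ops.add(name)

    def missing(self, dag_nodes):
        """Operator types used by a DAG that the registry does not know."""
        return sorted({op for _, op in dag_nodes} - self._ops)


registry = OperatorRegistry(["INGEST", "STORE", "TRANSFORM", "SERVE", "CACHE", "QUEUE"])

# Illustrative chat DAG as (node_name, operator_type); node names are invented.
chat_dag = [
    ("messages_in", "INGEST"),
    ("delivery", "ROUTE"),
    ("push", "NOTIFY"),
    ("search", "INDEX"),
    ("history", "STORE"),
]

print(registry.missing(chat_dag))  # ['INDEX', 'NOTIFY', 'ROUTE']
for op in registry.missing(chat_dag):
    registry.register(op)
print(registry.missing(chat_dag))  # []
```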

## Appendix E Per\-run failure analysis for iterated baselines \(A, B, C\)

This appendix complements Table [3](https://arxiv.org/html/2605.20690#S4.T3) with per\-run trajectories from the iterated baselines\. All three conditions use the same 5\-iteration feedback loop modeled on real\-world Claude\-Code usage: after each codegen attempt, the harness runs T0/T1/T2 acceptance and feeds the failure log back to the agent for editing in the next iteration\. Each run gets up to 5 iterations, with no per\-run cost or wall\-clock cap\. Conditions vary only in iter\-1 prompt content \(A: NL only; B: NL \+ system names; C: \+ skill YAMLs as prose\)\.

#### Per\-condition summary\.

Across the 10 A runs, T0 passed in every case \(eventually\) but T1 closed only twice; six of the eight T1 failures hit `iterations_exhausted` after 5 rounds with no convergence\. Median total turns across A is 268 \(∼15× DDS’s median 17\.5 turns; §[4\.2](https://arxiv.org/html/2605.20690#S4.SS2), Appendix [B](https://arxiv.org/html/2605.20690#A2)\)\. We annotate three failure trajectories that capture distinct ways iteration fails to close the gap, plus one success trajectory showing what convergence looks like under iteration alone\.

### Failure trajectory 1: port conflict resolved, then a different system gets stuck \(run 1\)

Outcome: 5 iterations, 249 turns, ∼28 min wall; T0 ✓, T1 ✗, T2 ✗\.

- Iter 1: 48 turns, 0 in\-scope edits \(T0 ✗\)\. Agent explored without producing complete artifacts\.
- Iter 2: 60 turns, 14 in\-scope edits \(T0 ✓, T1 ✗\)\. Stack now boots far enough to hit `address already in use` on PostgreSQL host port 5432\.
- Iter 3: 45 turns, 2 in\-scope edits\. Agent remaps the PostgreSQL port\. T1 still fails: `container redpanda is unhealthy`\.
- Iter 4–5: 38 \+ 58 turns, 1 \+ 1 in\-scope edits\. The Redpanda failure persists; the agent guesses at config without resolving it\.

Root cause\. The Redpanda container becomes unhealthy because the agent’s chosen broker configuration is incomplete or wrong, but the failure feedback the harness pipes back is only `dependency failed to start: container redpanda is unhealthy`\. Without seeing the broker’s own log, the agent cannot diagnose which flag to fix; it edits adjacent configuration files \(1–2 edits per iteration\) hoping to influence the symptom\. Five iterations and 249 turns yield no convergence\.

How DDS prevents this\. The Kafka skill \(Appendix [A](https://arxiv.org/html/2605.20690#A1)\) lists `recommended_images: [apache/kafka:3.7.0, ...]` and omits Redpanda\. The L3 planner therefore selects apache/kafka, the L4 brief requires citing the skill’s `compositions[with: clickhouse]` connector pattern in the generated artifact, and `operational.known_host_port_conflicts` pre\-empts the 5432 conflict\. DDS reaches T1 in a single deployment without iteration\.

### Failure trajectory 2: iteration regresses a working stack \(run 5\)

Outcome: 5 iterations, 273 turns, ∼47 min wall; T0 ✓, T1 ✗, T2 ✗\.

- Iter 1: 36 turns, 20 in\-scope edits\. *T0 ✓, T1 ✓*, T2 ✗\. The agent’s first\-shot TimescaleDB\-based stack actually *boots cleanly*; only the smoke query fails, plausibly a one\-line fix\.
- Iter 2: 78 turns, 5 in\-scope edits\. The agent restructures the topology \(adding Kafka and switching from TimescaleDB to a more elaborate stack\) to “fix” T2\. The new compose hits a container\-name conflict because the old container from iter 1 was still present\. T1 ✗\.
- Iter 3–5: 27 \+ 78 \+ 54 turns\. Container\-name conflict resolved; now `kafka is unhealthy` across all three iterations\. The agent never recovers the working topology it had in iter 1\.

Root cause\. Iteration here is actively harmful: the agent’s response to a T2 failure was an *architectural rewrite*, not a focused fix\. With no contract pinning the L2 topology, the agent re\-derives the system selection at every iteration boundary, which can pivot the entire stack on a single feedback signal\.

How DDS prevents this\. The L1 intent and L2 typed DAG commit to a topology before any codegen\. The L4 deployment brief is structured: it names which artifacts to produce, which skill fields to cite, and which acceptance checks must pass\. The agent edits within these contracts; an architectural pivot is structurally impossible without a pattern\-layer \(L2\) revision, which is owned by the user, not the L4 sub\-agent\. T2 fixes therefore stay at the L4 codegen level \(e\.g\., a smoke\-query fix\) and never become architectural changes\.
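The pinning idea can be illustrated with a simple set comparison between the systems a plan committed to and the systems an edited artifact actually uses. This Python sketch is a toy, not the framework's mechanism: the function name is hypothetical, and the contents of `iter2` beyond Kafka are invented, since the text says only that the agent moved to "a more elaborate stack".

```python
def check_within_contract(pinned_systems, artifact_systems):
    """Compare an L4 artifact's systems against the pinned plan.

    Returns (ok, added, dropped). Both arguments are sets of system names.
    """
    added = sorted(artifact_systems - pinned_systems)
    dropped = sorted(pinned_systems - artifact_systems)
    return (not added and not dropped), added, dropped


# Toy version of run 5: iter 1 used TimescaleDB; iter 2 pivoted the stack.
pinned = {"timescaledb"}
iter1 = {"timescaledb"}
iter2 = {"kafka", "postgresql"}  # illustrative contents

print(check_within_contract(pinned, iter1))  # (True, [], [])
print(check_within_contract(pinned, iter2))  # (False, ['kafka', 'postgresql'], ['timescaledb'])
```

A check of this shape would flag the iter-2 rewrite as a topology change before any container is started, leaving the T2 failure to be handled as a codegen-level fix.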

### Failure trajectory 3: same error, five iterations \(run 7\)

Outcome: 5 iterations, 326 turns, ∼35 min wall; T0 ✓, T1 ✗, T2 ✗\.

- Iter 1: 64 turns, 0 in\-scope edits \(T0 ✗\)\. Initial output incomplete\.
- Iter 2: 131 turns, 20 in\-scope edits\. T0 ✓, T1 ✗ with `redpanda is unhealthy`\.
- Iter 3–5: 34 \+ 44 \+ 53 turns; 1 \+ 1 \+ 1 in\-scope edits\. The same `redpanda is unhealthy` error recurs every iteration\. The agent identifies the symptom but cannot find the root cause from the failure feedback alone\.

Root cause\. The symptom is observable \(Redpanda unhealthy\) but the cause \(a specific incompatible broker flag, or a missing network exposure\) is invisible to the agent unless it reads the container logs directly\. The harness’s failure feedback relays the docker compose stderr, which reports only the symptom\. Iteration in the absence of root\-cause attribution becomes a guessing loop\.

How DDS prevents this\. The L4 attribution loop *types* the runtime signal\. A persistent `healthcheck_unhealthy` on a known system is classified as `L3_skill` \(composition gap\) rather than `L4` \(codegen slip\), and the framework routes it to a skill\-field edit instead of a codegen retry\. Either the skill already covers the flag \(the loop closes\) or the skill is patched once \(and every future deployment of that system inherits the fix; §[4\.3](https://arxiv.org/html/2605.20690#S4.SS3)\)\.
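A minimal Python sketch of this classification rule follows. The labels `L3_skill` and `L4` and the signal name come from the text; everything else is an assumption for illustration, in particular the threshold that defines "persistent" and the choice to treat a first occurrence as a codegen slip.

```python
from collections import Counter

PERSIST_THRESHOLD = 2  # assumption: "persistent" = seen in at least 2 iterations


def attribute(signal, system, history, known_systems):
    """Return the layer label for a runtime signal.

    `history` is a Counter of (signal, system) pairs from earlier iterations
    and is updated in place.
    """
    history[(signal, system)] += 1
    persistent = history[(signal, system)] >= PERSIST_THRESHOLD
    if signal == "healthcheck_unhealthy" and system in known_systems and persistent:
        return "L3_skill"  # route to a skill-field edit
    return "L4"            # route to a codegen retry


history = Counter()
known = {"kafka", "clickhouse"}
print(attribute("healthcheck_unhealthy", "kafka", history, known))  # L4
print(attribute("healthcheck_unhealthy", "kafka", history, known))  # L3_skill
```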

### Success trajectory: convergence in 2 iterations \(run 10\)

Outcome: 2 iterations, 77 turns, ∼16 min wall; T0 ✓, T1 ✓, T2 ✓; `stop_reason=passed`\.

- Iter 1: 27 turns, 17 in\-scope edits\. T0 ✓, T1 ✓ \(the stack actually boots\), T2 ✗ with `smoke_query_error`: the verifier query references a column the agent named slightly differently in the DDL\.
- Iter 2: 50 turns, 6 in\-scope edits\. T0 ✓, T1 ✓, T2 ✓\. Agent reconciles column names across the producer, DDL, and verifier\.

Why it works\. This is the canonical pattern where iteration helps: iter 1 produced a bootable stack, the failure was a localized name mismatch, and iter 2 made the focused fix without restructuring anything\. No architectural pivot, no symptom\-vs\-root\-cause gap\.
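The localized error class in this trajectory, a column name that differs between the DDL and the verifier, is the kind of mismatch a plain set difference exposes. The Python sketch below uses invented column names; the actual names from run 10 are not given in the text.

```python
def column_mismatches(ddl_columns, producer_fields, verifier_columns):
    """Report names that the producer or verifier use but the DDL lacks."""
    ddl = set(ddl_columns)
    return {
        "producer": sorted(set(producer_fields) - ddl),
        "verifier": sorted(set(verifier_columns) - ddl),
    }


# Invented column names standing in for the run-10 mismatch.
ddl = ["symbol", "event_time", "close_price"]
producer = ["symbol", "event_time", "close_price"]
verifier = ["symbol", "event_time", "close"]

print(column_mismatches(ddl, producer, verifier))
# {'producer': [], 'verifier': ['close']}
```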

Comparison with DDS\. The same workload took DDS a median of 17\.5 turns at $1\.49 to reach T1, with separate runs reaching T2; the iterated raw agent here used 77 turns at ∼$3 to reach T2 on this single successful run\. The other A success \(run 6\) needed 3 iterations and 289 turns\. The framework contribution is not that iteration cannot work \(it can, sometimes\), but that the per\-system composition knowledge an iterated agent must rediscover at every run is, in DDS, a reusable artifact \(the L3 skill\) edited once and cited inline thereafter\.

### What this tells us

Three observations compress the A trajectories\. \(i\) The dominant failure mode under iteration is not the absence of feedback but the inability to act on it: when the symptom is generic \(X is unhealthy\) and the root cause is in the system’s own log, iteration loops without progress \(run 7\)\. \(ii\) Iteration can actively regress a stack when the agent’s response to a downstream failure is an architectural pivot rather than a focused fix \(run 5\); a contract that pins L2 prevents this\. \(iii\) When iteration does converge, it does so on a localized error class \(smoke\-query column mismatch, run 10\) at substantial cost in turns; that cost is exactly what DDS’s skill artifact amortizes across deployments\.

## Appendix F Supplementary figures from the trading case study

This section collects supplementary material referenced from the main body\. Figure [6](https://arxiv.org/html/2605.20690#A6.F6) shows the typed intent for the trading workload \(§[4](https://arxiv.org/html/2605.20690#S4)\); the headline pass rates and the attribution\-and\-learning loop are reported in tabular form \(Tables [3](https://arxiv.org/html/2605.20690#S4.T3) and [6](https://arxiv.org/html/2605.20690#S4.T6)\)\.

```
intent:
  data_model:
    entities: [market_tick, ohlcv_bar, position, order]
    primary_types: [time_series, relational, event]

  access_pattern:
    read:  [olap_range_scan, point_lookup, streaming]
    write: [high_throughput_append, transactional_update]

  scale:
    ingest_rate_events_per_sec: 100
    retention_history_years: 5
    concurrent_users: 1

  latency:
    point_lookup_p99_ms: 10
    analytical_query_p99_ms: 2000

  consistency:
    ohlcv_aggregate: eventual
    positions: strong

  cost:
    monthly_usd_budget: 100       # ceiling: build + maintain
    preference: simplicity        # soft warning: under-specified
```

Figure 6: Trading\-workload intent populating all six L1 dimensions \(§[3](https://arxiv.org/html/2605.20690#S3)\)\. The framework emits one soft warning \(cost preference under\-specified, defaulted to *simplicity*\) and no hard errors\.

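To make the hard-error versus soft-warning distinction concrete, here is a small Python sketch of an intent check. The six dimension names are taken from the figure; the rules themselves are assumptions for illustration (a missing dimension is treated as a hard error, and a missing `cost.preference` is defaulted to `simplicity` with a soft warning), and `validate_intent` is a hypothetical name.

```python
REQUIRED_DIMENSIONS = [
    "data_model", "access_pattern", "scale", "latency", "consistency", "cost",
]


def validate_intent(intent):
    """Return (hard_errors, soft_warnings, normalized_intent)."""
    errors = [f"missing dimension: {d}" for d in REQUIRED_DIMENSIONS if d not in intent]
    warnings = []
    normalized = {k: dict(v) for k, v in intent.items()}  # shallow copy per dimension
    cost = normalized.get("cost")
    if cost is not None and "preference" not in cost:
        cost["preference"] = "simplicity"
        warnings.append("cost.preference under-specified; defaulted to simplicity")
    return errors, warnings, normalized


intent = {
    "data_model": {"entities": ["market_tick", "ohlcv_bar", "position", "order"]},
    "access_pattern": {"read": ["olap_range_scan", "point_lookup", "streaming"]},
    "scale": {"ingest_rate_events_per_sec": 100},
    "latency": {"point_lookup_p99_ms": 10},
    "consistency": {"positions": "strong"},
    "cost": {"monthly_usd_budget": 100},  # preference left out on purpose
}

errors, warnings, normalized = validate_intent(intent)
print(errors)                             # []
print(warnings)
print(normalized["cost"]["preference"])   # simplicity
```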
## Appendix G Extended Discussion: L5 Evaluator Layer

This appendix expands the L5 evaluator roadmap sketched in §[5](https://arxiv.org/html/2605.20690#S5)\. L5 is a design proposal, not a layer implemented in the prototype evaluated in §[4](https://arxiv.org/html/2605.20690#S4); the descriptions below use the conditional or future tense to make this explicit\.

#### What the framework would own\.

A per\-intent brief would derive mandatory probe dimensions from the intent: every declared access pattern would have a query path, every declared SLO on latency, throughput, no\-loss, and freshness would be measured, and declared capacity would be compared against measured footprint\. The framework would also own an access contract for what the evaluator may read or mutate, a report schema, and benchmark\-failure\-to\-layer attribution rules that would route a probe failure to the layer owning the violated decision, parallel to the L4 loop\.
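Since L5 is a proposal rather than an implemented layer, the following Python sketch is speculative by construction. It shows one way mandatory probe dimensions could be derived mechanically from an intent shaped like Figure 6; the probe kind names and the function `derive_probes` are invented for this example.

```python
def derive_probes(intent):
    """List probe dimensions implied by an intent (hypothetical schema)."""
    probes = []
    # Every declared access pattern gets a query path.
    for mode, patterns in intent.get("access_pattern", {}).items():
        for pattern in patterns:
            probes.append(("query_path", f"{mode}:{pattern}"))
    # Every declared latency SLO gets measured.
    for slo in intent.get("latency", {}):
        probes.append(("slo_measurement", slo))
    # Declared capacity gets compared against measured footprint.
    for key in intent.get("scale", {}):
        probes.append(("capacity_vs_footprint", key))
    return probes


intent = {
    "access_pattern": {"read": ["point_lookup"], "write": ["high_throughput_append"]},
    "latency": {"point_lookup_p99_ms": 10},
    "scale": {"ingest_rate_events_per_sec": 100},
}

for probe in derive_probes(intent):
    print(probe)
```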

#### What the agent would own\.

The evaluator sub\-agent would generate per\-domain artifacts \(load driver, correctness probe, resilience probe\), each citing the skill fields it targets, and a small hand\-authored golden probe set per domain would serve as cross\-validation, catching agent regressions on the probes themselves\. The evaluator would mix three modes that each catch a distinct failure class: black\-box \(produce traffic at declared ingress, observe at declared egress, assert against the user contract\), white\-box \(read intent, plan, deployment, and skills to check realizability before any traffic\), and gray\-box \(container logs, metrics endpoints, and system tables to cross\-check declared against actual\)\. The agent’s contribution over a fixed harness is that it would compose intent\-specific semantic probes alongside conventional metrics: OHLCV bar continuity and aggregate correctness in trading, message\-ordering preservation under retransmission in chat, and similar domain\-meaningful invariants that no design\-time benchmark suite ships\.

#### Why L5 leads the roadmap\.

L5 would close the architecture: the L4 loop today carries only crash\-time signals, while the evaluator would surface SLO\-time and correctness\-time signals that route back to every earlier layer \(capacity\-vs\-footprint mismatch to intent revision at L1, SLO violation to plan alternatives at L2, anti\-pattern hits to skill patches at L3, codegen slips to code patches at L4\), and T3 \(declared SLOs holding under load\) would become the first acceptance tier the evaluator makes routine\.
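The four routes named in this paragraph amount to a lookup table from signal class to owning layer. The Python sketch below transcribes them; the dictionary keys are naming assumptions made here, and the error on unknown classes is a design choice of the sketch, not something the paper specifies.

```python
# Signal class -> (owning layer, response), transcribed from the paragraph above.
ROUTES = {
    "capacity_vs_footprint_mismatch": ("L1", "intent revision"),
    "slo_violation": ("L2", "plan alternatives"),
    "anti_pattern_hit": ("L3", "skill patch"),
    "codegen_slip": ("L4", "code patch"),
}


def route(signal_class):
    """Return the layer that owns the violated decision, or raise if untyped."""
    try:
        return ROUTES[signal_class]
    except KeyError:
        raise ValueError(f"unrouted signal class: {signal_class}") from None


print(route("slo_violation"))     # ('L2', 'plan alternatives')
print(route("anti_pattern_hit"))  # ('L3', 'skill patch')
```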

#### Production\-grade gaps and where they land in DDS\.

Where we are: the prototype runs on local Docker Compose and clears a smoke\-level acceptance bar \(boot, healthchecks, end\-to\-end query\) on a single intent in a single host environment; it is proof\-of\-life, not production\-grade\. The gap to a real backend spans five concern groups: behavior under load \(load testing at declared scale, backpressure, cost measurement under real workload\), correctness under failure \(durability, replay semantics, delivery guarantees, persistent\-volume recovery\), operational lifecycle \(schema migration, high availability\), broader deployment scope \(cloud targets beyond Compose, secrets, security boundaries\), and observability \(alerting, SLO cross\-checking\)\. Each of these lands at a specific DDS layer rather than at the architecture itself\. Load and observability concerns are L5 evaluator targets, with signals routing back to L1 \(intent revision\) and L3 \(skill patches\)\. Failure\-correctness and lifecycle concerns express as L1 declared dimensions plus L3 skill content\. Cloud deployment beyond Compose is an L4 codegen target rather than an architectural gap, since the L3 plan is product\-list\-plus\-config and neutral to the runner\. Security and secrets are a cross\-cutting contract over L1, L3, and host policy, which the chat case study \(Appendix[D](https://arxiv.org/html/2605.20690#A4)\) flagged as architectural rather than as a checkbox\. Closing these gaps is engineering inside the existing typed contracts, not restructuring of those contracts\.

#### Other roadmap axes\.

At L2, the operator algebra is open: each new application domain adds operators \(the chat platform in Appendix [D](https://arxiv.org/html/2605.20690#A4) added `ROUTE`, `NOTIFY`, and `INDEX`\), and a principled path from these additions back into the L2 type system, with composition\-rule inference rather than hand edits, is the next algebraic step\. A cost\-objective formulation over multi\-system topologies, analogous to a physical query optimizer but across products rather than within one, is a natural extension that takes its cost data from L3 skills and its ranking from the L2 planner\. Skills are expert\-authored today, and the next phase is to extract composition rules and anti\-patterns semi\-automatically from documentation, incident reports, and post\-mortems, with the combined L4 attribution log and L5 evaluator reports as training signal\. Two further extensions add constraints rather than capabilities and stay within the existing structure: multi\-tenant deployment and cross\-cloud data residency contribute intent dimensions at L1 and composition constraints at L3, and security and governance, which the chat case study flagged as architectural rather than as a checkbox, become a contract layer cross\-cutting L1 and L3 over who may hold what key and where data may live\.

## Appendix H Extended Open Problems by Layer

This appendix expands the open problems sketched in §[5](https://arxiv.org/html/2605.20690#S5)\. Two questions live at L1 because the L1 contract has both a static intent specification \(subject to drift\) and a draft\-producing elicitation sub\-step \(depicted as L0 in Fig\. [2](https://arxiv.org/html/2605.20690#S3.F2)\); the others sit at L4 and L3\.

#### L4: attribution confidence\.

Some runtime signals \(consumer\-lag, p99 violation\) are genuinely ambiguous: they may attribute to an L2 pattern choice \(the topology cannot meet the SLO\) or an L3 product choice \(this product cannot meet the SLO under this configuration\), depending on workload context\. The fault\-injection harness in §[4\.3](https://arxiv.org/html/2605.20690#S4.SS3) marks consumer\-lag as ambiguous and accepts either label\. A principled confidence model that decides when to apply a skill patch automatically, when to ask the user, and when to consider a re\-plan remains to be designed\. The natural training signal is the attribution log itself, paired with the eventual fix and the post\-fix outcome; the L5 evaluator above is the natural source of richer labeled signals once it lands\.
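The "accepts either label" behavior can be sketched as a scoring function in which ambiguous signals carry a set of accepted labels. This Python fragment is only an illustration of that scoring idea and says nothing about the confidence model that remains open; the label strings and the `port_conflict` signal name are invented here.

```python
# Signals the harness treats as ambiguous, with every label it would accept.
# Only consumer_lag is stated in the text to be handled this way.
ACCEPTED_LABELS = {
    "consumer_lag": {"L2_pattern", "L3_product"},
}


def score(signal, predicted, expected):
    """True if the predicted layer label counts as correct for this signal."""
    accepted = ACCEPTED_LABELS.get(signal, {expected})
    return predicted in accepted


print(score("consumer_lag", "L2_pattern", "L3_product"))  # True
print(score("port_conflict", "L4", "L3_product"))         # False
```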

#### L1: intent drift\.

A real workload evolves: traffic mix changes, consistency requirements relax or tighten, and an intent signed off today may mis\-describe tomorrow’s traffic\. The right refresh cadence and the right owner of that refresh \(user, planner sub\-agent, or framework\) is unsettled\. Periodic re\-validation against observed metrics is one direction; user\-initiated revision is another; an evaluator\-triggered drift signal \(declared capacity vs\. measured footprint\) is a third\. The cost of staleness is borne by every downstream layer, so the dial belongs above L1 rather than inside any sub\-agent\.

#### L3: skill versioning and deprecation\.

The systems beneath skills change underneath us: Kafka 3 to 4, ClickHouse storage formats, vendor image deprecation, client\-library breaking changes\. Skill fields that name images, versions, or storage formats go stale\. A principled versioning and deprecation policy that lets old skills age out without losing the attribution\-log history attached to them is open\. The naive solution \(rewrite the skill\) loses the trail of which past failures motivated which fields\. A better approach versions fields rather than files and preserves the lineage from each runtime signal to the field it patched\.
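Versioning fields rather than files could look like the Python sketch below, where each field keeps an append-only history and every version records the attribution-log entry that motivated it. The `SkillField` class, the log id `attr-log-0042`, and the newer image tag are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class SkillField:
    """One skill field with its full edit lineage (hypothetical structure)."""
    path: str
    history: List[Tuple[Any, Optional[str]]] = field(default_factory=list)

    def patch(self, value: Any, signal_id: Optional[str] = None) -> None:
        """Append a new version; signal_id links to an attribution-log entry."""
        self.history.append((value, signal_id))

    @property
    def current(self) -> Any:
        return self.history[-1][0]

    def lineage(self) -> List[Optional[str]]:
        """Signal ids that motivated each version, oldest first."""
        return [sid for _, sid in self.history]


image = SkillField("operational.recommended_images")
image.patch(["clickhouse/clickhouse-server:24.3"])                   # initial authoring
image.patch(["clickhouse/clickhouse-server:24.8"], "attr-log-0042")  # invented id and tag
print(image.current)
print(image.lineage())  # [None, 'attr-log-0042']
```

Because old versions are never overwritten, the trail from each runtime signal to the field it patched survives a skill update.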

#### L1 \(elicitation sub\-step\): protocol\.

A draft intent is only as good as the dialogue that produced it\. The trade\-off between dialogue length and intent coverage across the six L1 dimensions has not been measured\. The right protocol minimizes user effort while maximally covering the six dimensions and surfaces under\-specification cheaply \(the cost\-preference soft warning in the trading walkthrough is one example\)\. How to bound the dialogue without leaving dimensions blank, how to handle conflicts between user\-stated and inferred values, and how to elicit numeric envelopes without forcing premature precision are open design questions\.

## Appendix I Extended Related Work

#### Agentic discovery systems\.

DDS sits in the recent line of agentic\-discovery systems that impose structure on an LLM\-driven search space\. AlphaEvolve \[[38](https://arxiv.org/html/2605.20690#bib.bib56)\] structures evolutionary search over programs with benchmark verifiers; SkyDiscover \[[36](https://arxiv.org/html/2605.20690#bib.bib58)\], OpenEvolve \[[47](https://arxiv.org/html/2605.20690#bib.bib59)\], and ShinkaEvolve \[[31](https://arxiv.org/html/2605.20690#bib.bib60)\] extend the same paradigm to broader algorithmic and scientific discovery\. GEPA \[[1](https://arxiv.org/html/2605.20690#bib.bib34)\] structures prompt\-optimization with reflective natural\-language attribution from agent traces; ACE \[[56](https://arxiv.org/html/2605.20690#bib.bib35)\] structures context evolution; DSPy \[[27](https://arxiv.org/html/2605.20690#bib.bib33)\] compiles declarative LM\-module graphs; Meta\-Harness \[[32](https://arxiv.org/html/2605.20690#bib.bib32)\] jointly optimizes the harness\. Glia \[[25](https://arxiv.org/html/2605.20690#bib.bib57)\] is the closest sibling on the systems side, structuring multi\-agent reasoning over distributed\-systems designs to produce expert\-level configurations\. DDS extends this paradigm to a different search space \(multi\-system data backends\) with a different verifier \(deployment outcome rather than a benchmark\) and a different memory unit \(per\-system editable skill artifacts instead of per\-task prompts\)\. Empirical work on agent reliability \(multi\-agent failure modes \[[9](https://arxiv.org/html/2605.20690#bib.bib39)\], multi\-agent under\-performance \[[40](https://arxiv.org/html/2605.20690#bib.bib40)\], production\-agent reliability \[[39](https://arxiv.org/html/2605.20690#bib.bib47)\]\) grounds the architectural target: typed failure routing across system boundaries and composition knowledge as an editable, citable artifact\.

The systems below are *adjacent* to DDS rather than closest neighbors: most operate at a different abstraction level \(operators inside one engine, contracts inside one warehouse, knobs inside one product, queries against one assumed schema, code edits inside one repository\) than DDS’s cross\-system composition target\. Coding agents like Claude Code \(§[2](https://arxiv.org/html/2605.20690#S2)\) are the closest neighbor on the deployment side; Glia \(above\) is the closest on the agentic\-discovery side\. Everything below sits one level over from those\.

#### LLM\-native pipelines \(different abstraction level\)\.

LOTUS \[[41](https://arxiv.org/html/2605.20690#bib.bib17)\], DocETL \[[46](https://arxiv.org/html/2605.20690#bib.bib18)\], and Palimpzest \[[34](https://arxiv.org/html/2605.20690#bib.bib19)\] introduce declarative semantic operators \(map, filter, join, aggregate over unstructured data\) with cost–quality optimization inside a single LLM\-native engine\. These are declarative over *operators that run LLMs on data*, while DDS is declarative over *which heterogeneous systems implement which operators across a multi\-system backend*\. The two abstractions compose, and a DDS `TRANSFORM` node can be backed by such a semantic\-operator pipeline at L3, but they sit one level apart and are not the same problem\.

#### Cross\-project composition in the modern data stack \(extended\)\.

dbt Mesh \[[19](https://arxiv.org/html/2605.20690#bib.bib54)\] adds typed contracts and cross\-project references on top of dbt, lifting the dbt model from a single project to a federation of projects with explicit producer/consumer contracts; Apache Iceberg \[[5](https://arxiv.org/html/2605.20690#bib.bib55)\] provides a typed table format with schema and partitioning evolution that several engines can share\. Both target the boundary problem DDS occupies but stay within the data\-warehouse perimeter: contracts are expressed between SQL projects or between query engines reading the same table format, not between heterogeneous systems \(a queue, an OLTP store, a cache, a search index\) chosen against an intent\. DDS’s L2 operator DAG and L3 skill contract are designed exactly to cross those system boundaries, and an Iceberg table or a dbt\-Mesh contract is a natural physical\-layer instance at a DDS `STORE` or `TRANSFORM` node\.

#### Coding\-agent landscape \(extended\)\.

Beyond the benchmarks anchored in §[2](https://arxiv.org/html/2605.20690#S2), the coding\-agent landscape also includes SWE\-agent \[[53](https://arxiv.org/html/2605.20690#bib.bib22)\] \(agent\-computer interface scaffolding\), MLE\-bench \[[11](https://arxiv.org/html/2605.20690#bib.bib23)\] \(ML engineering tasks\), and DS\-1000 \[[30](https://arxiv.org/html/2605.20690#bib.bib24)\] \(data\-science code\)\. All measure competence inside a single repository, notebook, or task; none tests whether a declared multi\-system backend can boot and stay healthy end\-to\-end under typed failure attribution, which is what our T1 evaluates\.

#### Federated query and HTAP engines\.

Federated query engines such as Presto/Trino \[[45](https://arxiv.org/html/2605.20690#bib.bib6)\] and Spark SQL \[[7](https://arxiv.org/html/2605.20690#bib.bib7)\] extend the polystore idea to execution over pre\-existing stores, and HTAP / “NewSQL” systems \(HyPer \[[26](https://arxiv.org/html/2605.20690#bib.bib8)\], SAP HANA \[[23](https://arxiv.org/html/2605.20690#bib.bib11)\], Google F1 \[[48](https://arxiv.org/html/2605.20690#bib.bib9)\] over Spanner \[[15](https://arxiv.org/html/2605.20690#bib.bib10)\]\) consolidate operational and analytical workloads within one product\. In DDS these engines appear as candidate L3 products at individual operators; the framework chooses among them under a typed intent and skill contracts rather than assuming a single target\.

#### Self\-driving databases beyond one\-knob tuning\.

Peloton \[[42](https://arxiv.org/html/2605.20690#bib.bib15)\] and CDBTune \[[55](https://arxiv.org/html/2605.20690#bib.bib16)\] extend the self\-tuning line beyond OtterTune’s config search to physical design and deep\-RL knob policies, still within one product\. DDS differs in two ways: the attribution loop produces *cross\-system* patches \(e\.g\., a Kafka\-side retention change driven by a ClickHouse\-side signal\), and the unit of learning is an editable agent skill reviewed like code rather than a black\-box policy\.

#### Text\-to\-SQL and declarative foundations\.

Text\-to\-SQL benchmarks and systems such as Spider \[[54](https://arxiv.org/html/2605.20690#bib.bib27)\], BIRD \[[33](https://arxiv.org/html/2605.20690#bib.bib28)\], and DIN\-SQL \[[43](https://arxiv.org/html/2605.20690#bib.bib29)\] translate natural\-language queries over an *assumed* schema\. DDS addresses the upstream problem of composing a schema and backend that can serve the intent in the first place, and a text\-to\-SQL pipeline could itself be the product at a `SERVE` node\. The declarative\-what, imperative\-how separation \[[14](https://arxiv.org/html/2605.20690#bib.bib1),[10](https://arxiv.org/html/2605.20690#bib.bib2)\] is the conceptual ancestor of L1 and L2, which DDS lifts from within\-product queries to across\-product composition\.

#### AI\-driven systems research and accountability\.

Inefficiencies of Meta Agents \[[21](https://arxiv.org/html/2605.20690#bib.bib41)\] argues against fully\-automated meta\-agent design loops on cost and behavioral\-diversity grounds, supporting our position that human\-authored declarative skill artifacts are the right unit of composition knowledge\. AI\-Driven Research for Systems \(ADRS\), introduced by “Barbarians at the Gate” \[[13](https://arxiv.org/html/2605.20690#bib.bib43)\] and extended by “Let the Barbarians In” \[[12](https://arxiv.org/html/2605.20690#bib.bib44)\] to systems performance research, argues AI is upending systems research methodology by exploiting cheap reliable verifiers; DDS’s T0/T1/T2 acceptance gates are an instance of the same pattern at composition time, where the verifier is a runnable backend rather than a benchmark target\. Cost\-of\-Pass \[[22](https://arxiv.org/html/2605.20690#bib.bib42)\] formalizes accuracy–cost tradeoffs in LM evaluation; our cost\-and\-turn measurements in Table [3](https://arxiv.org/html/2605.20690#S4.T3) adopt the same accountability stance\.

#### Persistent, editable skill libraries for LM agents \(foundations\)\.

The idea of replacing ephemeral in\-context learning with a persistent, editable skill library has roots that predate DDS\. Voyager \[[52](https://arxiv.org/html/2605.20690#bib.bib37)\] introduces a lifelong\-learning embodied agent in Minecraft whose skill library accumulates discovered behaviors as reusable code that the agent re\-uses on later tasks\. Subsequent work on *agent skills* \[[4](https://arxiv.org/html/2605.20690#bib.bib38)\] casts the same idea as an OS\-level facility: a structured, versioned bundle that an agent loads on demand and edits over time\. DDS’s L3 agent\-skill artifact \(§[3](https://arxiv.org/html/2605.20690#S3)\) is in this lineage; the contribution is not the editable\-library idea itself but its application as the unit of *composition knowledge* for multi\-system data backends, with the framework shape \(typed L1–L4 contracts and L4 attribution\) providing the channels by which skill edits enter and survive across deployments\.
