Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation

arXiv cs.CL Papers

Summary

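SDSR embeds lightweight, self-describing navigational metadata in structured data files and pairs it with prompt-level routing rules (dual-layer guidance) to exploit LLM primacy bias, reaching 100% primary routing accuracy (20/20) at 119 categories without a vector database.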

arXiv:2604.19777v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit a well-documented positional bias when processing long input contexts: information in the middle of a context window receives substantially less attention than content at the boundaries, a phenomenon termed the Lost-in-the-Middle effect (Liu et al., 2024). This limits knowledge-retrieval applications that embed large structured knowledge bases directly in the LLM context. Retrieval-Augmented Generation (RAG) addresses scalability by retrieving only relevant fragments, but introduces substantial infrastructure overhead and is ill-suited to libraries whose semantic boundaries are human-defined rather than statistically learned. We propose Self-Describing Structured Retrieval (SDSR), a lightweight framework in which structured data files embed human-authored navigational metadata at the file's primacy position, thereby exploiting rather than fighting the LLM's primacy bias. We further propose a Dual-Layer Guidance strategy combining in-file metadata with explicit routing rules in the system prompt. We validate SDSR through a four-round benchmark using a 190-skill library expanded from 36 to 119 categories via adversarial distractor injection. Four conditions are tested: (A) no guidance, (B) in-file summary only, (C) prompt hint only, (D) both combined. Version D achieves 100% primary routing accuracy (20/20) at 119 categories versus 65% for the no-guidance baseline. We identify a fundamental asymmetry: primary routing is solvable by explicit rules, while secondary cross-category routing requires architectural intent explicitly encoded in the data structure. We further extend SDSR to semi-structured corpora, showing how cross-reference encoding enables operation without vector databases in domains with recoverable document structure.

# Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation
Source: [https://arxiv.org/html/2604.19777](https://arxiv.org/html/2604.19777)
\(March 2026\)

###### Abstract

Large Language Models \(LLMs\) exhibit a well\-documented positional bias when processing long input contexts: information placed in the middle of a large context window receives substantially less attention than information at the beginning or end, a phenomenon termed the *Lost\-in\-the\-Middle* effect \[[10](https://arxiv.org/html/2604.19777#bib.bib1)\]\. This limitation poses a significant challenge for knowledge\-retrieval applications that embed large structured knowledge bases directly in the LLM context\. Retrieval\-Augmented Generation \(RAG\) addresses scalability by retrieving only relevant fragments, but introduces substantial infrastructure overhead and is ill\-suited to structured knowledge libraries whose semantic boundaries are human\-defined rather than statistically learned\.

We propose Self\-Describing Structured Retrieval \(SDSR\), a lightweight framework in which structured data files embed their own human\-authored navigational metadata \(a \_summary block with a category\_index and per\-category routing\_hint fields\) at the file’s primacy position, thereby exploiting the LLM’s primacy bias rather than fighting it\. We further identify that in\-file guidance alone is insufficient at large scale, and propose a complementary Dual\-Layer Guidance strategy that combines in\-file metadata with explicit abstract\-level routing rules in the system prompt\.

We validate this framework through a controlled four\-round benchmark experiment using a 190\-skill knowledge library that we systematically expand from 36 to 119 categories by injecting semantically adversarial distractor categories\. Four experimental conditions are tested: \(A\) no guidance, \(B\) in\-file summary only, \(C\) prompt hint only, and \(D\) both combined\. Version D achieves 100% primary routing accuracy \(20/20\) at the 119\-category scale, compared to 65% for the no\-guidance baseline\. Our analysis reveals a fundamental asymmetry between primary routing \(solvable by explicit rules\) and secondary cross\-category routing \(requiring implicit architectural intent encoded in data structure\), with implications for LLM knowledge base design\. We further extend SDSR to a general architecture for high\-precision semantic retrieval over large semi\-structured corpora, demonstrating how a one\-time structuring pass with cross\-reference encoding enables SDSR to operate on domains with recoverable document structure, such as legal judgments, without vector databases or embedding infrastructure\.

Keywords: prompt engineering, long\-context LLM, structured knowledge retrieval, retrieval\-augmented generation, positional bias, knowledge navigation

## 1 Introduction

The ability of Large Language Models to process long input contexts has expanded dramatically, with contemporary models supporting context windows ranging from 128K to over one million tokens\. However, empirical studies consistently demonstrate that effective utilisation degrades well before the stated limit: models do not uniformly attend to their full input, and performance on retrieval\-dependent tasks can drop by over 30% when relevant information shifts from boundary positions to the middle of the context \[[10](https://arxiv.org/html/2604.19777#bib.bib1)\]\. This degradation reflects a structural property of transformer attention: models attend disproportionately to tokens near the beginning and end of their input, with information in the middle receiving systematically weaker signal\. Liu et al\. \[[10](https://arxiv.org/html/2604.19777#bib.bib1)\] formalised this as the *Lost\-in\-the\-Middle* effect, demonstrating a U\-shaped performance curve across multiple LLMs and tasks\.

This positional bias creates a practical dilemma for knowledge\-intensive applications\. Consider a practitioner who maintains a structured library of domain\-specific skill specifications—a knowledge base that has grown to hundreds of categorised entries, each with precise semantic boundaries defined through expert curation\. Embedding the entire library in a single LLM context should, in theory, enable the model to navigate the full space of available knowledge\. In practice, the model’s attention degrades as library size grows, and carefully designed distinctions between adjacent categories become invisible to it\.

The dominant engineering response to this problem is Retrieval\-Augmented Generation \(RAG\)\[[9](https://arxiv.org/html/2604.19777#bib.bib13)\]: rather than providing the full knowledge base, a retrieval system selects semantically relevant fragments and injects only those into the context\. RAG is highly effective for unstructured text corpora but carries substantial costs: it requires chunking strategies that may sever semantic units, embedding models that encode surface similarity rather than human\-defined boundaries, and vector databases that must be maintained as the knowledge base evolves\. For knowledge bases whose structure is explicitly human\-designed, these costs are unnecessary—the semantic boundaries are already known\.

In this paper, we ask: *can a structured knowledge file be designed to guide its own reading by an LLM, exploiting rather than fighting the positional bias?*

We answer this question through a series of controlled experiments motivated by an empirical observation: when navigational instructions \(a summary index with per\-category routing hints\) are embedded at the beginning of a structured JSON knowledge file, the LLM’s routing accuracy improves—not because the instructions are in the system prompt, but because they occupy the high\-primacy position within the data object being processed\. This observation suggests that the LLM treats the data file as a distinct cognitive object from the system prompt, and that instructions embedded within that object benefit from a different attentional regime\.

#### Contributions\.

This paper makes the following contributions:

1. We introduce the Self\-Describing Structured Retrieval \(SDSR\) framework, in which structured data files embed human\-authored navigational metadata at their primacy position to guide LLM navigation without external retrieval infrastructure\.
2. We conduct a systematic four\-round benchmark experiment with controlled category\-count scaling \(36 → 60 → 119 categories\) and adversarial distractor injection, comparing four guidance conditions across all scales\.
3. We identify and formalise the Dual\-Layer Guidance principle: in\-file primacy metadata handles structural navigation while prompt\-level abstract routing rules handle abstraction\-level disambiguation; the two layers are complementary and non\-redundant\.
4. We characterise the primary/secondary routing asymmetry: primary category selection is solvable by explicit guidance rules, while secondary cross\-category pairing requires implicit architectural intent that must be explicitly encoded in data structure fields \(e\.g\., a complement field\)\.
5. We extend SDSR to a general architecture for high\-precision retrieval over large structured corpora, including a two\-tier Python\-controlled reading pipeline applicable to domains such as legal document retrieval\.

## 2 Related Work

### 2\.1 The Lost\-in\-the\-Middle Effect

Liu et al\. \[[10](https://arxiv.org/html/2604.19777#bib.bib1)\] conducted controlled experiments measuring LLM performance on multi\-document question answering and key\-value retrieval as a function of the position of relevant information within the input context\. They found a characteristic U\-shaped performance curve: accuracy is highest when relevant information appears at the very beginning \(primacy bias\) or end \(recency bias\) of the context, and degrades significantly for information in the middle\. This effect persists across both open\-source models \(MPT\-30B, LongChat\-13B\) and closed models \(GPT\-3\.5\-Turbo, Claude\-1\.3\), and does not disappear for models explicitly designed for long\-context processing\.

The authors connect this finding to the *serial\-position effect* from cognitive psychology \[[3](https://arxiv.org/html/2604.19777#bib.bib5)\], observing that the U\-shaped curve mirrors the recall patterns observed in human free\-association memory tasks\. Importantly for our work, they also find that *query\-aware contextualization* \(placing the query both before and after the document list\) improves performance on synthetic retrieval tasks, providing preliminary evidence that structural positioning of navigational content can modulate attention allocation\.

Subsequent work has extended this analysis to longer contexts and more complex tasks \[[8](https://arxiv.org/html/2604.19777#bib.bib9), [5](https://arxiv.org/html/2604.19777#bib.bib10)\], consistently confirming that even models with context windows of 128K tokens do not uniformly attend to their full input\. From a practical standpoint, He et al\. \[[8](https://arxiv.org/html/2604.19777#bib.bib9)\] show that reranking retrieved documents to place the most relevant content at the beginning or end of the context, a strategy they term *position engineering*, can yield substantial accuracy improvements at zero additional inference cost\.
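The reordering idea described above can be illustrated with a minimal sketch. The function name `place_at_boundaries` and the alternating front/back scheme are assumptions chosen for illustration, not the procedure of the cited work; the input is assumed to be a list already sorted from most to least relevant.

```python
def place_at_boundaries(ranked_docs):
    """Reorder documents so the most relevant sit at the start and end.

    `ranked_docs` is assumed to be sorted from most to least relevant.
    Items are dealt alternately to the front and the back, so the least
    relevant documents end up in the middle of the context.
    """
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # `back` was filled in descending relevance; reverse it so the
    # second-most relevant document becomes the very last item.
    return front + back[::-1]


docs = ["d1", "d2", "d3", "d4", "d5"]  # d1 = most relevant (toy labels)
reordered = place_at_boundaries(docs)
print(reordered)
```

With five toy documents, the weakest one (`d5`) lands in the centre while `d1` and `d2` occupy the first and last positions.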

Two recent studies provide deeper theoretical grounding for the positional biases that SDSR exploits\. Wu et al\. \[[16](https://arxiv.org/html/2604.19777#bib.bib2)\] present a graph\-theoretic framework showing that primacy bias arises from two architectural sources: causal attention masking, which causes tokens in deeper layers to attend to increasingly contextualised representations of earlier positions, and the interaction between causal masking and relative positional encodings \(notably RoPE\), which produces a trade\-off between long\-term decay and the cumulative importance of early sequence positions\. Their Theorem 4\.1 establishes that primacy bias is a mathematical consequence of causal masking in multi\-layer attention, not merely an empirical regularity\. This provides an architectural guarantee for SDSR’s strategy of placing navigational metadata at the file’s primacy position\.

Complementing this architectural analysis, Salvatore et al\. \[[12](https://arxiv.org/html/2604.19777#bib.bib3)\] argue that the U\-shaped performance curve is not a defect but an *emergent adaptation* to competing information\-retrieval demands in pre\-training data: tasks requiring uniform recall across the full input \(long\-term memory demand\) interact with causal masking to produce primacy effects, while tasks prioritising recent information produce recency effects\. Crucially, they demonstrate that the primacy effect emerges in autoregressive architectures \(GPT\-2, Llama, RNNs\) but not in bidirectional encoder\-decoders \(T5\), suggesting that SDSR’s primacy\-exploitation strategy is robust for the current generation of decoder\-only LLMs but may require reassessment if future dominant architectures adopt bidirectional attention\.

### 2\.2 Retrieval\-Augmented Generation

Retrieval\-Augmented Generation was introduced by Lewis et al\. \[[9](https://arxiv.org/html/2604.19777#bib.bib13)\] as a framework for grounding LLM generation in external knowledge bases without retraining\. The canonical architecture retrieves the top\-$k$ document chunks from a dense vector index using a learned query encoder, then concatenates these chunks with the user query as context for a generative model\.
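The retrieve-then-concatenate pattern can be sketched in a few lines. This is a toy illustration only: a bag-of-words count vector stands in for the learned dense encoder, and the names `embed`, `cosine`, and `retrieve_top_k` as well as the sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "agents pass context through a handoff protocol",
    "fungal networks share nutrients between trees",
    "audit systems verify axiomatic logic",
]
query = "how do agents pass context"
# Retrieved chunks are concatenated with the query to form the model input.
context = "\n".join(retrieve_top_k(query, chunks, k=1) + [query])
print(context)
```

Because ranking here depends purely on token overlap, the sketch also shows why surface similarity can diverge from human-defined category boundaries.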

RAG has since been extended in multiple directions\. Yu et al\. \[[17](https://arxiv.org/html/2604.19777#bib.bib14)\] unify context ranking with generation by fine\-tuning a single model to simultaneously rank retrieved passages and generate answers\. Graph\-augmented RAG \(GraphRAG\) \[[4](https://arxiv.org/html/2604.19777#bib.bib15), [6](https://arxiv.org/html/2604.19777#bib.bib16)\] addresses the loss of inter\-document relationships inherent in flat chunk retrieval by constructing knowledge graphs over the document corpus and retrieving over graph neighbourhoods rather than independent chunks\. Hierarchical RAG \[[2](https://arxiv.org/html/2604.19777#bib.bib17)\] preserves document structure through multi\-level retrieval, partially addressing the concern that fixed\-size chunking severs semantic units\.

Despite these advances, RAG architectures share a common limitation for our target domain: they are optimised for *statistical* semantic similarity, encoding the surface\-level co\-occurrence patterns of their training corpus\. When a knowledge base has explicit, human\-defined semantic boundaries that may not correspond to surface co\-occurrence \(as is the case for expert\-curated skill libraries, legal taxonomies, or medical ontologies\), vector similarity retrieval may systematically prefer the wrong fragment\. Packowski et al\. \[[11](https://arxiv.org/html/2604.19777#bib.bib18)\] report from their experience building enterprise\-scale RAG solutions that achieving stable and accurate results for complex queries requires substantial content engineering effort, including iterative optimisation of the knowledge base content itself rather than relying solely on retrieval\-side tuning\.

### 2\.3 Context Engineering and Prompt Positioning

A growing body of work studies how the structural properties of the input context affect LLM behaviour beyond simple retrieval accuracy\. Beyond reranking, He et al\. \[[8](https://arxiv.org/html/2604.19777#bib.bib9)\] further show that directly manipulating positional indices, without altering the prompt text itself, can improve RAG and in\-context learning performance substantially, demonstrating that positional information is a first\-class design variable\. Guo and Vosoughi \[[7](https://arxiv.org/html/2604.19777#bib.bib4)\] demonstrate that LLMs exhibit serial position effects \(primacy and recency biases\) analogous to those documented in human free\-recall experiments, arguing that prompt design should account for these biases rather than assuming uniform attention\.

Meta\-prompting \[[14](https://arxiv.org/html/2604.19777#bib.bib11)\] explores using one LLM to generate structured prompts for another, demonstrating that explicit structural framing can improve downstream task accuracy by 15–17%\. The Prompt Report \[[13](https://arxiv.org/html/2604.19777#bib.bib12)\] surveys over 50 distinct prompting techniques, noting that techniques which provide explicit structural scaffolding consistently outperform those relying on the model’s implicit knowledge organisation\.

### 2\.4 Knowledge Base Design for LLM Consumption

Relatively little work addresses the question of how structured knowledge bases should be *designed* to be navigated effectively by LLMs\. Packowski et al\. \[[11](https://arxiv.org/html/2604.19777#bib.bib18)\] study knowledge base content design for enterprise RAG, focusing on the format of text conversions from PDF and HTML, but do not address inline navigational metadata\.

To the best of our knowledge, no prior work has proposed embedding human\-authored navigational metadata directly within structured data files as a mechanism for exploiting positional bias during LLM knowledge navigation\. The present work fills this gap\.

## 3 Problem Formulation

### 3\.1 Structured Knowledge Library Navigation

###### Definition 1 \(Structured Knowledge Library\)\.

A *structured knowledge library* $\mathcal{L}$ is a collection of $N$ categories $\{C_1, C_2, \ldots, C_N\}$, where each category $C_i$ contains a name $n_i$, a description $d_i$, and a set of skill entries $S_i = \{s_{i,1}, \ldots, s_{i,k_i}\}$\. Each skill entry $s_{i,j}$ has a name and optionally a description\.

###### Definition 2 \(Routing Task\)\.

Given a task description $q$ and a library $\mathcal{L}$, the *primary routing task* $R_1(q, \mathcal{L})$ requires selecting the category $C^{*} \in \mathcal{L}$ that best addresses $q$\. The *secondary routing task* $R_2(q, \mathcal{L}, C^{*})$ requires selecting a complementary category $C^{**} \neq C^{*}$ that addresses a distinct architectural dimension of $q$\.

###### Definition 3 \(Distractor Category\)\.

A *distractor category* $C^{D}$ is a category whose name and description are semantically proximate to a target category $C^{*}$ in surface keyword space but occupy a different abstraction level \(mechanism vs\. governance, component vs\. pipeline\)\. Formally, $\text{sim}_{\text{surface}}(C^{D}, C^{*}) > \theta$ while $\text{level}(C^{D}) \neq \text{level}(C^{*})$\.
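Definition 3 can be made concrete with a small sketch. The paper does not specify how surface similarity or abstraction level is measured, so everything here is an assumption: Jaccard overlap of word sets stands in for the surface similarity, `level` is a hand-assigned label, the threshold value is arbitrary, and the category texts are invented for illustration.

```python
import re


def tokens(text):
    """Split a name or description into a set of lowercase words."""
    return {t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t}


def surface_similarity(a, b):
    """Jaccard overlap of word sets; an assumed stand-in for sim_surface."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def is_distractor(candidate, target, theta=0.2):
    """True if candidate is surface-close to target but at another level.

    Both arguments are dicts with a 'text' field (name plus description)
    and a hand-assigned 'level' label such as 'mechanism' or 'governance'.
    """
    close = surface_similarity(candidate["text"], target["text"]) > theta
    return close and candidate["level"] != target["level"]


# Hypothetical categories, not taken from the library itself.
target = {"text": "context passing between agents orchestration", "level": "governance"}
near = {"text": "context passing between agents handoff", "level": "mechanism"}
far = {"text": "fungal network design", "level": "mechanism"}

print(is_distractor(near, target), is_distractor(far, target))
```

Under these assumptions only the surface-close candidate at a different level qualifies; an unrelated category fails the similarity test, and a surface-close category at the same level would fail the level test.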

### 3\.2 Research Hypotheses

We investigate three nested hypotheses about the effectiveness of different guidance strategies:

###### Hypothesis 1 \(Self\-Describing Data\)\.

Embedding a human\-authored \_summary block with routing\_hint fields at the beginning of a structured data file improves LLM primary routing accuracy compared to providing no guidance, particularly as library size increases\.

###### Hypothesis 2 \(Prompt Guidance\)\.

Providing explicit routing rules in the system prompt—including explicit category naming and abstract\-level selection rules—improves primary routing accuracy compared to providing no guidance\.

###### Hypothesis 3 \(Dual\-Layer Synergy\)\.

Combining in\-file metadata \(Hypothesis [1](https://arxiv.org/html/2604.19777#Thmhypothesis1)\) with prompt\-level routing rules \(Hypothesis [2](https://arxiv.org/html/2604.19777#Thmhypothesis2)\) produces a synergistic improvement in primary routing accuracy that exceeds either strategy alone, because the two layers serve complementary functions: in\-file metadata provides structural navigation at the data’s primacy position, while prompt rules provide abstraction\-level disambiguation\.

### 3\.3 Scoring Protocol

For each task $q_t$ \($t = 1, \ldots, 20$\), an LLM response is scored as:

$$
\text{score}(t) = \begin{cases} 1.0 & \text{if only the primary category is correct} \\ 1.5 & \text{if both primary and secondary categories are correct} \\ 0.0 & \text{otherwise} \end{cases} \tag{1}
$$

The maximum achievable score is $20 \times 1.0 + 17 \times 0.5 = 28.5$ points, where 17 is the number of questions with a designated secondary category in the answer key \(so secondary hits contribute at most 8\.5 points\)\.

Primary accuracy is reported as $\text{PA} = \frac{|\{t : C^{*}_{t} \text{ correct}\}|}{20}$ and secondary hit rate as $\text{SHR} = \frac{|\{t : C^{**}_{t} \text{ correct}\}|}{|\{t : C^{**}_{t} \text{ defined}\}|}$\.
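The scoring protocol can be expressed as a short sketch. The function names and the tuple layout are assumptions; the figure of 17 questions carrying a secondary category follows the 17-question denominator used in the reported hit rates, and which questions those are is arbitrary here.

```python
def score_response(primary_ok, secondary_ok):
    """Score one question under Eq. (1)."""
    if primary_ok and secondary_ok:
        return 1.5
    if primary_ok:
        return 1.0
    return 0.0


def summarise(results):
    """Return (total score, PA, SHR).

    `results` is a list of (primary_ok, secondary_ok, has_secondary)
    tuples, one per question.
    """
    total = sum(score_response(p, s) for p, s, _ in results)
    pa = sum(p for p, _, _ in results) / len(results)
    defined = [s for _, s, has in results if has]
    shr = sum(defined) / len(defined) if defined else 0.0
    return total, pa, shr


# Round 1, Version A as reported: 20/20 primary, 2 of 17 secondary hits.
results = [(True, i < 2, True) for i in range(17)] + [(True, False, False)] * 3
total, pa, shr = summarise(results)
print(total, pa, round(shr, 3))

# A perfect run reaches the stated maximum of 28.5 points.
perfect = [(True, True, True)] * 17 + [(True, False, False)] * 3
max_total, _, _ = summarise(perfect)
```

Reproducing the Round 1, Version A figures this way gives a total of 21.0, matching the reported score.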

## 4 Methodology

### 4\.1 Knowledge Library: High\-Impact Skills Library

Our experiments use the *High\-Impact Skills Library* \(HISL\), a practitioner\-developed prompt\-engineering knowledge base comprising 36 human\-curated categories and 190 skill entries\. Each category represents a distinct capability domain \(e\.g\., Cognitive\_Architecture\_&\_Routing, Axiomatic\_Logic\_&\_Audit\_Systems, Academic\_Research\_Synthesis\_Pipeline\), with an expert\-authored category\_description that defines the category’s scope and distinguishes it from adjacent categories\. Each skill entry contains a skill\_name and a one\-to\-two sentence description specifying its function within the category\.

The library is serialised as a single JSON file\. The *Version B* \(in\-file summary\) variant prepends a \_summary block before the main High\_Impact\_Skills\_Library object, containing: \(i\) a category\_index listing all categories with their skill counts and a 100\-character routing\_hint excerpted from each category description; \(ii\) \_llm\_instructions specifying how to use the index for two\-stage navigation; and \(iii\) routing\_roles mapping meta\-functions \(cognitive anchor, universal fallback, domain\-specific\) to responsible categories\.
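A minimal sketch of how such a file might be assembled follows. The top-level field names come from the description above; the inner layout (the `category`, `skill_count`, and `skills` keys), the instruction wording, and the sample category are assumptions made for illustration, since the paper's actual file is not reproduced here.

```python
import json


def build_summary(library, hint_len=100):
    """Build a `_summary` block from a {category_name: {...}} mapping."""
    index = [
        {
            "category": name,
            "skill_count": len(cat["skills"]),
            # Hint is the first `hint_len` characters of the description.
            "routing_hint": cat["category_description"][:hint_len],
        }
        for name, cat in library.items()
    ]
    return {
        # Placeholder wording; the real instructions are not given.
        "_llm_instructions": "Scan category_index first, then open the chosen category.",
        "category_index": index,
        "routing_roles": {},  # meta-function -> category mapping, left empty here
    }


def to_self_describing_json(library):
    # Python dicts preserve insertion order, so `_summary` is serialised
    # first and therefore sits at the file's primacy position.
    doc = {"_summary": build_summary(library), "High_Impact_Skills_Library": library}
    return json.dumps(doc, indent=2)


library = {
    "Example_Category": {  # hypothetical category
        "category_description": "Hypothetical description used only for this sketch.",
        "skills": [
            {"skill_name": "skill_a", "description": "First placeholder skill."},
            {"skill_name": "skill_b", "description": "Second placeholder skill."},
        ],
    }
}
text = to_self_describing_json(library)
print(text[:200])
```

The design point is ordering: the summary must be emitted before the library body, which here relies on dictionary insertion order being preserved during serialisation.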

### 4\.2 Experimental Conditions

We define four experimental conditions that vary the type and placement of navigational guidance:

Version A \(No Guidance\)\. The bare JSON library is provided without any \_summary block\. The system prompt contains only a minimal professional framing with no structural instructions\. This condition establishes the no\-guidance baseline\.

Version B \(In\-File Summary Only\)\. The JSON library includes the full \_summary block with category\_index and routing\_hint fields at the file’s primacy position\. The system prompt is identical to Version A\.

Version C \(Prompt Hint Only\)\. The bare JSON library \(identical to Version A\) is provided\. The system prompt contains explicit structural guidance: a list of key categories with their roles, an abstraction\-level priority rule \(*“prefer broader pipeline/governance categories over narrower mechanism categories when both seem relevant”*\), and instructions to use the category\_description fields to confirm relevance before selecting\.

Version D \(Dual\-Layer Guidance\)\. Combines the in\-file \_summary block of Version B with the prompt\-level routing rules of Version C\. The system prompt additionally names the seven highest\-priority categories that are most vulnerable to distractor confusion \(Table [1](https://arxiv.org/html/2604.19777#S4.T1)\)\.
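The four conditions form a two-by-two design over the two guidance layers, which can be sketched as follows. The class and constant names are hypothetical, `BASE_PROMPT` is placeholder wording, and only the abstraction-level rule quoted above is modelled; the category list that Version D adds to its prompt is omitted.

```python
from dataclasses import dataclass

BASE_PROMPT = "You are a professional assistant."  # placeholder framing
ROUTING_RULES = (
    "Prefer broader pipeline/governance categories over narrower "
    "mechanism categories when both seem relevant."
)


@dataclass(frozen=True)
class Condition:
    name: str
    in_file_summary: bool  # is the `_summary` block included in the JSON?
    prompt_rules: bool     # are routing rules added to the system prompt?

    def system_prompt(self):
        """Return the system prompt text for this condition."""
        if self.prompt_rules:
            return f"{BASE_PROMPT}\n{ROUTING_RULES}"
        return BASE_PROMPT


CONDITIONS = [
    Condition("A", in_file_summary=False, prompt_rules=False),
    Condition("B", in_file_summary=True, prompt_rules=False),
    Condition("C", in_file_summary=False, prompt_rules=True),
    Condition("D", in_file_summary=True, prompt_rules=True),
]

for c in CONDITIONS:
    print(c.name, c.in_file_summary, c.prompt_rules)
```

Laying the conditions out this way makes explicit that A and B share one system prompt, while A and C share one data file.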

Table 1: High\-priority categories explicitly named in the Version D system prompt, with their commonly confused distractor categories\.

### 4\.3 Test Question Design

We design 20 task\-description questions $\{q_1, \ldots, q_{20}\}$ to test routing accuracy across the full range of HISL categories\. A central design principle is *keyword avoidance*: no question contains any substring of its target category’s name, forcing the model to route by semantic understanding rather than keyword matching\.
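A keyword-avoidance check of this kind could be automated along the following lines. This is a sketch under assumptions: the rule is interpreted at word level (each word of the category name, ignoring words shorter than four characters), and the two sample questions are invented rather than drawn from the question set.

```python
import re


def name_keywords(category_name, min_len=4):
    """Words from a category name, ignoring short connectives."""
    words = re.split(r"[^A-Za-z0-9]+", category_name.lower())
    return {w for w in words if len(w) >= min_len}


def leaked_keywords(question, category_name):
    """Return the category-name words that appear inside the question text."""
    text = question.lower()
    return {w for w in name_keywords(category_name) if w in text}


target = "Academic_Research_Synthesis_Pipeline"
clean = "Combine findings from many scholarly papers into one literature review"
leaky = "Build a pipeline that summarises research papers"

print(leaked_keywords(clean, target))  # empty set: question passes
print(leaked_keywords(leaky, target))  # non-empty set: question must be reworded
```

A question passes when the returned set is empty; any returned word indicates a term that would allow routing by keyword matching.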

Questions are stratified by expected discrimination power:

- High\-discrimination questions \(Q09, Q11, Q12, Q16, Q20\): Target categories are either rare \(1–2 skills\), positioned late in the file, or have names that are non\-obvious relative to the task description\. These are expected to show the largest A/B/C/D divergence\.
- Low\-discrimination questions \(Q01, Q05, Q07\): Target categories have highly intuitive names\. All conditions are expected to answer correctly; these establish the baseline and verify that guidance does not degrade already\-correct responses\.

Table [2](https://arxiv.org/html/2604.19777#S4.T2) shows representative questions\.

Table 2: Representative test questions \(keyword avoidance enforced\)\. Full question set in Appendix [A](https://arxiv.org/html/2604.19777#A1)\.

### 4\.4 Library Scale Expansion and Distractor Injection

To test hypotheses across varying levels of navigational challenge, we expand the library across three rounds:

Round 1: Original HISL, 36 categories, 190 skills\.

Round 2: 24 distractor categories injected, total 60 categories, 262 skills\.

Round 3: 60 additional distractor categories injected, total 119 categories, 380 skills\.

Distractor categories are designed in two interference tiers:

- High\-interference distractors \(30 categories in Round 3\): Names are semantically proximate to real target categories \(e\.g\., Agent\_Handoff\_Protocol\_Design near Distributed\_Cognition\_&\_Context\_Orchestration\)\. Each contains 2–3 skill entries with one\-sentence descriptions\.
- Low\-interference distractors \(30 categories in Round 3\): Names use domain analogies unrelated to any test question target \(e\.g\., Mycological\_Network\_Design, Trophic\_Cascade\_Analyzer\), providing pure volumetric pressure without semantic interference\.

Distractor categories are interleaved uniformly among real categories \(one distractor inserted after every real category\) to prevent positional clustering effects\. The answer key remains anchored to the original 36 real categories throughout all rounds\.
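One way to realise this interleaving is sketched below. The paper gives no insertion procedure beyond the description above, so the round-robin distribution (which also handles distractor counts that differ from the number of real categories) is an assumption, as is the function name.

```python
def interleave(real, distractors):
    """Spread distractors evenly by appending them after real categories.

    Each real category opens a slot; distractors are dealt to the slots
    in round-robin order, so no region of the file collects a cluster.
    """
    if not real:
        return list(distractors)
    slots = [[r] for r in real]
    for i, d in enumerate(distractors):
        slots[i % len(slots)].append(d)
    # Flatten the slots back into a single ordered category list.
    return [item for slot in slots for item in slot]


real = ["R1", "R2", "R3"]
print(interleave(real, ["D1", "D2", "D3"]))
print(interleave(real, ["D1", "D2", "D3", "D4"]))
```

When the counts match, the result alternates strictly between real and distractor categories; surplus distractors wrap around to the earliest slots.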

### 4\.5 Evaluation Protocol

For each condition $v \in \{A, B, C, D\}$ and round $r \in \{1, 2, 3\}$, we submit all 20 questions to the LLM in a single fresh conversation, preceded by the condition\-appropriate system prompt and file upload\. The model is instructed to respond in the format Q\#: category\_name \| skill\_name for up to two selections per question, withholding any chain\-of\-thought reasoning until all 20 answers are listed\. The answer key is provided in a second turn within the same conversation, and the model self\-scores its responses into a structured table\. All experiments use Claude Opus 4\.6 \(claude\-opus\-4\-6\), the highest\-capability model in the Claude 4\.6 family, accessed via the claude\.ai web interface with default inference parameters\. The use of the strongest available model is a deliberate methodological choice: any observed failures in secondary routing cannot be attributed to insufficient model capacity, and instead reflect a fundamental limitation in the availability of the required knowledge within the data structure\.
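Responses in this format are straightforward to parse mechanically, which offers an independent check on the model's self-scoring. The sketch below assumes one selection per line, with a question's second selection on a further line carrying the same question label; that layout, the function name, and the sample text are assumptions rather than details given above.

```python
import re

# Matches lines such as "Q12: Some_Category | some_skill".
LINE = re.compile(r"^Q(\d+):\s*([^|]+?)\s*\|\s*(.+?)\s*$")


def parse_answers(text):
    """Map question number -> list of (category, skill), at most two each."""
    answers = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip blank lines or commentary
        q, category, skill = int(m.group(1)), m.group(2), m.group(3)
        picks = answers.setdefault(q, [])
        if len(picks) < 2:  # first pick = primary, second = secondary
            picks.append((category, skill))
    return answers


sample = """Q12: Cat_A | skill_one
Q12: Cat_B | skill_two
some free-text note
Q13: Cat_C | skill_three"""
parsed = parse_answers(sample)
print(parsed)
```

The parsed dictionary can then be compared against the answer key question by question.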

Condition D is evaluated only in Round 3 \(the most challenging scale\), motivated by the unexpected finding from Round 3 that Version B $\approx$ Version A, which establishes that in\-file guidance alone is insufficient and a combined strategy is warranted\.

## 5 Experiments and Results

### 5\.1 Round 1: Baseline at Original Scale \(36 Categories\)

#### Results\.

All three conditions \(A, B, C\) achieve perfect primary accuracy \(PA = 20/20, 100%\) in Round 1, with the following secondary hit rates: A: 2/17 \(11\.8%\), B: 3/17 \(17\.6%\), C: 2/17 \(11\.8%\)\. Total scores: A = 21\.0/28\.5, B = 21\.5/28\.5, C = 21\.0/28\.5\.

#### Analysis\.

The perfect primary accuracy across all conditions indicates that at 36 categories, the navigational challenge is insufficient to expose attentional limits\. The HISL’s original category names are sufficiently distinctive that keyword\-level matching resolves all 20 primary routing decisions correctly regardless of guidance condition\. The marginal advantage of Version B \(\+0\.5 pts from one additional secondary hit in Q18\) is not statistically meaningful given the single\-run experimental design\.

This round establishes two important baselines: \(i\) the answer key is navigable in principle, ruling out task design flaws; and \(ii\) secondary routing is systematically more difficult than primary routing—a pattern that persists across all rounds\.

### 5\.2 Round 2: Medium Scale with Adversarial Distractors \(60 Categories\)

#### Results\.

Primary accuracy drops for all conditions relative to Round 1, with the no\-guidance baseline \(A\) falling to 15/20 \(75%\)\. Both B and C maintain higher accuracy \(B: 16/20 = 80%, C: 17/20 = 85%\)\. Secondary hit rates remain low across all conditions \(A: 2/17, B: 3/17, C: 2/17\)\. Total scores: A = 15\.0/20\.0 \(primary only\), B = 16\.5/28\.5, C = 16\.5/28\.5\.

#### Analysis\.

The first meaningful divergence between A and the guided conditions emerges in Round 2, supporting Hypotheses [1](https://arxiv.org/html/2604.19777#Thmhypothesis1) and [2](https://arxiv.org/html/2604.19777#Thmhypothesis2)\. However, B $\approx$ C \(tied at 16\.5/28\.5\), meaning that in\-file summary and prompt\-level guidance are interchangeable in their effectiveness at this scale\.

The five complete misses \(Q12, Q13, Q15, Q16, and the reduced Q19\) share a common failure mode: the distractor category name is a *surface\-literal match* to the task description, while the correct answer requires recognising a higher abstraction level\. For example, Q12 asks about “reliable context passing between agents” and the model consistently selects Agent\_Handoff\_Protocol\_Design \(mechanism level\) rather than Distributed\_Cognition\_&\_Context\_Orchestration \(governance level\)\.

### 5\.3 Round 3: Large Scale \(119 Categories\)

#### Results\.

At 119 categories, the gap between conditions widens substantially\. Version A falls to 13/20 \(65%\), Version B also falls to 13/20 \(65%\), while Version C holds at 16/20 \(80%\)\. The unexpected result is that B $=$ A: the in\-file summary provides *no benefit* over no guidance at this scale\.

Total scores: A = 13\.0, B = 13\.0, C = 16\.0 \(primary hits only, secondary = 0 for all conditions\)\.

#### Analysis\.

The equivalence of B and A at 119 categories reveals the scalability limit of the in\-file summary approach\. With 119 routing hints packed into the category\_index, each hint at 100 characters, the index itself becomes a dense block of text, subject to the same lost\-in\-the\-middle degradation it was designed to mitigate\. The routing hints are present and syntactically accessible, but the LLM’s ability to use them for disambiguation degrades as the hint count grows\.

Version C’s advantage over B reveals an important asymmetry: *natural\-language rules in the system prompt* maintain effectiveness at larger scales because they do not grow in length with the library size\. The Version C prompt names seven specific high\-risk categories and states one abstract selection rule; this fixed\-length, high\-salience instruction is processed in the system prompt’s primacy position, not within the data object\.
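A back-of-envelope calculation makes the scaling contrast concrete. The sketch assumes every routing hint is exactly 100 characters, as stated above, and counts hint text only; JSON keys, category names, and skill counts would add to these figures. The function name is hypothetical.

```python
HINT_CHARS = 100  # routing_hint length stated in Section 4.1


def index_hint_chars(n_categories, hint_chars=HINT_CHARS):
    """Hint text in category_index, ignoring JSON keys and category names."""
    return n_categories * hint_chars


# Library sizes of Rounds 1, 2, and 3.
sizes = {n: index_hint_chars(n) for n in (36, 60, 119)}
print(sizes)
```

The hint text alone grows from 3,600 characters at 36 categories to 11,900 at 119, whereas the prompt-level rules stay the same length regardless of library size.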

Table [3](https://arxiv.org/html/2604.19777#S5.T3) summarises primary accuracy across all rounds and conditions\.

Table 3: Primary routing accuracy \(out of 20\) across all experimental conditions and rounds\. Version D was evaluated in Round 3 only\.

Figure 1: Primary routing accuracy as a function of library scale for all four experimental conditions\. Version D achieves perfect accuracy at 119 categories\. Versions A and B converge at Round 3, demonstrating the scalability limit of in\-file routing hints alone\. \(Plot not reproduced; x\-axis: library scale in number of categories, at 36 \(R1\), 60 \(R2\), and 119 \(R3\); y\-axis: primary accuracy out of 20; series: A: No guidance, B: In\-file summary, C: Prompt hint, D: Dual\-layer \(R3 only\), annotated D = 20/20\.\)

### 5\.4 Round 3, Version D: Dual\-Layer Guidance

#### Motivation\.

The unexpected equivalence of A and B in Round 3, combined with C’s partial recovery, motivates the design of Version D: if prompt\-level rules are effective but incomplete, and in\-file metadata is complete but loses effectiveness at scale, combining them should allow the prompt rules to handle the cases that in\-file metadata fails on, while in\-file metadata provides structural scaffolding for the cases where the prompt rules do not name a specific category\.

#### Results\.

Version D achieves PA = 20/20 \(100%\) in Round 3, recovering all seven questions that Version C still misses\. Secondary hit rate = 0/17 \(0%\)\. Total score = 20\.0/28\.5\.
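The totals reported here are consistent with one point per primary hit plus half a point per secondary hit \(20 \+ 0\.5 × 17 = 28\.5\)\. The following is a minimal Python sketch of that rule; the 0\.5 weight is inferred from the stated maxima rather than quoted from a scoring definition, and `total_score` is a hypothetical helper name\.

```python
def total_score(primary_hits: int, secondary_hits: int,
                secondary_weight: float = 0.5) -> float:
    """Combine primary and secondary hits into a single score.

    The 0.5 weight is an assumption inferred from the stated maximum
    (20 primary hits + 17 secondary hits -> 28.5).
    """
    return primary_hits + secondary_weight * secondary_hits


MAX_SCORE = total_score(20, 17)   # all primary and all secondary targets hit
version_d = total_score(20, 0)    # Version D, Round 3: 20 primary, 0 secondary
print(f"Version D: {version_d}/{MAX_SCORE}")  # Version D: 20.0/28.5
```

Under this assumed weighting, a condition can reach at most 20\.0 without any secondary hits, which is why perfect primary routing still leaves 8\.5 points unclaimed\.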

The seven recovered misses \(Q02, Q12, Q13, Q14, Q15, Q16, Q18/Q19\) all correspond to the abstraction\-level confusion pattern: in each case, the Version D prompt explicitly names the target category and states that broader pipeline/governance categories should be preferred over narrower mechanism categories\.

#### Analysis\.

The perfect primary accuracy of Version D confirms Hypothesis [3](https://arxiv.org/html/2604.19777#Thmhypothesis3): the two guidance layers are genuinely complementary\. Table [4](https://arxiv.org/html/2604.19777#S5.T4) maps each question’s recovery status across conditions, showing which layer is responsible for each recovery\.

Table 4: Question\-level miss recovery across conditions in Round 3\. ✓ = correct, × = incorrect\. “Recovery layer” identifies which element of Version D resolves the miss\.

### 5\.5 Secondary Routing: A Persistent Failure

Across all four conditions and all three rounds, secondary hit rates remain consistently low \(maximum 3/17 = 17\.6%, in Version B Round 2\)\. Version D achieves 0/17 secondary hits despite perfect primary accuracy\.

Inspection of secondary routing errors reveals a consistent pattern: the model selects a *functionally adjacent* category within the same semantic neighbourhood rather than an *architecturally complementary* category at a different abstraction level\. For example:

- Q05 \(primary: Persona\_&\_Narrative\_Synthesis\): secondary answer key expects Interactive\_Narrative\_&\_Creative\_Fiction\_Engine \(execution layer\); model consistently selects Character\_Voice\_Synthesis \(same semantic neighbourhood\)\.
- Q17 \(primary: Self\_Evolution\_&\_Refinement\): answer key expects Meta\_Data\_&\_Engineering \(engineering foundation\); model selects Prompt\_Iteration\_Tracker \(same functional neighbourhood\)\.

This failure is not addressable by either guidance layer because the architectural complementarity relationship \(the reason Meta\_Data\_&\_Engineering is the “correct” complement to Self\_Evolution\_&\_Refinement\) is not encoded anywhere in the current data structure\. It exists only in the library designer’s mental model\. We discuss the implications in Section [6](https://arxiv.org/html/2604.19777#S6)\.

Figure 2: Round 3 \(119 categories\) results by metric and condition \(grouped bars for conditions A, B, C, D; x\-axis: primary hits \(max 20\), secondary hits \(max 17\), total score \(max 28\.5\); y\-axis: count out of max\)\. Version D achieves perfect primary accuracy but zero secondary hits, revealing the fundamental asymmetry between primary and secondary routing\.

Figure 3: Per\-question primary routing hit/miss matrix for Round 3 \(119 categories\)\. ✓ = correct primary category; × = incorrect\. Questions Q02, Q12, Q13, Q14 are missed by all conditions except D, indicating persistent abstraction\-level confusion unresolvable by in\-file guidance or simple prompt hints\.

| Question | A | B | C | D |
| --- | --- | --- | --- | --- |
| Q01 | ✓ | ✓ | ✓ | ✓ |
| Q02 | × | × | × | ✓ |
| Q03 | ✓ | ✓ | ✓ | ✓ |
| Q04 | ✓ | ✓ | ✓ | ✓ |
| Q05 | ✓ | ✓ | ✓ | ✓ |
| Q06 | ✓ | ✓ | ✓ | ✓ |
| Q07 | ✓ | ✓ | ✓ | ✓ |
| Q08 | ✓ | ✓ | ✓ | ✓ |
| Q09 | ✓ | ✓ | ✓ | ✓ |
| Q10 | ✓ | ✓ | ✓ | ✓ |
| Q11 | ✓ | ✓ | ✓ | ✓ |
| Q12 | × | × | × | ✓ |
| Q13 | × | × | × | ✓ |
| Q14 | × | × | × | ✓ |
| Q15 | × | × | ✓ | ✓ |
| Q16 | × | ✓ | ✓ | ✓ |
| Q17 | ✓ | ✓ | ✓ | ✓ |
| Q18 | × | × | ✓ | ✓ |
| Q19 | × | × | ✓ | ✓ |
| Q20 | ✓ | ✓ | ✓ | ✓ |

## 6 Analysis and Discussion

### 6\.1 The Primary/Secondary Routing Asymmetry

Our experiments reveal a fundamental asymmetry between two routing tasks that superficially appear similar\. *Primary routing* is a classification problem: given a task description and a library, select the single most relevant category\. This problem is solvable because \(i\) each category has a unique name that, when explicitly pointed to in a prompt, can be matched to a task description, and \(ii\) an abstract priority rule \(“prefer pipeline over component”\) eliminates the systematic abstraction\-level confusion\.

*Secondary routing* is qualitatively different\. The answer key’s secondary categories are not the second\-most\-similar category in surface keyword space; they are categories that address a *different architectural dimension* of the same task\. The relationship $R_{2}(\text{Q05})=\texttt{Interactive\_Narrative}$ while $R_{1}(\text{Q05})=\texttt{Persona\_\&\_Narrative}$ reflects a design decision by the library architect: narratives require both character consistency \(Persona layer\) and plot execution infrastructure \(Interactive Narrative layer\)\. This relationship is not inferable from any text currently in the library\.

###### Principle 1 \(Architectural Intent Encoding\)\.

Cross\-category complementarity relationships that form part of a knowledge library’s intended usage pattern must be explicitly encoded in the data structure \(e\.g\., a complement field on each category\) rather than expected to be inferred by the LLM from category descriptions\. LLMs resolve secondary selection by surface proximity; intended complementarity is invisible without explicit encoding\.
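Principle 1 can be illustrated with a minimal Python sketch in which the designer's pairing is stored as data and read back by lookup, so nothing is left for the model to infer\. The two pairings are the ones discussed in this paper; the dictionary layout, the routing hint texts, and the helper name `secondary_for` are assumptions made for illustration, not the library's actual schema\.

```python
from typing import Optional

# Hypothetical library fragment. Field names and hint texts are illustrative
# assumptions; only the category names and pairings come from the paper.
LIBRARY = {
    "Self_Evolution_&_Refinement": {
        "routing_hint": "Iterative self-improvement of prompt logic",
        "complement": "Meta_Data_&_Engineering",
    },
    "Persona_&_Narrative_Synthesis": {
        "routing_hint": "Consistent character voice and backstory",
        "complement": "Interactive_Narrative_&_Creative_Fiction_Engine",
    },
    "Sensory_Audio_Design": {
        "routing_hint": "Sound layers derived from scene descriptions",
        "complement": None,  # no secondary target in the answer key
    },
}


def secondary_for(primary: str, library: dict) -> Optional[str]:
    """Return the designer-declared complement of a primary category."""
    entry = library.get(primary)
    if entry is None:
        raise KeyError(f"unknown category: {primary}")
    return entry.get("complement")


print(secondary_for("Self_Evolution_&_Refinement", LIBRARY))
```

With this layout, secondary selection becomes a deterministic lookup once primary routing has succeeded, which is the behaviour the principle asks for\.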

#### Why prompt instructions cannot substitute for explicit encoding\.

Our experimental data provide a precise characterisation of this boundary\. Across all four guidance conditions \(A, B, C, D\) and all three library scales \(36, 60, and 119 categories\), secondary hit rates remain at or near zero, with the sole partial exception of Version B Round 2 \(3/17 = 17\.6%\), which is not statistically meaningful given the single\-run design\. Critically, this failure is *scale\-independent* and *instruction\-independent*: it persists regardless of library size, and regardless of whether the system prompt contains explicit routing rules\. Version D, which achieves perfect primary accuracy \(20/20\) through explicit category naming and priority rules, still achieves 0/17 secondary hits\. This dissociation confirms that the failure is not a symptom of attentional overload or insufficient guidance; it is a knowledge absence\.

The underlying mechanism can be stated precisely\. LLMs acquire knowledge through statistical co\-occurrence in training data\. The concept of *architectural complementarity* is present in that training data: LLMs have encountered countless descriptions of how execution layers depend on engineering foundations, how governance layers constrain mechanism layers, and so on\. However, the *specific pairings* defined by a practitioner’s private library design \(that Self\_Evolution\_&\_Refinement pairs with Meta\_Data\_&\_Engineering, or that Persona\_&\_Narrative\_Synthesis pairs with Interactive\_Narrative\_&\_Creative\_Fiction\_Engine\) have never appeared in any training corpus, because they are the products of a single designer’s architectural decisions\. This exposes a boundary in LLM capability that is easy to overlook: the distinction between *knowing a concept* and *knowing its instantiation in a private context*\.

Furthermore, “architectural complementarity” and “semantic similarity” point in *opposite directions* in embedding space: a category’s architectural complement is typically drawn from a different abstraction level and a different functional domain, making it semantically *distant* rather than proximate\. An LLM’s default retrieval behaviour \(select the nearest semantic neighbour\) is therefore systematically misaligned with the secondary routing task, regardless of how the task is framed in the prompt\. The practical implication is a hard design constraint: *any knowledge that exists only in the library designer’s mental model must be explicitly serialised into the data structure before an LLM can act on it\.*

### 6\.2 The Scalability Boundary of In\-File Guidance

Version B’s collapse from PA=16 \(Round 2\) to PA=13 \(Round 3\), identical to the no\-guidance baseline, reveals a concrete scalability boundary for in\-file routing hints\. The 100\-character routing hints collectively occupy approximately 12,000 tokens in the category\_index at 119 categories\. This block, while positioned at the file’s primacy location, is itself long enough to suffer internal lost\-in\-the\-middle degradation\. Entries near the middle of the 119\-entry index receive weaker attention than those at the beginning and end\.

This suggests a practical design rule:

###### Principle 2 \(Routing Hint Density Limit\)\.

In\-file routing hint indexes are effective up to approximately 60 categories when using 100\-character hints\. Beyond this scale, the index block itself becomes too long for reliable end\-to\-end attention\. Solutions include: \(i\) shortening hints to 50 characters, \(ii\) splitting the library into multiple files with a meta\-index, or \(iii\) adding explicit prompt\-level priority rules for the most frequently confused categories \(the Dual\-Layer approach\)\.
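The budget behind Principle 2 can be checked with a few lines of Python\. This sketch only counts characters occupied by the hints themselves and shows option \(ii\), splitting the library so that no single index exceeds about 60 entries\. The function names and the placeholder category names are hypothetical; the 60\-entry ceiling and the 100\- and 50\-character hint lengths are taken from the principle above\.

```python
def index_size_chars(n_categories: int, hint_chars: int = 100) -> int:
    """Upper bound on the characters occupied by routing hints alone."""
    return n_categories * hint_chars


def shard(categories: list[str], max_per_file: int = 60) -> list[list[str]]:
    """Split a category list into files of at most `max_per_file` entries."""
    if max_per_file < 1:
        raise ValueError("max_per_file must be positive")
    return [categories[i:i + max_per_file]
            for i in range(0, len(categories), max_per_file)]


categories = [f"category_{i:03d}" for i in range(119)]  # placeholder names
print(index_size_chars(119))                 # 11900
print(index_size_chars(119, hint_chars=50))  # 5950
print([len(s) for s in shard(categories)])   # [60, 59]
```

Halving the hint length halves the index, while sharding keeps each index at or below the size at which in\-file guidance was still effective in Round 2; a meta\-index over the shards would then be needed to choose among them\.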

### 6\.3 Three Principles of Effective Prompt\-Level Guidance

Versions C and D’s success relative to B at large scale, combined with D’s perfect accuracy, allows us to distil three principles for prompt\-level guidance design:

###### Principle 3 \(Explicit Category Naming\)\.

For categories that are frequently confused with semantically proximate distractors, explicit naming in the system prompt \(“For multi\-agent workflow tasks, the correct category is Distributed\_Cognition\_&\_Context\_Orchestration”\) is more reliable than relying on the model to navigate a long index\. Generic structural instructions without specific names are insufficient\.

###### Principle 4 \(Abstract Hierarchy Rule\)\.

A single abstract priority rule, “prefer broader pipeline/governance categories over narrower mechanism/component categories when both seem relevant”, resolves the most common class of routing errors at scale\. This rule addresses the pattern where a distractor name is more literally descriptive of the task’s surface mechanics while the correct category addresses the task’s systemic function\.

###### Principle 5 \(Complementary Layer Division of Labour\)\.

Effective dual\-layer guidance requires that the two layers address *different* routing challenges: in\-file metadata handles structural navigation over the full category space, while prompt rules handle abstraction\-level disambiguation for specific high\-risk category pairs\. Duplicate content across layers provides no additional benefit\.
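Principles 3 and 4 together describe a prompt block whose length depends on the number of high\-risk categories, not on the size of the library\. The following Python sketch assembles such a block\. It is an illustration under stated assumptions: the exact wording of the experimental prompts is not reproduced here, and `build_routing_prompt` and `HIERARCHY_RULE` are hypothetical names\.

```python
# Wording adapted from Principle 4; the exact experimental prompt is not shown
# in this section, so treat this text as an assumption.
HIERARCHY_RULE = (
    "When both seem relevant, prefer broader pipeline/governance categories "
    "over narrower mechanism/component categories."
)


def build_routing_prompt(named_routes: dict[str, str]) -> str:
    """Assemble named routes (Principle 3) plus one abstract rule (Principle 4)."""
    lines = ["Routing rules:"]
    for task_type, category in named_routes.items():
        lines.append(f"- For {task_type} tasks, the correct category is {category}.")
    lines.append(f"- {HIERARCHY_RULE}")
    return "\n".join(lines)


prompt = build_routing_prompt({
    "multi-agent workflow": "Distributed_Cognition_&_Context_Orchestration",
})
print(prompt)
```

Adding distractor categories to the library leaves this block unchanged, which matches the explanation given for why prompt\-level guidance holds up at 119 categories\.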

### 6\.4 Cognitive Framing: Data Object vs\. Instruction Context

A striking empirical finding is that navigational instructions embedded *within a data file* and identical instructions provided in the *system prompt* produce different routing behaviour, even when both occupy primacy positions within their respective contexts\. This observation is consistent with the primacy bias documented by Liu et al\. \[[10](https://arxiv.org/html/2604.19777#bib.bib1)\], who demonstrate that LLMs attend most strongly to tokens near the beginning of their input context; instructions co\-located with the data they describe benefit from this bias within the data object’s own processing scope\.

We hypothesise that this reflects a difference in cognitive framing: the LLM processes the system prompt as an instructional context establishing its role and operating procedures, while it processes the uploaded data file as a content object to be navigated\. Instructions embedded within the data object benefit from being co\-processed with the data they describe—a form of inline contextualisation—whereas system prompt instructions must bridge a representational gap to influence navigation of a separately\-encoded data object\.

This framing predicts that at small scales \(Round 1\), both placements are equally effective because the navigational challenge is trivial\. At medium scales \(Round 2\), in\-file guidance provides a modest advantage via primacy bias within the data object\. At large scales \(Round 3\), the advantage reverses because the data object’s internal primacy\-bias effect degrades over the 119\-entry index, while the system prompt’s abstract rules maintain full effectiveness regardless of library size\.

### 6\.5 Limitations

Several limitations constrain the generalisability of our findings:

1. Single model and single run: All experiments use Claude Opus 4\.6 with a single run per condition\. LLM responses have non\-zero temperature, introducing sampling variance\. Future work should replicate with multiple runs \(at least $n=3$\) and additional models \(GPT\-4o, Gemini\)\.
2. Human\-designed answer key: The secondary routing answer key reflects the library designer’s architectural intent, not independently validated ground truth\. A different expert might assign different secondary categories to some questions\.
3. Single knowledge domain: HISL is a prompt\-engineering skill library\. Generalisation to other domains \(medical, legal, scientific\) requires additional validation, though the mechanisms we identify \(positional bias, abstraction\-level confusion, surface keyword proximity\) are domain\-independent\.
4. Q02, Q12, Q13, Q14 structural issues: The four permanently\-lost questions reflect a mismatch between the distractor design and the answer key’s abstraction\-level preferences\. These may reflect answer key issues as much as model limitations\.
5. Architectural dependence on autoregressive primacy bias: SDSR’s core mechanism \(placing navigational metadata at the file’s primacy position\) presupposes that the LLM exhibits primacy bias\. Wu et al\. \[[16](https://arxiv.org/html/2604.19777#bib.bib2)\] show that this bias is a structural consequence of causal attention masking, and Salvatore et al\. \[[12](https://arxiv.org/html/2604.19777#bib.bib3)\] demonstrate that the primacy effect emerges in autoregressive architectures but not in bidirectional encoder\-decoders\. If future mainstream LLM architectures adopt bidirectional or modified masking schemes, as proposed by Wu et al\. \[[16](https://arxiv.org/html/2604.19777#bib.bib2)\] as a mitigation strategy, the primacy\-exploitation component of SDSR would need to be re\-evaluated\. The prompt\-level guidance layer \(Version C/D\), however, operates through explicit instruction\-following rather than positional bias, and is therefore expected to remain effective across architectural changes\.

## 7 Self\-Describing Structured Retrieval as a RAG Alternative

### 7\.1 RAG’s Fundamental Assumptions and Their Costs

Retrieval\-Augmented Generation \[[9](https://arxiv.org/html/2604.19777#bib.bib13)\] makes three implicit assumptions that are often violated in expert\-curated knowledge libraries:

1. Surface similarity ≈ relevance: Vector similarity retrieval assumes that semantically relevant documents will be close in embedding space to the query\. For human\-defined categories whose distinctions are definitional rather than distributional \(e\.g\., “governance layer” vs\. “execution layer”\), this assumption fails\.
2. Chunks are self\-contained: Fixed\-size chunking assumes that relevant information can be extracted from a document fragment\. For structured libraries, a skill’s meaning depends on its category membership; splitting them severs this dependency\.
3. Boundaries are not pre\-known: RAG assumes that the relevant scope of a query cannot be determined without vector search\. For curated libraries, the designer already knows the category boundaries; the retrieval problem is routing, not discovery\.

Beyond these assumptions, RAG carries operational costs that are unnecessary for structured libraries: embedding model inference, vector database infrastructure \(FAISS, Weaviate, Pinecone\), index maintenance on updates, and chunking strategy tuning\. Packowski et al\. \[[11](https://arxiv.org/html/2604.19777#bib.bib18)\] report that enterprise RAG deployments require iterative content design and evaluation effort to achieve stable results, with significant total cost of ownership\.

### 7\.2 The SDSR Architecture

###### Definition 4 \(Self\-Describing Structured Retrieval\)\.

*Self\-Describing Structured Retrieval \(SDSR\)* is a two\-tier retrieval architecture in which: \(1\) a structured knowledge file embeds a \_summary block containing a human\-authored index of its categories, positioned at the file’s primacy location; \(2\) a Python orchestration layer reads only the \_summary block of each file to identify relevant files, then loads only the targeted files’ full content into the LLM context\.

Figure 4: The SDSR two\-tier retrieval pipeline \(flow: user query $q$ → Python reads \_summary only, all files, ∼200 tokens/file → relevant files? \(no match: expand scope; yes: 1–3 files\) → load full content of matched files, ≤ 5K tokens → LLM processes a small context → answer \+ skill selection\)\. Python reads only the \_summary blocks \(low cost\) to identify relevant files, then loads full content \(small context\) for LLM processing\. No vector database or embedding model is required\.

The SDSR pipeline \(Figure [4](https://arxiv.org/html/2604.19777#S7.F4)\) operates as follows:

Algorithm 1: SDSR Two\-Tier Retrieval

1. Input: query $q$, file registry $\mathcal{F}=\{f_{1},\ldots,f_{K}\}$
2. Output: selected skill set $\mathcal{S}^{*}$
3. Tier 1: Summary Scan
4. for each file $f_{i}\in\mathcal{F}$ do
5. $\sigma_{i}\leftarrow\text{read\_json\_prefix}(f_{i},\texttt{\_summary})$ ▷ ∼200 tokens per file
6. end for
7. $\mathcal{F}^{*}\leftarrow\text{LLM\_route}(q,\{\sigma_{1},\ldots,\sigma_{K}\})$ ▷ Identify 1–3 relevant files
8. Tier 2: Full Content Processing
9. for each file $f_{i}\in\mathcal{F}^{*}$ do
10. $\text{context}_{i}\leftarrow\text{read\_json\_full}(f_{i})$
11. end for
12. $\mathcal{S}^{*}\leftarrow\text{LLM\_select}(q,\{\text{context}_{i}:f_{i}\in\mathcal{F}^{*}\})$
13. return $\mathcal{S}^{*}$

The key efficiency property is that Tier 1 reads only the \_summary block of each file \(approximately 200 tokens per file regardless of file size\), making the total Tier 1 context proportional to the number of files rather than their cumulative content\. For a corpus of 100 files each containing 10,000 tokens, the Tier 1 context is approximately 20,000 tokens \(within any modern LLM’s context window\), compared to 1,000,000 tokens for naive full\-corpus injection\.
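The two tiers of Algorithm 1 can be sketched in Python as follows\. This is a minimal illustration, not the authors' implementation: the LLM calls are passed in as plain callables \(`route` and `select` are hypothetical stand\-ins for LLM\_route and LLM\_select\), each file is assumed to be a JSON object with a top\-level `_summary` key, and the "expand scope" branch of Figure 4 is omitted\.

```python
import json
from pathlib import Path
from typing import Callable

# Both callables take (query, mapping of file name -> content) and return names.
Router = Callable[[str, dict[str, dict]], list[str]]
Selector = Callable[[str, dict[str, dict]], list[str]]


def read_summary(path: Path) -> dict:
    """Return only the `_summary` block of one JSON file.

    Simplification: json.load parses the whole file, so the saving here is in
    LLM context size, not disk I/O; a streaming parser could stop earlier.
    """
    with path.open(encoding="utf-8") as fh:
        return json.load(fh).get("_summary", {})


def sdsr_retrieve(query: str, registry: list[Path],
                  route: Router, select: Selector,
                  max_files: int = 3) -> list[str]:
    # Tier 1: show the router summaries only and keep at most `max_files` names.
    summaries = {p.name: read_summary(p) for p in registry}
    chosen = route(query, summaries)[:max_files]
    # Tier 2: load full content for the chosen files only.
    by_name = {p.name: p for p in registry}
    contexts = {
        name: json.loads(by_name[name].read_text(encoding="utf-8"))
        for name in chosen if name in by_name
    }
    return select(query, contexts)
```

Injecting the router and selector keeps the orchestration testable without a model; in a deployment they would wrap calls to whichever LLM API is in use\.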

### 7\.3 Comparison with RAG

Table 5: Comparison of SDSR versus traditional RAG across key dimensions\.

Table [5](https://arxiv.org/html/2604.19777#S7.T5) summarises the trade\-offs\. SDSR is strictly preferable to RAG when: \(i\) the knowledge base has explicit human\-defined semantic boundaries; \(ii\) the total corpus size is within the scalable range \(≤ a few million tokens across all files\); and \(iii\) the pre\-processing investment in authoring \_summary blocks is acceptable \(typically one LLM\-assisted pass per file\)\.

RAG retains its advantage for truly unstructured corpora \(news archives, raw web crawls, heterogeneous document collections\) where no natural structure exists to anchor a \_summary block\.

### 7\.4 Extension: Structured Pre\-Processing of Unstructured Documents

SDSR’s scope can be extended to semi\-structured document types through a one\-time structuring pass\. We illustrate with legal judgments as a concrete example\.

A judicial decision has recoverable structure: a plaintiff’s claims section, a defendant’s response section, the court’s reasoning section, and a dispositif \(holding\)\. Rule\-based or LLM\-assisted extraction can identify these sections automatically for the majority of judgment formats\. Once sectioned, a \_summary block can be authored that includes:

- Core factual claims \(1–2 sentences per party\)
- The court’s central legal reasoning \(2–3 sentences\)
- Cross\-references: “The court’s reasoning in Section C directly cites the plaintiff’s damages argument in Section A, paragraphs 4–7; load both sections when the query involves quantum of damages\.”

The cross\-reference field is the SDSR equivalent of the complement field for knowledge libraries: it makes explicit the relationships between document sections that would otherwise require the LLM to infer from full\-text reading\. This enables the Tier 1 routing phase to identify not only the relevant judgment file, but also which sections of that file must be co\-loaded to answer the query\.
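The co\-loading behaviour described above can be sketched in Python\. The `_summary` layout, the field names \(`sections`, `cross_references`, `from`, `to`, `topic`\), and the helper `sections_to_load` are all assumptions made for illustration; the paper does not fix a schema for judgment files\.

```python
# Hypothetical `_summary` block for one judgment; all field names are assumptions.
SUMMARY = {
    "sections": {
        "A": "Plaintiff's claims",
        "B": "Defendant's response",
        "C": "Court's reasoning",
        "D": "Holding",
    },
    "cross_references": [
        {"from": "C", "to": "A", "topic": "quantum of damages"},
    ],
}


def sections_to_load(summary: dict, matched: set[str]) -> list[str]:
    """Expand matched sections with sections linked by a cross-reference.

    Single pass only: links of links are not followed in this sketch.
    """
    needed = set(matched)
    for ref in summary.get("cross_references", []):
        if ref["from"] in matched or ref["to"] in matched:
            needed.update((ref["from"], ref["to"]))
    return sorted(needed)


print(sections_to_load(SUMMARY, {"C"}))  # ['A', 'C']
print(sections_to_load(SUMMARY, {"D"}))  # ['D']
```

A query routed to the court's reasoning therefore also pulls in the plaintiff's claims, while a query about the holding alone loads a single section\.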

Figure 5: SDSR applied to legal judgment retrieval \(flow: raw judgment text, unstructured → rule\-based sectioning → sectioned judgment \(P / D / Court / Holding\) → LLM summarise, once → \_summary block with cross\-references → Tier 1: Python reads summary \(200 tokens\) → on match, Tier 2: load relevant sections \+ cross\-refs → LLM answers query\)\. A one\-time structuring pass converts unstructured text into an SDSR\-compatible file\. Cross\-reference fields in the \_summary block enable co\-loading of dependent sections during Tier 2 retrieval\.

Pre\-processing cost for this workflow is one LLM call per judgment to generate the \_summary block, typically consuming 1,000–2,000 tokens\. For a corpus of 10,000 judgments, the total pre\-processing cost is therefore approximately 10–20 million tokens, comparable to the embedding cost of a RAG pipeline over the same corpus, but producing richer, explicitly\-structured metadata rather than opaque dense vectors\.
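The pre\-processing estimate is a direct product of corpus size and per\-call token usage\. A short Python check, using the per\-judgment range and corpus size stated above \(`preprocessing_tokens` is a hypothetical helper name\):

```python
def preprocessing_tokens(n_documents: int,
                         tokens_per_call: tuple[int, int]) -> tuple[int, int]:
    """Return the (low, high) total tokens for one summarisation call per document."""
    low, high = tokens_per_call
    return n_documents * low, n_documents * high


low, high = preprocessing_tokens(10_000, (1_000, 2_000))
print(f"{low / 1e6:.0f}-{high / 1e6:.0f} million tokens")  # 10-20 million tokens
```

Because this is a one\-time cost, it scales with the number of judgments added rather than with the number of queries served\.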

## 8 Conclusion

We have presented Self\-Describing Structured Retrieval \(SDSR\), a framework that exploits the LLM’s positional primacy bias by embedding human\-authored navigational metadata at the beginning of structured data files\. Through a controlled four\-round experiment with systematic category\-count scaling and adversarial distractor injection, we have established the following findings:

1. In\-file guidance is effective up to ∼60 categories but loses effectiveness at larger scales as the routing hint index itself becomes subject to lost\-in\-the\-middle degradation\.
2. Prompt\-level guidance maintains effectiveness at scale because abstract routing rules are fixed\-length and occupy the system prompt’s primacy position, independent of library size\.
3. Dual\-layer guidance \(in\-file \+ prompt\) achieves perfect primary routing accuracy \(100%, 20/20\) at 119 categories, through complementary specialisation: in\-file metadata handles structural navigation, prompt rules handle abstraction\-level disambiguation\.
4. Secondary routing requires explicit architectural intent encoding: cross\-category complementarity relationships are invisible to all guidance strategies unless encoded in a dedicated field \(e\.g\., complement\)\.
5. SDSR is a viable lightweight alternative to RAG for structured knowledge libraries with human\-defined semantic boundaries, eliminating the need for embedding models, vector databases, and chunking infrastructure while providing higher boundary accuracy\.

These findings have immediate practical implications for the design of LLM\-integrated knowledge management systems\. Knowledge libraries should: \(i\) include a \_summary block with compact routing\_hint fields; \(ii\) pair this with a system prompt that names high\-risk category pairs and provides an abstract hierarchy rule; and \(iii\) add explicit complement fields to encode the architectural pairings that the library designer intends\.
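Recommendations \(i\) and \(iii\) lend themselves to an automated check at authoring time\. The following Python sketch validates a `_summary` block; it assumes, purely for illustration, that `category_index` is a mapping from category name to an entry holding `routing_hint` and an optional `complement`\. The nesting and the helper name `validate_summary` are not taken from the paper\.

```python
def validate_summary(summary: dict, max_hint_chars: int = 100) -> list[str]:
    """Return human-readable problems found in a `_summary` block (empty if none)."""
    problems = []
    index = summary.get("category_index", {})
    for name, entry in index.items():
        hint = entry.get("routing_hint", "")
        if not hint:
            problems.append(f"{name}: missing routing_hint")
        elif len(hint) > max_hint_chars:
            problems.append(f"{name}: routing_hint exceeds {max_hint_chars} chars")
        complement = entry.get("complement")
        # A declared complement must point at a category that actually exists.
        if complement is not None and complement not in index:
            problems.append(f"{name}: complement {complement!r} is not a known category")
    return problems


example = {"category_index": {
    "Self_Evolution_&_Refinement": {
        "routing_hint": "Iterative self-improvement of prompt logic",  # illustrative text
        "complement": "Meta_Data_&_Engineering",
    },
    "Meta_Data_&_Engineering": {
        "routing_hint": "Engineering foundation for structured metadata",  # illustrative text
    },
}}
print(validate_summary(example))  # []
```

Running such a check whenever the library changes would catch dangling complement references before they reach the model\.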

#### Future Work\.

Priority directions include: \(i\) multi\-run replication with temperature sampling to quantify variance; \(ii\) cross\-model generalisation to GPT\-4o and Gemini; \(iii\) empirical validation of the complement field approach to secondary routing; \(iv\) SDSR implementation and benchmarking for legal judgment retrieval; \(v\) investigation of the optimal routing hint length as a function of category count; and \(vi\) integration with white\-box position\-bias mitigation methods such as Ms\-PoE \[[18](https://arxiv.org/html/2604.19777#bib.bib6)\], which rescales RoPE with head\-wise factors, or PINE \[[15](https://arxiv.org/html/2604.19777#bib.bib7)\], which replaces causal attention with bidirectional attention at the document level\. These model\-level interventions address the same lost\-in\-the\-middle degradation that limits Version B’s scalability \(Principle [2](https://arxiv.org/html/2604.19777#Thmprinciple2)\), and could raise the effective category ceiling of in\-file routing hints from ∼60 to substantially higher counts\. Evaluating whether such white\-box methods compose additively or synergistically with SDSR’s black\-box dual\-layer guidance would clarify a practical deployment spectrum: pure black\-box SDSR for API\-only access, SDSR \+ Ms\-PoE/PINE for open\-weight deployments, and training\-time interventions \(e\.g\., FILM \[[1](https://arxiv.org/html/2604.19777#bib.bib8)\]\) for maximum debiasing at highest cost\.

## References

- \[1\] S\. An, Z\. Ma, Z\. Lin, N\. Zheng, J\. Lou, and W\. Chen \(2024\) Make your LLM fully utilize the context\. In Advances in Neural Information Processing Systems 37 \(NeurIPS\)\.
- \[2\] X\. Chen, P\. Gao, J\. Song, and X\. Tan \(2024\) HiQA: a hierarchical contextual augmentation RAG for massive documents QA\. arXiv preprint arXiv:2402\.01767\.
- \[3\] H\. Ebbinghaus \(1913\) Memory: a contribution to experimental psychology\. Teachers College, Columbia University\. Note: Originally published in German, 1885\.
- \[4\] D\. Edge, H\. Trinh, N\. Cheng, J\. Bradley, A\. Chao, A\. Mody, S\. Truitt, and J\. Larson \(2024\) From local to global: a graph RAG approach to query\-focused summarization\. arXiv preprint arXiv:2404\.16130\.
- \[5\] Y\. Fu, R\. Panda, X\. Niu, X\. Yue, H\. Hajishirzi, Y\. Kim, and H\. Peng \(2024\) Data engineering for scaling language models to 128K context\. arXiv preprint arXiv:2402\.10171\.
- \[6\] H\. Guo et al\. \(2024\) A survey on knowledge graph\-enhanced RAG\. arXiv preprint\.
- \[7\] X\. Guo and S\. Vosoughi \(2025\) Serial position effects of large language models\. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp\. 927–953\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.52)\.
- \[8\] Z\. He, H\. Jiang, Z\. Wang, Y\. Yang, L\. K\. Qiu, and L\. Qiu \(2024\) Position engineering: boosting large language models through positional information manipulation\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), Miami, Florida, USA, pp\. 7333–7345\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.417)\.
- \[9\] P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\) Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. In Advances in Neural Information Processing Systems 33 \(NeurIPS\)\.
- \[10\] N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\) Lost in the middle: how language models use long contexts\. Transactions of the Association for Computational Linguistics 12, pp\. 157–173\.
- \[11\] S\. Packowski, I\. Halilovic, J\. Schlotfeldt, and T\. Smith \(2024\) Optimizing and evaluating enterprise retrieval\-augmented generation \(RAG\): a content design perspective\. In Proceedings of the 2024 8th International Conference on Advances in Artificial Intelligence \(ICAAI ’24\), New York, NY, USA\. External Links: [Document](https://dx.doi.org/10.1145/3704137.3704181)\.
- \[12\] N\. Salvatore, H\. Wang, and Q\. Zhang \(2025\) Lost in the middle: an emergent property from information retrieval demands in LLMs\. arXiv preprint arXiv:2510\.10276\.
- \[13\] S\. Schulhoff, M\. Ilie, N\. Balepur, K\. Kahadze, A\. Liu, C\. Si, Y\. Li, A\. Gupta, H\. Han, S\. Schulhoff, et al\. \(2024\) The prompt report: a systematic survey of prompting techniques\. arXiv preprint arXiv:2406\.06608\.
- \[14\] M\. Suzgun and A\. T\. Kalai \(2024\) Meta\-prompting: enhancing language models with task\-agnostic scaffolding\. arXiv preprint arXiv:2401\.12954\.
- \[15\] Z\. Wang, H\. Zhang, X\. Li, K\. Huang, C\. Han, S\. Ji, S\. M\. Kakade, H\. Peng, and H\. Ji \(2025\) Eliminating position bias of language models: a mechanistic approach\. In Proceedings of the 13th International Conference on Learning Representations \(ICLR\)\.
- \[16\] X\. Wu, Y\. Wang, S\. Jegelka, and A\. Jadbabaie \(2025\) On the emergence of position bias in transformers\. In Proceedings of the 42nd International Conference on Machine Learning \(ICML\)\. Note: arXiv:2502\.01951\.
- \[17\] Y\. Yu, W\. Ping, Z\. Liu, B\. Wang, J\. You, C\. Zhang, B\. Catanzaro, and A\. Anandkumar \(2024\) RankRAG: unifying context ranking with retrieval\-augmented generation in LLMs\. arXiv preprint arXiv:2407\.02485\.
- \[18\] Z\. Zhang, R\. Chen, S\. Liu, Z\. Yao, O\. Ruwase, B\. Chen, X\. Wu, and Z\. Wang \(2024\) Found in the middle: how language models use long contexts better via plug\-and\-play positional encoding\. In Advances in Neural Information Processing Systems 37 \(NeurIPS\)\.

## Appendix AFull Test Question Set with Answer Keys

Table 6: Complete 20\-question test set with primary and secondary target categories\. Questions are designed with keyword avoidance: no question contains a substring of its primary target category name\.

| Q\# | Task Description | Primary Target | Secondary Target |
| --- | --- | --- | --- |
| Q01 | “I need to design an AI assistant that, upon receiving user input, can automatically determine what type of request it is and decide what processing strategy to apply\.” | Cognitive\_Architecture\_&\_Routing | \(none\) |
| Q02 | “Help me build an AI system that can rigorously check its own output for logical contradictions and audit its own reasoning process step by step\.” | Axiomatic\_Logic\_&\_Audit\_Systems | Cognitive\_Architecture |
| Q03 | “I want an AI that, after reading an academic paper, can identify the unstated assumptions the author relies on and surface potential methodological weaknesses\.” | Academic\_Insight\_&\_Forensics | Axiomatic\_Logic |
| Q04 | “Design an interactive system for teaching elementary school students mathematics, one that dynamically adjusts difficulty based on student responses and makes the learning experience feel like a game\.” | Interactive\_Pedagogy\_&\_Gamification | Game\_Design\_&\_Mechanics |
| Q05 | “I need an AI to play a detective character with a complex backstory, maintaining a consistent personality and speaking style throughout interactions with users\.” | Persona\_&\_Narrative\_Synthesis | Interactive\_Narrative\_Engine |
| Q06 | “Help me design a prompt system that can automatically extract structured data from messy unstructured input and convert it into a machine\-readable format\.” | Data\_Structuring\_&\_Engineering | Meta\_Data\_&\_Engineering |
| Q07 | “I’m developing an RPG game and need an AI to help design a combat balance system, including character progression curves and equipment power scaling formulas\.” | Game\_Design\_&\_Mechanics | RPG\_Narrative\_Director |
| Q08 | “Generate a web design specification that starts from a brand’s visual identity and extends all the way to concrete CSS implementation\.” | UI\_UX\_&\_Frontend\_Engineering | Visual\_Architecture |
| Q09 | “I need to evaluate the potential social side\-effects of a new policy, including indirect impacts and long\-term external costs that are difficult to quantify\.” | Policy\_Impact\_&\_Externalities | Strategic\_Decision\_Frameworks |
| Q10 | “Design an AI system to act as a tabletop RPG game master, managing multiple players’ actions, dice outcomes, and generating real\-time narrative responses consistent with the world’s lore\.” | RPG\_Narrative\_Director | Interactive\_Narrative\_Engine |
| Q11 | “I need an AI capable of conducting tarot card readings, interpreting card combinations based on spread positions and delivering symbolically rich, layered interpretations\.” | Occult\_&\_Ritual\_Systems | \(none\) |
| Q12 | “Help me build a multi\-agent AI workflow where each agent handles a different stage of the pipeline, and context must be reliably passed between agents without loss\.” | Distributed\_Cognition\_&\_Context\_Orch\. | Autonomous\_System\_Engineering |
| Q13 | “I need a thinking framework for analyzing competitor weaknesses and identifying systemic vulnerabilities to inform business strategy\.” | Adversarial\_Systems\_Thinking | Strategic\_Decision\_Frameworks |
| Q14 | “Design an AI that can automatically convert experimental code results into methodology and results sections formatted for academic publication\.” | Academic\_Research\_Synthesis\_Pipeline | Academic\_Insight\_&\_Forensics |
| Q15 | “I want to build a sales conversation system that can identify a prospect’s purchase intent and guide the conversation toward closing in the most persuasive way possible\.” | Revenue\_Generation\_&\_Commercial\_Logic | Product\_Psychology |
| Q16 | “Help me design an AI assistant that can detect a user’s cultural background and automatically adjust its communication style and content framing accordingly\.” | Cross\_Cultural\_Localization\_Intelligence | Persona\_&\_Narrative\_Synthesis |
| Q17 | “I need an AI system that can continuously optimize its own prompt logic, learning from past failures and iteratively improving its internal reasoning\.” | Self\_Evolution\_&\_Refinement | Meta\_Data\_&\_Engineering |
| Q18 | “Design an interactive fiction engine where every player choice shapes the world’s development, with multiple branching storylines that remain internally consistent\.” | Interactive\_Narrative\_&\_Fiction\_Engine | RPG\_Narrative\_Director |
| Q19 | “I want to validate a business idea from scratch at minimum cost, testing market response before committing to any development\.” | Minimalist\_Entrepreneurship\_Execution | Community\_Led\_Business\_Inception |
| Q20 | “Help me design a sound design AI that, given a visual scene description, can derive the corresponding sound layers and acoustic landscape structure\.” | Sensory\_Audio\_Design | \(none\) |

Table 7: Representative distractor categories by interference tier\.

| Tier | Distractor Name | Target Category Confused With |
| --- | --- | --- |
| High | Agent\_Handoff\_Protocol\_Design | Distributed\_Cognition\_&\_Context\_Orchestration |
| High | Competitive\_Intelligence\_Synthesis | Adversarial\_Systems\_Thinking |
| High | Conversion\_Funnel\_Architecture | Revenue\_Generation\_&\_Commercial\_Logic |
| High | Code\_To\_Methodology\_Translator | Academic\_Research\_Synthesis\_Pipeline |
| High | Cultural\_Signal\_Detection | Cross\_Cultural\_Localization\_Intelligence |
| High | Recursive\_Self\_Audit\_Engine | Axiomatic\_Logic\_&\_Audit\_Systems |
| High | Dynamic\_Difficulty\_Calibration | Interactive\_Pedagogy\_&\_Gamification |
| Low | Mycological\_Network\_Design | \(none: pure volumetric noise\) |
| Low | Trophic\_Cascade\_Analyzer | \(none: pure volumetric noise\) |
| Low | Glacial\_Erosion\_Patience\_Model | \(none: pure volumetric noise\) |
| Low | Coral\_Reef\_Ecosystem\_Modeler | \(none: pure volumetric noise\) |
| Low | Bioluminescence\_Signal\_Design | \(none: pure volumetric noise\) |
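
As an illustration of how the answer keys in Table 6 might be scored, the following Python sketch computes primary routing accuracy. It assumes a question counts as correct when its primary target category appears anywhere among the categories the model selected; the paper's exact scoring criterion may differ. The function name, the dictionary layout, and the sample model output are hypothetical.

```python
from typing import Dict, List


def primary_routing_accuracy(
    answer_key: Dict[str, str],
    selections: Dict[str, List[str]],
) -> float:
    """Fraction of questions whose primary target category was selected.

    Assumption: a question is a hit if the primary target appears anywhere
    in the model's selected categories for that question.
    """
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    hits = sum(
        1
        for question_id, target in answer_key.items()
        if target in selections.get(question_id, [])
    )
    return hits / len(answer_key)


# Four primary targets taken from Table 6.
answer_key = {
    "Q01": "Cognitive_Architecture_&_Routing",
    "Q11": "Occult_&_Ritual_Systems",
    "Q13": "Adversarial_Systems_Thinking",
    "Q20": "Sensory_Audio_Design",
}

# Hypothetical model output: Q13 is captured by a Table 7 distractor.
selections = {
    "Q01": ["Cognitive_Architecture_&_Routing"],
    "Q11": ["Occult_&_Ritual_Systems"],
    "Q13": ["Competitive_Intelligence_Synthesis"],
    "Q20": ["Sensory_Audio_Design"],
}

accuracy = primary_routing_accuracy(answer_key, selections)
print(f"primary routing accuracy: {accuracy:.0%}")
```

Under this criterion, a run that routes all 20 questions correctly scores 20/20, which corresponds to the 100% figure reported for Version D.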
## Appendix B Experimental Prompt Templates

#### Version A and B System Prompt \(minimal framing\):

```
You are a professional Prompt Engineer.
I will provide you with a skills.json library and 20 task requirements.
For each task, select the most relevant skills from the library.
For each skill you select, list ONLY the category_name and skill_name.
Do NOT explain your reasoning yet; just list the selections for all
20 tasks first.
Ensure that every skill you reference actually exists in the library.
```
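
The final instruction in this prompt, that every referenced skill must exist in the library, can also be checked after the fact. The sketch below is one way to do so in Python. The `category_name` and `skill_name` fields come from the prompt itself; the surrounding layout (a top-level `categories` list, each entry holding a `skills` list) and the sample names are assumptions made for illustration, not the paper's actual schema.

```python
from typing import Iterable, List, Set, Tuple

Selection = Tuple[str, str]  # (category_name, skill_name)


def index_library(library: dict) -> Set[Selection]:
    """Collect every (category_name, skill_name) pair present in the library."""
    return {
        (category["category_name"], skill["skill_name"])
        for category in library["categories"]
        for skill in category["skills"]
    }


def find_unknown_selections(
    library: dict, selections: Iterable[Selection]
) -> List[Selection]:
    """Return the selections that do not exist in the library, in input order."""
    known = index_library(library)
    return [pair for pair in selections if pair not in known]


# Hypothetical library fragment; the skill name is invented for this example.
library = {
    "categories": [
        {
            "category_name": "Sensory_Audio_Design",
            "skills": [{"skill_name": "scene_to_sound_layers"}],
        }
    ]
}

model_output = [
    ("Sensory_Audio_Design", "scene_to_sound_layers"),  # exists
    ("Sensory_Audio_Design", "invented_skill"),         # not in the library
]

unknown = find_unknown_selections(library, model_output)
print(unknown)
```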

#### Version C Additional Prompt Content \(added to above\):

```
Before reading the library:
- Scan the category_description fields to understand each category’s scope.
- Priority rule: when a broad pipeline/orchestration/governance category
  and a narrow mechanism/component category both seem relevant,
  prefer the broader one.
- Key high-priority categories: Cognitive_Architecture_&_Routing (routing),
  Axiomatic_Logic_&_Audit_Systems (logic/audit, NOT narrow audit loops),
  Distributed_Cognition_&_Context_Orchestration (multi-agent governance),
  Adversarial_Systems_Thinking (thinking frameworks, NOT intelligence),
  Academic_Research_Synthesis_Pipeline (end-to-end pipeline),
  Revenue_Generation_&_Commercial_Logic (commercial system),
  Cross_Cultural_Localization_Intelligence (full localization).
```

#### Version D:

Uses Version B’s JSON file \(with `_summary` block\) together with Version C’s extended system prompt, producing the dual\-layer configuration\.
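
The four conditions differ only in which of the two guidance layers is switched on, which can be sketched in Python as follows. This is a minimal illustration, not the authors' harness: the `_summary` key name, the `categories` layout, the placeholder prompt strings, and the helper names are all assumptions. Placing `_summary` first in the dictionary keeps it at the start of the serialized file, in line with the primacy-position idea.

```python
import json
from typing import Dict

# Placeholders standing in for the two prompt texts in this appendix.
BASE_PROMPT = "<Version A and B system prompt>"
ROUTING_HINT = "<Version C additional prompt content>"


def with_summary(library: dict, summary: dict) -> dict:
    """Return a copy of the library whose first key is `_summary`."""
    rest = {key: value for key, value in library.items() if key != "_summary"}
    return {"_summary": summary, **rest}


def build_condition(version: str, library: dict, summary: dict) -> Dict[str, str]:
    """Assemble the system prompt and serialized library for versions A to D."""
    if version not in {"A", "B", "C", "D"}:
        raise ValueError(f"unknown version: {version!r}")
    use_summary = version in {"B", "D"}  # in-file guidance layer
    use_hint = version in {"C", "D"}     # system-prompt guidance layer
    lib = with_summary(library, summary) if use_summary else library
    prompt = BASE_PROMPT + ("\n\n" + ROUTING_HINT if use_hint else "")
    # json.dumps preserves dict insertion order, so `_summary` is written first.
    return {"system_prompt": prompt, "library_json": json.dumps(lib, indent=2)}


# Hypothetical one-category library and summary, for illustration only.
library = {
    "categories": [
        {
            "category_name": "Sensory_Audio_Design",
            "category_description": "hypothetical description",
            "skills": [{"skill_name": "hypothetical_skill"}],
        }
    ]
}
summary = {"navigation_note": "hypothetical routing guidance"}

condition_a = build_condition("A", library, summary)
condition_d = build_condition("D", library, summary)
```

Here `condition_a` carries neither layer, while `condition_d` carries both the in-file block and the extended prompt.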

## Appendix C Distractor Category Design

Table [7](https://arxiv.org/html/2604.19777#A1.T7) shows representative distractor categories\.
