@johnsonshi86: https://x.com/johnsonshi86/status/2072112215097024961

X AI KOLs Timeline 07/01/26, 12:17 AM Tools

retrieval-augmented-generation rag corpus-retrieval harness-engineering inference-providers dr-dci bm25

Summary

Describes DR-DCI, an optimization that combines RAG with bash commands on a virtual filesystem to enable agents to perform precise corpus retrieval, and discusses scaling to distributed systems for inference providers.

https://t.co/SefC6wVWIZ


DR-DCI: Fast Corpus Retrieval as a Harness Engineering Differentiator for Inference Providers

Dr-DCI is an optimization built on top of RAG for letting agents run precise, verifiable search across large document collections without scanning the whole corpus on every query. It keeps a RAG-style first pass, an index lookup such as BM25, to narrow a huge corpus down to a manageable set of candidate documents. Then it adds a second layer: those candidate documents get materialized into a virtual file system, a sandboxed workspace the agent can operate on directly. The agent then runs real bash commands (grep, rg, cat, find) against that workspace to search, cross-reference, and verify evidence.

This matters most for inference providers and harness builders. The harness around a model, how it exposes tools, manages state, and bounds what the model can touch, is what increasingly separates one agent stack from another, not the model weights themselves. Harness design shows up directly in tool-calling benchmark scores and in how an agent performs on real work.

The material below translates the Dr-DCI paper (surfaced via Jo Kristian Bergum’s X post at the AI Engineer World’s Fair 2026) into systems terms for a Kubernetes and distributed-systems audience, then extends it: what it takes to run this pattern once the corpus isn’t on one machine. Closing that gap is what makes this pattern production-ready at scale, and it’s something worth exploring and benchmarking among inference providers.

The constraint: an LLM has a fixed-size RAM, not a disk

An LLM reads a fixed number of tokens per call, the context window. Think of it as RAM: fast, but small and bounded. A document collection (a “corpus”) can be billions of documents. That’s disk-sized data. You cannot load disk into RAM wholesale. You need an index and a way to fetch only the relevant slice.

That fetching step is what the AI field calls “retrieval.” Everything below is different strategies for doing it.

RAG: query an index, get back chunks, done

RAG (Retrieval-Augmented Generation) is the standard approach:

Offline indexing. Split every document into chunks (commonly 500 tokens). Run each chunk through an embedding model, which outputs a vector (a few hundred floats representing meaning, not exact words). Store all vectors in a vector database.
Query time. Embed the user’s question into the same vector space. Run an approximate-nearest-neighbor search against the vector DB. Get back the top-k closest chunks, typically 5-20.
Generation. Paste those chunks into the LLM’s context window. The LLM answers using only what it was handed.

The LLM never touches the rest of the corpus. It gets one retrieval pass, then it’s on its own.

The failure mode: if the top-k result set misses the actual answer, because a chunk boundary split a fact in half, or the embedding similarity didn’t catch a paraphrase, the LLM has no recourse. It can’t look further. It’s a single request to a single backend with no retry, no fallback, no second query informed by the first response.
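The three steps above can be sketched in a few lines. This is a toy under loud assumptions: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, chunks are split by word count rather than by tokens, and the "vector database" is a plain list scanned exhaustively rather than an approximate-nearest-neighbor index.

```python
import math
from collections import Counter


def chunk(text: str, size: int) -> list[str]:
    """Split text into chunks of `size` words (stand-in for token-based chunking)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Hypothetical embedding: a sparse word-count vector, NOT a real model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs: dict[str, str], size: int = 8) -> list[tuple[str, str, Counter]]:
    """Offline step: chunk every document and store (doc_id, chunk, vector)."""
    return [(doc_id, c, embed(c)) for doc_id, text in docs.items() for c in chunk(text, size)]


def retrieve(index, question: str, k: int = 2) -> list[tuple[str, str]]:
    """Query step: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    return [(doc_id, c) for doc_id, c, _ in ranked[:k]]


docs = {
    "a.txt": "the deploy failed with error E4012 after the cache node restarted",
    "b.txt": "quarterly revenue grew while infrastructure costs stayed flat",
}
index = build_index(docs)
context = retrieve(index, "what caused error E4012", k=1)
# Generation step: `context` is everything the model would ever see of the corpus.
```

Note that the 8-word chunk boundary already cuts "cache node restarted" away from the error code, which is the chunk-splitting failure described above in miniature.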

BM25: a common retriever, the alternative to embeddings

BM25 is one of the most common retrievers used in RAG, alongside embedding-based dense retrieval. It predates neural embeddings by decades. No model, no GPU, no training step. It’s a statistical scoring function: rank documents by how often the query’s terms appear, weighted so rare terms count more than common ones, normalized for document length. Build an inverted index once (a map from each term to the list of documents containing it, the same structure behind full-text search engines such as Lucene and Elasticsearch, where BM25 is the default scoring function), then query it directly.

The index comes directly from the corpus itself. Tokenize every document at indexing time (split into words, lowercase, strip filler words, sometimes reduce to a root form), and every distinct token that survives becomes a term in the index. A query gets tokenized the same way at query time. If a query token never appeared in any document, its postings list is simply empty, no match, not an error.

BM25 is fast and exact for names, IDs, error codes, code symbols. It’s worse than embeddings at matching a query to text that means the same thing but uses different words, since it needs the same token in both the query and the document: “car” and “automobile” share no tokens even though a human reads them as the same thing.
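A minimal sketch of the inverted index and scoring function described above. The assumptions are mine, not the source's: whitespace tokenization with lowercasing only (no stop-word removal or stemming), the commonly used parameter defaults k1 = 1.2 and b = 0.75, and the always-positive IDF variant log(1 + (N - n + 0.5) / (n + 0.5)).

```python
import math
from collections import Counter, defaultdict

K1, B = 1.2, 0.75  # common defaults; an assumption, not taken from the source


def tokenize(text: str) -> list[str]:
    return text.lower().split()


class BM25Index:
    def __init__(self, docs: dict[str, str]) -> None:
        self.doc_len: dict[str, int] = {}
        self.postings: dict[str, dict[str, int]] = defaultdict(dict)  # term -> {doc_id: tf}
        for doc_id, text in docs.items():
            tokens = tokenize(text)
            self.doc_len[doc_id] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.avg_len = sum(self.doc_len.values()) / len(self.doc_len)

    def search(self, query: str, k: int = 10) -> list[tuple[str, float]]:
        scores: dict[str, float] = defaultdict(float)
        n_docs = len(self.doc_len)
        for term in tokenize(query):
            plist = self.postings.get(term, {})  # unseen term: empty postings, not an error
            if not plist:
                continue
            idf = math.log(1 + (n_docs - len(plist) + 0.5) / (len(plist) + 0.5))
            for doc_id, tf in plist.items():
                length_norm = 1 - B + B * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (K1 + 1) / (tf + K1 * length_norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


bm25 = BM25Index({
    "d1": "car repair manual",
    "d2": "automobile repair guide",
    "d3": "error E4012 in cache",
})
car_hits = bm25.search("car")         # matches d1 only; d2 says "automobile"
miss_hits = bm25.search("zeppelin")   # term never indexed: empty result
```

The "car" query finds d1 and never d2, which is the vocabulary-mismatch weakness in concrete form. The "E4012" query, by contrast, is exactly the kind of lookup where token matching is reliable.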

DCI: skip the index, give the agent a shell

Direct Corpus Interaction (DCI) is the opposite extreme from RAG. Instead of pre-indexing and ranking, give the LLM agent actual shell tools: rg, grep, find, cat, read. The agent runs its own commands against the raw corpus, the way you’d grep -r a codebase you don’t have indexed.

This buys precision RAG doesn’t have. The agent can issue a second search based on what the first one returned, cross-reference two documents, verify an exact string exists, follow a reference from doc A to doc B. A single retrieval-then-answer pass can’t do any of that.

The failure mode: grep -r over billions of documents is a full scan with no index. Same problem as querying an unindexed table at scale: it gets slow, then it times out.

Dr-DCI: BM25 as a cache-population step in front of grep

This is the part that should click immediately if you’ve built caching layers. Dr-DCI uses BM25 as a fast first pass that populates a small working set. The agent then greps that working set directly, the DCI part.

The loop:

The agent calls a pull(query, k) tool. This runs a BM25 search over the full corpus and returns k candidate documents, whole documents, not chunks.
Those documents get materialized as real files in a scratch directory (the “workspace”), via hard links, so there’s no copy cost.
The agent runs grep/rg/cat/read against that small directory, not the full corpus.
If it doesn’t find what it needs, it calls pull again with a different query. New results get added to the same workspace. Already-pulled files aren’t re-fetched.

The workspace stays small, roughly 1,000-1,400 files, even when the underlying corpus is 10 million documents. BM25 is doing index lookup. DCI is doing the precise, stateful operations once the working set is small enough to fit on local disk.
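The loop above can be sketched as a small class. This is an illustration under stated assumptions, not the paper's implementation: `Workspace`, `keyword_search`, and the report format are hypothetical names, the corpus is assumed to be a flat directory on the same filesystem as the workspace, and the ranker is a word-count stand-in where a real system would use BM25.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


class Workspace:
    """A flat scratch directory populated by hard links to corpus files."""

    def __init__(self, corpus_dir: Path, root: Path,
                 search: Callable[[str, int], list[str]]) -> None:
        self.corpus_dir = corpus_dir
        self.root = root
        self.search = search              # hypothetical ranker: returns ranked file names
        self.pulled: set[str] = set()

    def pull(self, query: str, k: int) -> list[str]:
        """Rank the corpus, link unseen hits into the workspace, report ranks as text."""
        report = []
        for rank, name in enumerate(self.search(query, k), start=1):
            if name not in self.pulled:   # already-pulled files are not re-fetched
                os.link(self.corpus_dir / name, self.root / name)  # new name, same inode
                self.pulled.add(name)
            report.append(f"rank {rank}: {name}")
        return report


def keyword_search(corpus_dir: Path) -> Callable[[str, int], list[str]]:
    """Stand-in ranker that counts query-word occurrences (a real system would use BM25)."""
    def search(query: str, k: int) -> list[str]:
        words = query.lower().split()
        scored = []
        for path in sorted(corpus_dir.iterdir()):
            text = path.read_text().lower()
            score = sum(text.count(w) for w in words)
            if score:
                scored.append((score, path.name))
        scored.sort(key=lambda s: -s[0])
        return [name for _, name in scored[:k]]
    return search


base = Path(tempfile.mkdtemp())
corpus, work = base / "corpus", base / "workspace"
corpus.mkdir()
work.mkdir()
(corpus / "a.txt").write_text("cache node restarted, error E4012")
(corpus / "b.txt").write_text("error budget policy")
(corpus / "c.txt").write_text("holiday schedule")

ws = Workspace(corpus, work, keyword_search(corpus))
first = ws.pull("error E4012", k=2)
second = ws.pull("error policy", k=2)   # overlaps the first pull, so nothing new is linked
```

After both pulls the workspace holds two files, and the agent's grep, rg, or cat commands would run against `work` rather than against `corpus`.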

Why this is faster and more accurate

Three separate failure modes, one fix:

RAG. Chunking splits facts across boundaries, and the agent only ever sees the top-k snippets it gets handed once. If the chunk that mattered wasn’t in that top-k, there’s no way to look further.
BM25 alone. Same snippet limit as RAG: the agent sees a ranked list of snippets, not full documents, and can’t cross-reference across results.
DCI alone. Full-corpus grep is expensive, and gets more expensive the more the corpus is spread across machines: every search touches a distributed set instead of a local one.

Dr-DCI avoids all three: BM25 narrows the corpus to a small candidate set, then DCI runs full-precision search against that bounded set instead of the whole corpus.

The paper’s numbers, on the same benchmark, same tools, only the access pattern changed:

(Figure not preserved: raw DCI vs Dr-DCI wall time.)

Same precision tools, ~20x faster wall time, because the agent is operating over roughly 1,000 files instead of the entire corpus.

How the workspace gets populated cheap

The wall-time drop comes mostly from bounding the search scope, not from a provisioning trick. The workspace stays around 1,000-1,400 files regardless of corpus size, so every grep call scans on the order of a thousand files instead of the entire corpus.

Populating that bounded workspace still needs to be cheap. The paper’s mechanism for that is specific to a single machine:

Hard links, not copies. Materializing a pulled document creates a new directory entry pointing at the same inode. No file content gets duplicated, and the only write is filesystem metadata (the directory entry and the inode’s link count).
Dedup on pull. The harness filters out documents already in the workspace before adding new ones from a pull call, so overlapping retrieval results don’t redo work.
Root-flat namespace. No folders by rank or query. Rank-aware subfolders were tested and reduced accuracy: brittle paths confused the agent’s terminal navigation. Rank gets reported in the tool-call text instead of the file path.
Bounded reads. Read and search tool outputs are truncated with continuation hints, so one grep across the workspace can’t flood the model’s context window.
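The bounded-read idea can be illustrated with a short helper. This is a sketch under assumptions: the function name, the line-based limit, and the wording of the continuation hint are hypothetical, since the source does not give the paper's exact truncation format.

```python
def bounded_read(text: str, offset: int = 0, max_lines: int = 50) -> str:
    """Return at most `max_lines` lines starting at `offset`, plus a hint if more remain."""
    lines = text.splitlines()
    window = lines[offset:offset + max_lines]
    next_offset = offset + len(window)
    remaining = len(lines) - next_offset
    if remaining > 0:
        # The hint tells the agent how to ask for the next slice instead of
        # dumping the whole output into the context window.
        window.append(f"[truncated: {remaining} more lines, continue with offset={next_offset}]")
    return "\n".join(window)


long_output = "\n".join(f"line {i}" for i in range(120))
first_page = bounded_read(long_output, offset=0, max_lines=50)
last_page = bounded_read(long_output, offset=100, max_lines=50)
```

The same wrapper shape would apply to search output: cap the number of matching lines returned and tell the agent how to request the rest.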

A hard link only works because the corpus and the workspace are on the same filesystem. The paper never states this as an assumption. It’s an inference from the mechanism: a hard link is just another directory entry for an existing inode, and inodes are local to one filesystem, so a link cannot point across filesystems, let alone across machines. Split the corpus across machines and the mechanism breaks.
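The constraint can be observed directly: on POSIX systems, `os.link` fails with `EXDEV` when source and target are on different filesystems. The sketch below shows one way a harness might handle that. The `materialize` helper and its copy fallback are hypothetical, not something the paper describes.

```python
import errno
import os
import shutil
import tempfile
from pathlib import Path


def materialize(src: Path, dst: Path) -> str:
    """Hard-link when possible; fall back to a real copy across filesystems."""
    try:
        os.link(src, dst)
        return "linked"
    except OSError as exc:
        if exc.errno != errno.EXDEV:   # EXDEV: source and target are on different filesystems
            raise
        shutil.copy2(src, dst)         # the copy cost the single-machine design avoids
        return "copied"


base = Path(tempfile.mkdtemp())
src = base / "doc.txt"
src.write_text("retrieved document")
dst = base / "workspace_doc.txt"
outcome = materialize(src, dst)        # same directory tree, so the link succeeds
```

Once the document lives on another machine, neither branch applies as written: the bytes have to arrive over the network before anything can be linked or copied locally.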

Applying this to distributed systems

The paper runs everything on one machine. Turning the workspace-bounding idea into something that holds up across a distributed corpus is mostly unexplored, and it’s the part worth digging into next.

The paper measures one cost directly: search cost across the corpus. Raw DCI’s full-corpus terminal search times out in their results: recovered tool-result durations show p50/p90/p95/p99 single-tool times of 12.4s/97.0s/167.2s/310.2s, with a max of 24,418s. Dr-DCI’s workspace-bounding fixes that cost.

A second cost surfaces once the corpus, index, and compute aren’t on the same machine: workspace creation and materialization, moving the retrieved document bytes from wherever they live to wherever the agent’s tools run. On one machine that cost is a hard link, free. Across machines, it’s a real network fetch.

These are two separate costs:

Distributed search cost. Running BM25 across a sharded corpus. Solved the normal way, the same shape as any sharded search engine: a query fans out to N shards, each shard scores locally, results merge centrally.
Workspace creation and materialization cost. After BM25 ranks the relevant files, those files have to be moved onto a sandboxed workspace where the agent’s bash tools can operate. The cost of that gathering and provisioning is a separate cost, and it’s in scope once the corpus isn’t local.

That second cost is bounded the same way the local case is bounded: roughly 1,000 small documents per query, not the corpus. Assuming about 50KB per document, that is about 50MB per query, fetched in parallel.
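As a back-of-the-envelope check on that bound, the arithmetic is below. The 10 Gbit/s link speed is purely an assumption for illustration, and the estimate ignores per-request latency, protocol overhead, and storage read time, so it is a lower bound on transfer time rather than a prediction.

```python
def materialization_estimate(docs: int, doc_bytes: int, link_gbps: float) -> tuple[int, float]:
    """Return (total bytes moved, ideal transfer seconds) for one workspace."""
    total_bytes = docs * doc_bytes
    seconds = total_bytes * 8 / (link_gbps * 1e9)   # bits divided by bits per second
    return total_bytes, seconds


# 1,000 documents at 50KB each over an assumed 10 Gbit/s link.
total_bytes, seconds = materialization_estimate(docs=1_000, doc_bytes=50_000, link_gbps=10.0)
```

With these inputs the payload is 50MB and the ideal wire time is a small fraction of a second, which suggests that per-document request overhead, not raw bandwidth, is the more likely bottleneck.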

Closing the gap: five patterns

The first four patterns are architectural. The fifth is operational.

1. Shard locality

A BM25 shard is a partition of the corpus: split the corpus into N pieces, and each piece gets its own inverted index, a lookup from term to the list of documents containing it, with term frequency and document length for scoring, built over just that piece’s documents. That index is built once, offline, before any query runs, and written to disk as standing files, the same way a database builds and persists an index rather than recomputing it per query. A query is scored against each shard separately, then the per-shard scores merge and re-rank centrally.
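The fan-out and central merge can be sketched as follows. The shards here are hypothetical precomputed score tables standing in for real per-shard BM25 indexes. The sketch also assumes scores from different shards are directly comparable, which in practice generally requires the shards to score with shared corpus-wide statistics (document frequencies, average length) rather than shard-local ones.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard: dict[str, float], k: int) -> list[tuple[float, str]]:
    """Stand-in for a per-shard BM25 query: return that shard's local top-k."""
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in shard.items()))


def scatter_gather(shards: list[dict[str, float]], k: int) -> list[tuple[str, float]]:
    """Fan the query out to every shard in parallel, then merge and re-rank centrally."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [(doc_id, score) for score, doc_id in merged]


# Hypothetical per-shard scores for a single query.
shards = [
    {"a": 3.1, "b": 0.4},
    {"c": 2.2, "d": 5.0},
    {"e": 1.0},
]
top3 = scatter_gather(shards, k=3)
```

Each shard only ever returns k hits, so the coordinator merges at most N times k candidates regardless of how large the partitions are.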

Colocating a shard means storing that shard’s inverted index and the actual document files for its partition on the same host. A query that only needs one shard never leaves that host, no network hop to materialize the result. A query that needs a different shard crosses a host boundary, and that’s the real cost, not anything as coarse as a datacenter.

Worth noting where the query string itself comes from: in Dr-DCI, pull(query, k) isn’t handed a raw user query, it’s handed whatever search string the agent decides to write. The agent is already an LLM reasoning about the task, so query formulation happens as a side effect of that reasoning, no separate query-rewriting step needed. Systems that don’t already have an LLM generating the query, a traditional search box, for instance, usually add a cheap rewriting or expansion step in front of BM25 for the same reason: raw user text often misses the exact tokens the index needs.

2. Content-addressable caching, scoped to what doesn’t change

The paper’s dedup only applies within one agent trajectory. A shared cache keyed by content hash turns a second reference to the same document into a cache hit instead of a second fetch. The strategy splits depending on whether the corpus changes.

Immutable content (archived news articles, completed filings): key by content hash, cache forever. This is the same shape as a pull-through registry cache (Harbor, Zot, any OCI-compliant mirror), content addressed by digest, with no invalidation logic needed because the content behind a key can’t change. A multi-tier version, local node cache, then a regional mirror, then origin storage, works the same way: a miss at one tier checks the next before going to origin, and the result gets written back down through whichever tiers missed.
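A minimal sketch of the multi-tier, content-addressed lookup described above. The tiers are plain dictionaries standing in for a node-local cache, a regional mirror, and origin storage. `TieredCache` and the `sha256:` key prefix are illustrative choices, not an API from any registry.

```python
import hashlib


def digest(content: bytes) -> str:
    """Content address: the key is derived from the bytes, so it can never go stale."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


class TieredCache:
    def __init__(self, tiers: list[dict[str, bytes]], origin: dict[str, bytes]) -> None:
        self.tiers = tiers            # ordered nearest-first, e.g. [node_cache, regional_mirror]
        self.origin = origin          # hypothetical origin store keyed by digest
        self.origin_fetches = 0

    def get(self, key: str) -> bytes:
        missed = []
        for tier in self.tiers:
            if key in tier:
                content = tier[key]
                break
            missed.append(tier)
        else:                         # no tier had it: go to origin
            content = self.origin[key]
            self.origin_fetches += 1
        assert digest(content) == key     # cheap integrity check that comes free with the key
        for tier in missed:
            tier[key] = content           # write back through the tiers that missed
        return content


doc = b"archived filing text"
key = digest(doc)
origin = {key: doc}
regional: dict[str, bytes] = {}
node_a: dict[str, bytes] = {}
node_b: dict[str, bytes] = {}

cache_a = TieredCache([node_a, regional], origin)
cache_a.get(key)      # misses both tiers, fetches from origin, fills both
cache_a.get(key)      # node-local hit

cache_b = TieredCache([node_b, regional], origin)
cache_b.get(key)      # a different node: regional hit, origin untouched
```

Because the key is a digest of the bytes, a hit at any tier is safe to serve without checking freshness.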

Mutable content (a wiki page, a ticket, a document under active edit): a hash of current content isn’t a stable key, since the key changes the moment the content does. A stale entry can silently serve outdated data. Needs cache invalidation.

A tiered cache for mutable content: a per-node in-memory cache (L1, fastest, smallest), a shared cache like Redis (L2, shared across nodes), and the source of truth behind it (L3). L2 is easy to keep correct: a write-through updates Redis and the source together. L1 is the risk, each node’s local copy can go stale silently if nothing tells it the data changed. Two fixes: short TTLs on L1, so staleness is bounded and self-healing, or pub/sub invalidation, where a “this key changed” event (Redis keyspace notifications, for example) gets broadcast to every node so L1 evicts the key on receipt. TTL is simpler and eventually consistent. Pub/sub is more precise but adds a dependency every node has to handle correctly.
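Both L1 fixes can be sketched in one small class. This is an illustration under assumptions: `L1Cache` is a hypothetical name, the `load` callback stands in for the L2 or source-of-truth lookup, and the pub/sub transport that would call `invalidate` is out of scope. An injectable clock keeps the example deterministic.

```python
import time
from typing import Callable


class L1Cache:
    """Per-node cache with a TTL bound on staleness and an explicit invalidation hook."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: dict[str, tuple[float, bytes]] = {}

    def get(self, key: str, load: Callable[[str], bytes]) -> bytes:
        hit = self.entries.get(key)
        if hit is not None and self.clock() - hit[0] < self.ttl:
            return hit[1]                      # fresh enough: serve the local copy
        value = load(key)                      # fall through to L2 / source of truth
        self.entries[key] = (self.clock(), value)
        return value

    def invalidate(self, key: str) -> None:
        """Handler for a 'this key changed' event delivered by some pub/sub channel."""
        self.entries.pop(key, None)


now = [0.0]                                    # fake clock for a deterministic walkthrough
source = {"wiki/page": b"v1"}
l1 = L1Cache(ttl_seconds=5.0, clock=lambda: now[0])

r1 = l1.get("wiki/page", source.__getitem__)   # loads v1
source["wiki/page"] = b"v2"
r2 = l1.get("wiki/page", source.__getitem__)   # still within TTL: stale v1 is served
now[0] = 6.0
r3 = l1.get("wiki/page", source.__getitem__)   # TTL expired: reloads v2
source["wiki/page"] = b"v3"
l1.invalidate("wiki/page")                     # event-driven eviction, no waiting for TTL
r4 = l1.get("wiki/page", source.__getitem__)   # reloads v3 immediately
```

The second read shows the staleness window that TTL alone leaves open. The invalidation path closes it, at the price of every node needing to receive and handle the event.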

In practice, most real corpora need both paths at once: a long-lived content-addressable cache for the immutable majority, a versioned or write-through path for whatever subset changes.

3. Fast sandbox setup and teardown

Each agent task materializes a workspace, grows it across its pull() calls, then discards it when the task ends. If that workspace is a sandboxed environment, a microVM or lightweight container, rather than just a directory on shared disk, the create and destroy cost matters as much as the data-movement cost.

A base image with the corpus-access tooling pre-staged, plus a thin copy-on-write layer per query, keeps setup close to instant and removes any cleanup cost on teardown. This is the same problem agent sandbox runtimes already solve for code execution, applied here to corpus-search workspaces instead.

4. Prefetch on rank

pull() already returns a ranked list before the agent decides what to inspect. Fetching the full top-k in parallel with the agent reading that ranked preview hides fetch latency behind the model’s own reasoning step.
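One way to sketch this overlap is with a thread pool: return the ranked preview immediately and let the document fetches run in the background. Everything here is hypothetical: `fetch` simulates network latency with a sleep, and the second sleep stands in for the model reading the ranked list.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def fetch(doc_id: str) -> bytes:
    """Hypothetical remote fetch; the sleep stands in for network latency."""
    time.sleep(0.05)
    return f"contents of {doc_id}".encode()


def pull_with_prefetch(ranked_ids: list[str], pool: ThreadPoolExecutor) -> dict[str, Future]:
    """Start fetching every ranked document at once and return handles without blocking."""
    return {doc_id: pool.submit(fetch, doc_id) for doc_id in ranked_ids}


with ThreadPoolExecutor(max_workers=8) as pool:
    ranked = ["d7", "d2", "d9"]
    pending = pull_with_prefetch(ranked, pool)   # returns before any fetch completes
    time.sleep(0.1)                              # stands in for the model's reasoning step
    started = time.monotonic()
    body = pending["d2"].result()                # blocks only if the fetch is still running
    waited = time.monotonic() - started
```

If the reasoning step takes longer than the fetch, `waited` is close to zero and the network cost is fully hidden. If the agent picks a document before its fetch lands, it pays only the remaining time.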

5. Precompute hot query results

Once this runs in production, query patterns aren’t uniform. Some documents get pulled far more often than others. Track which documents come back most frequently across queries, and pre-stage those into the shared cache ahead of time, instead of waiting for the first miss to populate it.

For BM25 specifically, also cache the ranked result list for the most common queries or query terms, so a repeat query skips the scoring step entirely and goes straight to a cache hit. This makes the content-addressable cache from pattern 2 warm before the first real query lands, instead of only filling up reactively.
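A minimal sketch of both ideas: counting which documents come back most often, and caching ranked lists for repeat queries. `HotTracker` and `query_key` are hypothetical names. The query key sorts lowercased terms on the assumption that scoring is order-insensitive, as bag-of-words BM25 is, and cached result lists are assumed valid only while the index itself does not change.

```python
from collections import Counter
from typing import Callable, Optional


def query_key(query: str) -> str:
    """Normalize a query so term order and case do not create separate cache entries."""
    return " ".join(sorted(query.lower().split()))


class HotTracker:
    def __init__(self) -> None:
        self.doc_hits: Counter = Counter()
        self.results: dict[str, list[str]] = {}

    def record(self, query: str, ranked_ids: list[str]) -> None:
        """Call after every live retrieval to update frequencies and the result cache."""
        self.doc_hits.update(ranked_ids)
        self.results[query_key(query)] = list(ranked_ids)

    def lookup(self, query: str) -> Optional[list[str]]:
        """Return a cached ranked list for a repeat query, or None on a miss."""
        return self.results.get(query_key(query))

    def prewarm(self, cache: dict[str, bytes], fetch: Callable[[str], bytes], n: int) -> None:
        """Pre-stage the n most frequently returned documents into the shared cache."""
        for doc_id, _ in self.doc_hits.most_common(n):
            if doc_id not in cache:
                cache[doc_id] = fetch(doc_id)


tracker = HotTracker()
tracker.record("error E4012", ["a", "b"])
tracker.record("E4012 error", ["a", "b"])      # same terms, different order
tracker.record("policy", ["b", "c"])

shared_cache: dict[str, bytes] = {}
tracker.prewarm(shared_cache, fetch=lambda doc_id: doc_id.encode(), n=2)
repeat = tracker.lookup("e4012 ERROR")         # served from the cache, scoring skipped
miss = tracker.lookup("holiday schedule")
```

In a real deployment the frequency counts would need decay or a sliding window so that yesterday's hot documents do not crowd out today's.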

Source paper: Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion, arXiv:2606.14885 (June 2026).

Original post: Jo Kristian Bergum on X, AI Engineer World’s Fair 2026.

Credits to @zhuofengli96475 (and other paper authors) and @jobergum for the presentation.
