CHOP: Chunkwise Context-Preserving Framework for RAG on Multi Documents
Summary
CHOP is a framework for improving RAG systems on multi-document retrieval by using context-aware metadata and LLM-based chunk relevance evaluation to reduce semantic conflicts and hallucinations. The approach achieves 90.77% Top-1 Hit Rate through intelligent chunking and contextual preservation strategies.
# Chunkwise Context-Preserving Framework for RAG on Multi Documents
Source: https://arxiv.org/html/2604.15802
###### Abstract
Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that iteratively evaluates chunk relevance with Large Language Models (LLMs) and progressively reconstructs documents by determining their association with specific topics or query types. CHOP integrates two key components: the CNM-Extractor, which generates compact per-chunk signatures capturing categories, key nouns, and model names, and the Continuity Decision Module, which preserves contextual coherence by deciding whether consecutive chunks belong to the same document flow. By prefixing each chunk with context-aware metadata, CHOP reduces semantic conflicts among similar documents and enhances retriever discrimination. Experiments on benchmark datasets show that CHOP alleviates retrieval confusion and provides a scalable approach for building high-quality knowledge bases, achieving a Top-1 Hit Rate of 90.77% and notable gains in ranking quality metrics.
RAG; Chunking strategies; Multi-document retrieval
## 1. Introduction
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks. Nonetheless, they often underperform in knowledge-intensive or domain-specific scenarios, particularly when queries extend beyond their training distribution or require up-to-date information, leading to hallucinations (Kandpal et al., 2023; Zhang et al., 2025). Retrieval-Augmented Generation (RAG) offers a promising solution by retrieving semantically relevant evidence from external knowledge sources, grounding model outputs to mitigate factual errors and enhance practical utility (Lewis et al., 2020; Feng et al., 2024). However, a major limitation of current RAG pipelines arises from length-based segmentation, which disrupts continuity by fragmenting coreference and local references (e.g., "this method," "Eq. (3)"), thereby reducing interpretability and recall. In multi-document contexts, redundancy, duplication, and conflicting references further increase retrieval ambiguity and destabilize grounding. Moreover, irrelevant or weakly related passages can mislead LLMs, compounding the risk of inaccurate outputs.
Complementary to these observations, **query-side strategies** improve retrieval by refining input representations, such as rewriting queries through transformation (e.g., rewrite–retrieve–read) (Ma et al., 2023), shifting embeddings toward answer-level semantics with hypothetical document embeddings (HyDE) (Gao et al., 2023), abstracting queries into higher-level concepts via step-back prompting (Zheng et al., 2023), and combining keyword, semantic, and vector search through hybrid retrieval (Karpukhin et al., 2020). On the **document side**, methods like ChunkRAG (Singh et al., 2024) segment text into semantically coherent units and apply multi-stage relevance scoring with self-reflection, critic models, redundancy removal, and dynamic thresholding. **Indexing optimization** further improves dense retrieval by explicitly modeling relevance signals (e.g., source quality, cross-document consistency, citation alignment) (Wu et al., 2024).
Despite recent advances, most RAG approaches still process chunks independently, neglecting discourse-level dependencies across and within documents. Ignoring these relationships—such as definitions, coreference links, and discourse transitions—often leads to incomplete or misleading grounding, a problem that is especially severe in multi-document contexts with near-duplicate content and localized references. Addressing this gap requires context-preserving representations and relevance-aware mechanisms that propagate reliability signals across chunks and into generation, ensuring that decisions at time t are conditioned on information established up to t−1.
Figure 1. Overview of the CHOP architecture comprising two components: the Continuity Decision module, which determines whether consecutive chunks are continuous, and the CNM-Extractor, which generates compact representations. When continuity holds (left), the next chunk inherits the previous CNM; when it does not (right), a new CNM is extracted. Each chunk is then prefixed with its CNM, and the combined representations are embedded to construct the vector database.
To tackle the limitations of chunk-level independence and semantic conflicts in multi-document collections, we propose **CHOP**, a Chunkwise Context-Preserving framework for RAG. CHOP enhances retrieval by processing chunks sequentially and enriching them with context-aware embeddings. It is designed for long documents with high lexical or semantic overlap across sub-documents and introduces two key components:
(1) the **CNM-Extractor** (Category–Noun–Model Extractor), which generates compact per-chunk signatures capturing categories, key nouns, and model names, and
(2) the **Continuity Decision Module**, which determines whether to inherit the previous chunk's CNM or extract a new one, depending on contextual continuity.
By prefixing each chunk with its CNM, CHOP regularizes the embedding space, mitigates semantic collisions, and reduces retrieval confusion when similar segments coexist—improving relevance in high-overlap settings without retraining the retriever.
## 2. Method
CHOP is designed for composite long documents characterized by substantial lexical and semantic overlap across sub-documents, a domain that makes retrieval particularly challenging. To handle this, we introduce a sequential pipeline that assigns a consistent per-chunk representation and updates it only when contextual shifts occur. The pipeline consists of two components:
(1) the **CNM-Extractor** (Category–Noun–Model Extractor), which generates key representations from each chunk, and
(2) the **Continuity Decision module**, which determines whether to inherit the previous chunk's CNM or re-extract a new one based on continuity.
### 2.1. CNM-Extractor
The input file F is split into a sequence of fixed-size chunks {C₁,...,Cₙ}. We define CNM as the triplet {Category, Nouns, Model}, which serves as a compact per-chunk signature that anchors each chunk's semantic position in long documents and facilitates downstream retrieval. The fields are specified as follows:
- **Category**: the broad product family described by the chunk (e.g., *camera*, *air conditioner*, *fan*, *flower*, *boat*).
- **Nouns**: one or two key nouns that capture the core target or action, including parts, operations, or functions (e.g., *air conditioner filter*).
- **Model**: the specific model or series name (e.g., *225B*, *X-SERIES*).
As shown in Eq. (1), for each chunk Cᵢ, the CNM-Extractor takes Cᵢ as input, derives its CNM, and produces the corresponding representation Mᵢ.
(1) Mᵢ = {Categoryᵢ, Nounsᵢ, Modelᵢ}
The CNM-Extractor is an LLM-based module that analyzes a chunk and outputs structured information following a predefined template. Listing 1 shows an example prompt. The prompt is divided into instructional sections, and any field that is not explicitly stated in the chunk, or that is ambiguous, must be set to null. The Nouns field requires one or two key nouns, with the first strictly constrained to the form "<category> <specific noun>". The output is enforced as JSON only, avoiding free-form text to prevent format drift. Even when a chunk is processed independently, the extracted CNM is injected as a prefix to stabilize the contextual frame and clarify references to entities, symbols, and units.
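The CNM-Extractor can be sketched as a thin wrapper around an LLM call. The snippet below is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical helper standing in for a call to the backbone LLM (Gemma-12B at temperature 0, per Section 3.1), and the prompt is a condensed paraphrase of Listing 1.

```python
import json

# Hypothetical helper: sends a prompt to the backbone LLM and returns the raw
# completion string. Not part of the paper's released code.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

# Condensed paraphrase of Listing 1; the full prompt is shown later in the paper.
CNM_PROMPT = """Extract product CNM from the text below.
Return JSON ONLY with keys: category, nouns, model.
Use null for any field that is absent or ambiguous.
Input Text: {text}"""

def extract_cnm(chunk: str) -> dict:
    """Derive the per-chunk CNM signature M_i = {Category, Nouns, Model} (Eq. 1)."""
    raw = call_llm(CNM_PROMPT.format(text=chunk[:1000]))  # truncate as in Listing 1
    try:
        cnm = json.loads(raw)
    except json.JSONDecodeError:
        # JSON-only output reduces format drift, but fall back to nulls
        # if the model still returns free-form text.
        cnm = {}
    return {
        "category": cnm.get("category"),
        "nouns": cnm.get("nouns"),
        "model": cnm.get("model"),
    }
```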
Table 1. Comparison of retrieval performance (by Top-K)
### 2.2. Continuity Decision
In multi-document contexts (e.g., manuals, statutes, policy books), topic shifts at chunk boundaries often cause semantic drift or noise amplification, which degrades retrieval quality. The Continuity Decision (CD) module addresses this by distinguishing between contextual continuity and heterogeneous topic separation.
The Continuity Decision module CD(·) is an LLM-based classifier that, given a pair of adjacent chunks (Cᵢ, Cᵢ₊₁) and a decision prompt, determines whether Cᵢ₊₁ is a continuation of the same document as Cᵢ or begins a new one. The decision function is formally defined in Eq. (2).
(2) contᵢ→ᵢ₊₁ ← CD(Cᵢ, Cᵢ₊₁) ∈ {TRUE, FALSE}
In practice, CD is implemented as an LLM that is prompted with explicit decision rules (Listing 2). The prompt includes several elements to ensure consistent segmentation: a decision goal that instructs the model to treat chunks as part of the same manual unless strong evidence suggests otherwise, a set of decision rules that define explicit conditions for starting a new manual, and paired inputs consisting of the previous anchor text and the current text to enable contextual comparison.
When the continuity decision is TRUE, the two chunks are treated as part of the same topical flow, indicating preserved context. In contrast, FALSE denotes a topic shift and thus a contextual break.
**Listing 1: Prompt Example for Metadata Extraction**
```
Extract product CNM from the text below.
Rules:
- category: The most general product family this text describes. (camera, air conditioner, ...). Use null if absent.
- model: The specified model/series (e.g., 225B, X-SERIES). Use null if absent.
- nouns: 1~2 key product nouns.
* The FIRST item MUST be a core compound noun. If the focus is a specific part/task/function within the same category, format it as "<category> <specific noun>"
* Any remaining items should be a single noun (part/operation/function) e.g., battery, charger, filter, nozzle, fan
- Always give best estimate + confidence score.
- Return JSON ONLY.
Input Text: {text[:1000]}
```
Based on the decision, Mᵢ₊₁ is determined as follows, as defined in Eq. (3):
(3) Mᵢ₊₁ = Mᵢ if contᵢ→ᵢ₊₁ = TRUE; Mᵢ₊₁ = CNM(Cᵢ₊₁) if contᵢ→ᵢ₊₁ = FALSE
In segments judged to belong to the same flow, CNM is kept piecewise constant to preserve semantic coherence. In implementation, the CNM of the previous chunk is inherited to maintain contextual continuity. When adjacent chunks are deemed dissimilar, CNM is re-extracted with the CNM-Extractor. Re-extraction explicitly marks a topic shift and prevents contamination or error propagation from prior labels.
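A minimal sketch of the sequential assignment in Eq. (3), assuming the `extract_cnm` helper from the earlier sketch and a hypothetical `is_continuation` function that wraps the CD(·) classifier prompted with the rules in Listing 2:

```python
def is_continuation(prev_chunk: str, curr_chunk: str) -> bool:
    """Hypothetical CD(.) classifier: prompts the LLM with the decision rules in
    Listing 2 and returns TRUE/FALSE for cont_{i->i+1} (Eq. 2)."""
    raise NotImplementedError

def assign_cnms(chunks: list[str]) -> list[dict]:
    """Sequentially assign a CNM to every chunk, inheriting the previous CNM when
    continuity holds and re-extracting when a topic shift is detected (Eq. 3)."""
    cnms = []
    for i, chunk in enumerate(chunks):
        if i == 0 or not is_continuation(chunks[i - 1], chunk):
            cnms.append(extract_cnm(chunk))   # contextual break: extract a new CNM
        else:
            cnms.append(cnms[-1])             # same flow: CNM kept piecewise constant
    return cnms
```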
Given the decided CNM Mᵢ, we construct xᵢ and eᵢ as defined in Eq. (4).
(4) xᵢ = [PFX(Mᵢ) ∥ Cᵢ], eᵢ = Embed(xᵢ)
We embed xᵢ into a d-dimensional vector eᵢ ∈ ℝᵈ and store both the textual input xᵢ and its embedding eᵢ in a vector database. This procedure improves the quality of the evidence set for downstream stages, thereby enhancing the stability and reliability of the RAG pipeline.
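The prefix-and-embed step in Eq. (4) could look roughly like the sketch below. It uses the embedding model and vector store named in Section 3.1 (OpenAI's text-embedding-3-large and ChromaDB with a cosine HNSW index); the exact string format of PFX(Mᵢ) is not specified in the paper, so the one shown is an assumption.

```python
import chromadb
from openai import OpenAI

openai_client = OpenAI()
chroma = chromadb.Client()
# Cosine-similarity HNSW collection, matching the setup described in Section 3.1.
collection = chroma.create_collection("chop_chunks", metadata={"hnsw:space": "cosine"})

def pfx(cnm: dict) -> str:
    # Assumed prefix format; the paper does not specify the exact PFX(M_i) string.
    return f"[category: {cnm['category']} | nouns: {cnm['nouns']} | model: {cnm['model']}]"

def index_chunks(chunks: list[str], cnms: list[dict]) -> None:
    """Build x_i = [PFX(M_i) || C_i], embed it, and store both text and vector (Eq. 4)."""
    for i, (chunk, cnm) in enumerate(zip(chunks, cnms)):
        x_i = pfx(cnm) + " " + chunk
        e_i = openai_client.embeddings.create(
            model="text-embedding-3-large", input=x_i
        ).data[0].embedding  # 3,072-dimensional vector
        collection.add(ids=[f"chunk-{i}"], documents=[x_i], embeddings=[e_i])
```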
**Listing 2: Prompt Example for Continuity Decision**
```
Decision goal: Be conservative and keep texts in the SAME manual...
Decision rules:
- same = true when: same product category/model, or just section shift...
- same = false when: product category changes, new model (e.g., 225B -> 226R), or new doc markers...
Previous anchor: [text]
Current text to judge: [text]
```
Table 2. Comparison of generation performance (by Top-K)
## 3. Experiments
### 3.1. Implementation Details
All chunks and queries are encoded with OpenAI's text-embedding-3-large model, producing 3,072-dimensional vectors eᵢ ∈ ℝ³⁰⁷². The embeddings are stored in ChromaDB (Contributors, 2023), with an HNSW (Hierarchical Navigable Small World) index (Malkov and Yashunin, 2018) used for approximate nearest-neighbor search. We perform retrieval by computing cosine similarity between the query and chunk embeddings, and the top-k most similar chunks are selected as matches. Because the focus of this study is classification and decision-making rather than text generation, the temperature is fixed at 0 to ensure deterministic outputs and reproducibility. The backbone LLM in all experiments is Gemma-12B (Team et al., 2024).
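Continuing the indexing sketch from Section 2.2, retrieval could be implemented as follows; the `openai_client` and `collection` objects are assumed from that sketch, and the query is encoded with the same embedding model before a Top-k cosine-similarity lookup against the HNSW index.

```python
def retrieve(query: str, k: int = 5) -> list[str]:
    """Encode the query and return the Top-k most similar stored chunks."""
    q_emb = openai_client.embeddings.create(
        model="text-embedding-3-large", input=query
    ).data[0].embedding
    result = collection.query(query_embeddings=[q_emb], n_results=k)
    return result["documents"][0]
```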
**Dataset.** We evaluate on MRAMG-Bench (Yu et al., 2025), a benchmark derived from product manuals in ManualsLib (ManualsLib, 2025). While the original dataset presents each manual as multiple split lists or fragments, we reconstruct each manual into a single continuous file to better emulate real-world usage. This weakens explicit document-boundary cues, requiring retrieval to rely more heavily on contextual continuity. At the same time, natural topic transitions and inter-section dependencies are preserved, making the corpus more faithful to real manual structures. All methods are evaluated under this reconstructed single-document setting.
### 3.2. Retrieval Evaluation
This study systematically verifies whether the proposed method, CHOP, improves retrieval performance. To isolate the effect of CHOP's consistent prefix injection, the retrieval stack—embedding, indexing, and similarity computation—is kept fixed, and only the chunk composition is varied. We then apply diverse chunking strategies and quantify their impact on retrieval performance.
#### 3.2.1. Baseline
- **Naive-500T**: Each document is uniformly split into fixed-length chunks of 500 tokens with an overlap of 100 tokens.
- **Cosine-Chunking** (Singh et al., 2024): Each document is split into sentences, and topic-shift boundaries are detected using a cosine similarity threshold of 0.35 between consecutive sentences, producing adaptively sized chunks. A minimal sketch of both baselines follows this list.
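The sketch below illustrates both baselines under stated assumptions: tiktoken's cl100k_base tokenizer stands in for the (unnamed) tokenizer used for 500-token splitting, and `embed` is any sentence-embedding function supplied by the caller.

```python
import tiktoken
import numpy as np

enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer; the paper does not name one

def naive_500t(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Naive-500T: fixed 500-token chunks with a 100-token overlap."""
    tokens = enc.encode(text)
    step = size - overlap
    return [enc.decode(tokens[s:s + size]) for s in range(0, len(tokens), step)]

def cosine_chunking(sentences: list[str], embed, threshold: float = 0.35) -> list[str]:
    """Cosine-Chunking: start a new chunk when consecutive-sentence similarity
    drops below the threshold. `embed` maps a sentence to a vector (assumption)."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, curr in zip(sentences, sentences[1:]):
        a, b = embed(prev), embed(curr)
        sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        if sim < threshold:               # topic-shift boundary detected
            chunks.append(" ".join(current))
            current = []
        current.append(curr)
    chunks.append(" ".join(current))
    return chunks
```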
#### 3.2.2. Evaluation Metrics
Table 1 compares the retrieved set for each query against its gold evidence. Under a Top-K assumption, we report standard retrieval metrics: Hit Rate at K (Hit@K), Mean Reciprocal Rank at K (MRR@K), and Normalized Discounted Cumulative Gain at K (NDCG@K) (Järvelin and Kekäläinen, 2002). Hit@K measures the proportion of queries for which the Top-K retrieved set contains at least one gold-evidence chunk. MRR@K computes the average reciprocal rank of the first relevant chunk per query within the Top-K list. NDCG@K applies logarithmic rank discounts to the cumulative gain, measuring overall ranking quality.
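These metrics can be computed as follows, assuming binary relevance over chunk IDs for the retrieved list and the gold-evidence set; this is an illustrative sketch rather than the benchmark's official scorer.

```python
import math

def hit_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Hit@K: 1 if any of the Top-K retrieved chunks is gold evidence, else 0."""
    return 1.0 if any(doc in gold for doc in retrieved[:k]) else 0.0

def mrr_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """MRR@K: reciprocal rank of the first relevant chunk within the Top-K list."""
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """NDCG@K with binary relevance: gains are discounted by log2(rank + 1) and
    normalized by the ideal DCG for the same number of gold chunks."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, doc in enumerate(retrieved[:k], start=1) if doc in gold)
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(gold), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```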
Similar Articles

Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems
W-RAC introduces a cost-efficient chunking framework for web document processing in RAG systems that reduces LLM token usage by an order of magnitude through structured content representation and retrieval-aware grouping decisions. The method decouples text extraction from semantic chunk planning, achieving comparable or better retrieval performance than traditional chunking approaches while minimizing hallucination risks.
RAG-Anything: All-in-One RAG Framework
RAG-Anything is a new open-source framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.
@akshay_pachaar: Naive RAG vs. Blockify! There's a new RAG approach that: - cuts corpus size by 40x. - reduces tokens per query by 3x. -…
Blockify is a new open-source RAG framework that replaces naive chunking with a patented 'IdeaBlocks' pipeline, claiming 40x corpus size reduction, 3x token efficiency, and 2.3x vector search accuracy improvements. It transforms enterprise documents into structured XML knowledge units for more coherent LLM retrieval.
Disco-RAG: Discourse-Aware Retrieval-Augmented Generation
Disco-RAG proposes a discourse-aware retrieval-augmented generation framework that integrates discourse signals through intra-chunk discourse trees and inter-chunk rhetorical graphs to improve knowledge synthesis in LLMs. The method achieves state-of-the-art results on QA and summarization benchmarks without fine-tuning.
LightRAG: Simple and Fast Retrieval-Augmented Generation
The article introduces LightRAG, an open-source framework that enhances Retrieval-Augmented Generation by integrating graph structures for improved contextual awareness and efficient information retrieval.