@tom_doerr: Knowledge graph framework for RAG using semantic aggregation and hierarchical retrieval https://github.com/KnowledgeXLa…

Summary

LeanRAG is an open-source framework that enhances Retrieval-Augmented Generation using knowledge graph structures with semantic aggregation and hierarchical retrieval for context-aware, high-fidelity responses. It has been accepted at AAAI-26.


Cached at: 06/29/26, 06:24 AM


KnowledgeXLab/LeanRAG

Source: https://github.com/KnowledgeXLab/LeanRAG

LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval

🎉 This paper has been accepted by AAAI-26! 🎉

LeanRAG is an efficient, open-source framework for Retrieval-Augmented Generation, leveraging knowledge graph structures with semantic aggregation and hierarchical retrieval to generate context-aware, concise, and high-fidelity responses.

✨ Features

  • Semantic Aggregation: Clusters entities into semantically coherent summaries and constructs explicit relations to form a navigable aggregation-level knowledge network.
  • Hierarchical, Structure-Guided Retrieval: Initiates retrieval from fine-grained entities and traverses up the knowledge graph to gather rich, highly relevant evidence efficiently.
  • Reduced Redundancy: Optimizes retrieval paths to significantly reduce redundant information—LeanRAG achieves ~46% lower retrieval redundancy compared to flat retrieval baselines (based on benchmark evaluations).
  • Benchmark Performance: Demonstrates superior performance across multiple QA benchmarks with improved response quality and retrieval efficiency.

🏛️ Architecture Overview

Figure: Overview of LeanRAG

LeanRAG’s processing pipeline follows these core stages:

  1. Semantic Aggregation

    • Group low-level entities into clusters; generate summary nodes and build adjacency relations among them for efficient navigation.
  2. Knowledge Graph Construction

    • Construct a multi-layer graph where nodes represent entities and aggregated summaries, with explicit inter-node relations for graph-based traversal.
  3. Query Processing & Hierarchical Retrieval

    • Anchor queries at the most relevant detailed entities (“bottom-up”), then traverse upward through the semantic aggregation graph to collect evidence spans.
  4. Redundancy-Aware Synthesis

    • Streamline retrieval paths and avoid overlapping content, ensuring concise evidence aggregation before generating responses.
  5. Generation

    • Use retrieved, well-structured evidence as input to an LLM to produce coherent, accurate, and contextually grounded answers.
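As a rough illustration of stages 3 to 5, the sketch below walks upward from anchor entities through a layered graph, keeps each node's description only once, and assembles the result into a prompt. The `Node` class, `gather_evidence`, `build_prompt`, and the prompt layout are hypothetical names invented for this example; they are not taken from the LeanRAG code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical node type: layer 0 is an extracted entity, higher layers are summaries."""
    name: str
    description: str
    layer: int
    parent: Optional[str] = None  # name of the summary node one layer up, if any


def gather_evidence(anchors, graph):
    """Walk upward from each anchor and keep every node's description at most once."""
    seen = set()
    evidence = []
    for anchor in anchors:
        current = anchor
        # Stop at the root or at a node already visited: its ancestors are covered too.
        while current is not None and current not in seen:
            seen.add(current)
            node = graph[current]
            evidence.append(f"[layer {node.layer}] {node.name}: {node.description}")
            current = node.parent
    return evidence


def build_prompt(query, evidence):
    """Assemble the de-duplicated evidence into a single LLM prompt (assumed layout)."""
    context = "\n".join(evidence)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


graph = {
    "apple": Node("apple", "a fruit", 0, "fruit"),
    "pear": Node("pear", "another fruit", 0, "fruit"),
    "fruit": Node("fruit", "edible plant parts", 1, None),
}
evidence = gather_evidence(["apple", "pear"], graph)
prompt = build_prompt("What do apples and pears have in common?", evidence)
```

Because both anchors share the `fruit` summary, its description appears once in the evidence list rather than twice, which is the kind of overlap the redundancy-aware stage is meant to avoid.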

🚀 Getting Started

Prerequisites

  • Python 3.10+
  • Conda for environment management

Installation

  1. Clone the repository:

    git clone https://github.com/KnowledgeXLab/LeanRAG.git
    cd LeanRAG
    
  2. Create a virtual environment:

    conda create -n leanrag python=3.11
    conda activate leanrag
    
  3. Install the required dependencies:

    pip install -r requirements.txt
    

💻 Usage Workflow

Here’s a typical pipeline flow:

Step 1: Document Chunking

In file_chunk.py, split the document into chunks:

  • Chunk size: 1024
  • Sliding step: 128 (each window starts 128 positions after the previous one)

Each dictionary in the resulting chunk file contains two attributes:

  • hash_code: hash calculated from the text content for traceability
  • text: the chunk text content
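A minimal sketch of this chunking step follows. It is not the contents of file_chunk.py: the function name `chunk_text` is hypothetical, the unit is assumed to be characters (the README does not state whether sizes are in characters or tokens), and MD5 is an assumed choice of hash.

```python
import hashlib


def chunk_text(text, chunk_size=1024, step=128):
    """Split text into windows of `chunk_size` characters, advancing by `step`."""
    if chunk_size <= 0 or step <= 0:
        raise ValueError("chunk_size and step must be positive")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        piece = text[start:start + chunk_size]
        chunks.append({
            # Assumption: MD5 of the UTF-8 text; the real script may hash differently.
            "hash_code": hashlib.md5(piece.encode("utf-8")).hexdigest(),
            "text": piece,
        })
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, step=2)
```

With a small window the behaviour is easy to check by hand: the ten-character input yields the windows `abcd`, `cdef`, `efgh`, and `ghij`.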

Step 2: Extract Triples and Entity Descriptions

Two knowledge graph extraction methods are currently provided:

Method 1: CommonKG

CommonKG is based on Wikipedia entities: first define a list of head entities, then extract triples from the document.

Usage:

  1. Edit the configuration file:
    CommonKG/config/create_kg_conf_test.yaml
    Fill in the model’s url and name, and the path to the chunk file.
  2. Run extraction:
    python CommonKG/create_kg.py
    
    The extraction result will be saved in output_dir.
  3. Process 6-tuples with descriptions:
    python CommonKG/deal_triple.py
    
    Outputs include:
    • entity.jsonl
    • relation.jsonl

Method 2: GraphRAG

This method relies on the LLM to perform few-shot extraction, guided by examples supplied in the prompt.

Usage:

  1. Edit GraphExtraction/chunk.py to fill in url and model. The chunk_file is the same as in CommonKG, generated from Step 1.
  2. Deduplicate extraction results:
    python GraphExtraction/deal_triple.py
    
    Outputs include:
    • entity.jsonl
    • relation.jsonl
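Both methods end with a deduplication pass that writes entity.jsonl and relation.jsonl. The sketch below shows one plausible way to do that; it is not the logic of either deal_triple.py script, and the field names (`entity_name`, `description`, `head`, `relation`, `tail`) are assumptions, since the README does not document the record schema.

```python
import json


def dedupe_entities(records):
    """Merge records that share a case-insensitive entity name, keeping distinct descriptions."""
    merged = {}
    for rec in records:
        name = rec["entity_name"].strip()
        entry = merged.setdefault(name.lower(), {"entity_name": name, "descriptions": []})
        desc = rec.get("description", "").strip()
        if desc and desc not in entry["descriptions"]:
            entry["descriptions"].append(desc)
    return list(merged.values())


def dedupe_relations(records):
    """Keep the first occurrence of each (head, relation, tail) triple."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[k].strip().lower() for k in ("head", "relation", "tail"))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


entities = dedupe_entities([
    {"entity_name": "Apple", "description": "A fruit."},
    {"entity_name": "apple ", "description": "A fruit."},
    {"entity_name": "apple", "description": "Grows on trees."},
])
relations = dedupe_relations([
    {"head": "Apple", "relation": "grows_on", "tail": "Tree"},
    {"head": "apple", "relation": "grows_on", "tail": "tree"},
])
entity_jsonl = to_jsonl(entities)
```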

Step 3: Build the Graph

    python build_graph.py

  • Cluster extracted entity and relation descriptions and generate relationships.

  • Construct a tree-structured knowledge graph, supporting retrieval and Q&A.
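To make the aggregation idea concrete, the sketch below groups entities by embedding similarity, assigns each group a summary parent, and lifts entity-level relations to relations between summaries. A simple greedy cosine-similarity grouping stands in for whatever clustering build_graph.py actually uses, and every name here (`build_layer`, `aggregate_relations`, the `summary_N` ids, the 0.8 threshold) is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_layer(entities, threshold=0.8):
    """Greedily assign each entity to the first cluster whose seed vector is similar enough."""
    clusters = []   # each cluster: {"id", "seed", "members"}
    parent_of = {}  # entity name -> id of its summary node
    for name, vec in entities.items():
        for cluster in clusters:
            if cosine(vec, cluster["seed"]) >= threshold:
                cluster["members"].append(name)
                parent_of[name] = cluster["id"]
                break
        else:
            cid = f"summary_{len(clusters)}"
            clusters.append({"id": cid, "seed": vec, "members": [name]})
            parent_of[name] = cid
    return parent_of, clusters


def aggregate_relations(relations, parent_of):
    """Lift (head, tail) entity relations to undirected edges between summary nodes."""
    edges = set()
    for head, tail in relations:
        a, b = parent_of[head], parent_of[tail]
        if a != b:
            edges.add(tuple(sorted((a, b))))
    return edges


# Toy two-dimensional embeddings; real embeddings would come from an embedding model.
entities = {"apple": [1.0, 0.0], "pear": [0.9, 0.1], "car": [0.0, 1.0]}
parent_of, clusters = build_layer(entities)
summary_edges = aggregate_relations([("apple", "pear"), ("pear", "car")], parent_of)
```

Repeating `build_layer` on the summary nodes would add further layers until a single root remains, giving the tree shape that the retrieval step relies on. In a full pipeline an LLM would presumably also write a description for each summary node; that part is omitted here.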

Step 4: Retrieval

    python query_graph.py

  1. Select the correct chunks_file.

  2. Query the graph for Top-K entities based on query.

  3. Generate paths between nodes according to the tree structure.

  4. Return same-level relationships and aggregated entity information along the paths to the LLM for final answer generation.
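Path generation over a tree can be sketched with parent pointers and a lowest-common-ancestor search, as below. This assumes the graph is stored as a child-to-parent mapping, which the README does not confirm; `path_to_root`, `path_between`, and `collect_path_nodes` are hypothetical helpers, not functions from query_graph.py.

```python
from itertools import combinations


def path_to_root(node, parent_of):
    """Return the chain [node, parent, grandparent, ...] up to the root."""
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


def path_between(a, b, parent_of):
    """Path from a to b through their lowest common ancestor, or None if they share none."""
    up_a = path_to_root(a, parent_of)
    up_b = path_to_root(b, parent_of)
    index_in_a = {node: i for i, node in enumerate(up_a)}
    for j, node in enumerate(up_b):
        if node in index_in_a:
            # Climb from a to the common ancestor, then descend to b.
            return up_a[:index_in_a[node] + 1] + up_b[:j][::-1]
    return None


def collect_path_nodes(top_k, parent_of):
    """Union of nodes on the paths between every pair of retrieved entities, in first-seen order."""
    nodes = []
    for a, b in combinations(top_k, 2):
        for node in path_between(a, b, parent_of) or []:
            if node not in nodes:
                nodes.append(node)
    return nodes


parent_of = {
    "apple": "fruit", "pear": "fruit", "car": "vehicle",
    "fruit": "root", "vehicle": "root",
}
nodes = collect_path_nodes(["apple", "pear", "car"], parent_of)
```

The descriptions of the collected nodes, together with relations between nodes on the same level, would then form the context handed to the LLM.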

📊 Results & Benchmarks

LeanRAG is compared against six baselines on four QA benchmarks spanning diverse domains (Mix, CS, Legal, and Agriculture). The tables below report per-metric scores and pairwise win rates.

Score

Mix

| Metric | LeanRAG | HiRAG | Naive | GraphRAG | LightRAG | FastGraphRAG | KAG |
|---|---|---|---|---|---|---|---|
| Comprehensiveness | 8.89±0.01 | 8.72±0.02 | 8.20±0.01 | 8.52±0.01 | 8.19±0.02 | 6.56±0.02 | 7.90±0.03 |
| Empowerment | 8.16±0.02 | 7.86±0.03 | 7.52±0.03 | 7.73±0.02 | 7.56±0.03 | 5.82±0.03 | 7.41±0.04 |
| Diversity | 7.73±0.01 | 7.21±0.02 | 6.65±0.03 | 7.04±0.02 | 6.69±0.04 | 4.88±0.03 | 6.42±0.04 |
| Overall | 8.59±0.01 | 8.08±0.02 | 7.47±0.02 | 7.87±0.01 | 7.61±0.04 | 5.76±0.02 | 7.25±0.03 |

CS

| Metric | LeanRAG | HiRAG | Naive | GraphRAG | LightRAG | FastGraphRAG | KAG |
|---|---|---|---|---|---|---|---|
| Comprehensiveness | 8.92±0.01 | 8.92±0.01 | 8.94±0.01 | 8.55±0.02 | 8.76±0.02 | 6.79±0.01 | 8.22±0.02 |
| Empowerment | 8.68±0.02 | 8.66±0.02 | 8.69±0.04 | 8.28±0.04 | 8.50±0.04 | 6.67±0.04 | 8.52±0.05 |
| Diversity | 7.87±0.02 | 7.84±0.02 | 7.79±0.02 | 7.42±0.02 | 7.63±0.04 | 5.45±0.04 | 7.03±0.02 |
| Overall | 8.82±0.02 | 8.77±0.02 | 8.77±0.03 | 8.37±0.04 | 8.59±0.04 | 6.31±0.03 | 7.99±0.03 |

Legal

| Metric | LeanRAG | HiRAG | Naive | GraphRAG | LightRAG | FastGraphRAG | KAG |
|---|---|---|---|---|---|---|---|
| Comprehensiveness | 8.88±0.02 | 8.68±0.02 | 8.85±0.01 | 8.95±0.01 | 8.24±0.02 | 3.87±0.02 | 8.41±0.02 |
| Empowerment | 8.42±0.03 | 8.18±0.06 | 8.28±0.03 | 8.33±0.02 | 7.83±0.05 | 3.53±0.03 | 8.20±0.03 |
| Diversity | 7.49±0.03 | 7.00±0.03 | 7.10±0.04 | 7.47±0.03 | 6.87±0.01 | 2.87±0.02 | 6.71±0.01 |
| Overall | 8.49±0.04 | 8.00±0.04 | 8.21±0.03 | 8.44±0.01 | 7.74±0.03 | 3.43±0.02 | 7.83±0.03 |

Agriculture

| Metric | LeanRAG | HiRAG | Naive | GraphRAG | LightRAG | FastGraphRAG | KAG |
|---|---|---|---|---|---|---|---|
| Comprehensiveness | 8.94±0.06 | 8.99±0.00 | 8.85±0.01 | 8.97±0.01 | 8.71±0.01 | 3.28±0.01 | 8.22±0.01 |
| Empowerment | 8.66±0.02 | 8.52±0.02 | 8.51±0.03 | 8.52±0.02 | 8.23±0.02 | 3.29±0.05 | 8.33±0.06 |
| Diversity | 8.06±0.03 | 7.98±0.02 | 7.76±0.06 | 7.95±0.02 | 7.68±0.03 | 3.01±0.03 | 7.07±0.02 |
| Overall | 8.87±0.02 | 8.87±0.03 | 8.69±0.03 | 8.85±0.01 | 8.56±0.02 | 3.17±0.02 | 7.95±0.03 |

Winrate

NaiveRAG vs LeanRAG

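| Metric | Mix (NaiveRAG) | Mix (LeanRAG) | CS (NaiveRAG) | CS (LeanRAG) | Legal (NaiveRAG) | Legal (LeanRAG) | Agriculture (NaiveRAG) | Agriculture (LeanRAG) |
|---|---|---|---|---|---|---|---|---|
| Comprehensiveness | 11.9% | 88.1% | 41.0% | 59.0% | 30.0% | 70.0% | 37.7% | 62.3% |
| Empowerment | 1.5% | 98.5% | 40.5% | 59.5% | 24.5% | 75.5% | 19.8% | 80.2% |
| Diversity | 3.1% | 96.9% | 28.0% | 72.0% | 9.0% | 91.0% | 10.0% | 90.0% |
| Overall | 2.7% | 97.3% | 39.5% | 60.5% | 23.5% | 76.5% | 19.3% | 80.7% |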

GraphRAG vs LeanRAG

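| Metric | Mix (GraphRAG) | Mix (LeanRAG) | CS (GraphRAG) | CS (LeanRAG) | Legal (GraphRAG) | Legal (LeanRAG) | Agriculture (GraphRAG) | Agriculture (LeanRAG) |
|---|---|---|---|---|---|---|---|---|
| Comprehensiveness | 35.0% | 65.0% | 41.0% | 59.0% | 49.0% | 51.0% | 45.5% | 54.5% |
| Empowerment | 20.0% | 80.0% | 33.5% | 66.5% | 44.0% | 56.0% | 27.0% | 73.0% |
| Diversity | 16.5% | 83.5% | 34.0% | 66.0% | 44.0% | 56.0% | 22.0% | 78.0% |
| Overall | 21.9% | 78.1% | 37.5% | 62.5% | 47.0% | 53.0% | 28.5% | 71.5% |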

LightRAG vs LeanRAG

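| Metric | Mix (LightRAG) | Mix (LeanRAG) | CS (LightRAG) | CS (LeanRAG) | Legal (LightRAG) | Legal (LeanRAG) | Agriculture (LightRAG) | Agriculture (LeanRAG) |
|---|---|---|---|---|---|---|---|---|
| Comprehensiveness | 28.8% | 71.2% | 44.5% | 55.5% | 25.0% | 75.0% | 38.0% | 62.0% |
| Empowerment | 16.5% | 83.5% | 35.5% | 64.5% | 12.0% | 88.0% | 17.0% | 83.0% |
| Diversity | 13.1% | 86.9% | 34.0% | 66.0% | 40.5% | 59.5% | 16.5% | 83.5% |
| Overall | 18.8% | 81.2% | 38.5% | 61.5% | 21.0% | 79.0% | 18.5% | 81.5% |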

FastGraphRAG vs LeanRAG

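| Metric | Mix (FastGraphRAG) | Mix (LeanRAG) | CS (FastGraphRAG) | CS (LeanRAG) | Legal (FastGraphRAG) | Legal (LeanRAG) | Agriculture (FastGraphRAG) | Agriculture (LeanRAG) |
|---|---|---|---|---|---|---|---|---|
| Comprehensiveness | 0.0% | 100.0% | 0.5% | 99.5% | 1.0% | 99.0% | 0.5% | 99.5% |
| Empowerment | 0.0% | 100.0% | 0.0% | 100.0% | 0.5% | 99.5% | 0.0% | 100.0% |
| Diversity | 0.0% | 100.0% | 0.8% | 99.2% | 2.5% | 97.5% | 0.0% | 100.0% |
| Overall | 0.0% | 100.0% | 0.0% | 100.0% | 4.5% | 95.5% | 0.0% | 100.0% |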

KAG vs LeanRAG

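| Metric | Mix (KAG) | Mix (LeanRAG) | CS (KAG) | CS (LeanRAG) | Legal (KAG) | Legal (LeanRAG) | Agriculture (KAG) | Agriculture (LeanRAG) |
|---|---|---|---|---|---|---|---|---|
| Comprehensiveness | 1.5% | 98.5% | 5.0% | 95.0% | 5.0% | 95.0% | 2.5% | 97.5% |
| Empowerment | 1.9% | 98.1% | 3.0% | 97.0% | 4.5% | 95.5% | 2.5% | 97.5% |
| Diversity | 1.2% | 98.8% | 4.0% | 96.0% | 2.5% | 97.5% | 1.0% | 99.0% |
| Overall | 1.2% | 98.8% | 3.5% | 96.5% | 4.5% | 95.5% | 1.0% | 99.0% |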

HiRAG vs LeanRAG

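| Metric | Mix (HiRAG) | Mix (LeanRAG) | CS (HiRAG) | CS (LeanRAG) | Legal (HiRAG) | Legal (LeanRAG) | Agriculture (HiRAG) | Agriculture (LeanRAG) |
|---|---|---|---|---|---|---|---|---|
| Comprehensiveness | 43.8% | 56.2% | 46.5% | 53.5% | 29.5% | 70.5% | 49.5% | 50.5% |
| Empowerment | 26.5% | 73.5% | 43.5% | 56.5% | 16.5% | 83.5% | 26.5% | 73.5% |
| Diversity | 20.4% | 79.6% | 44.5% | 55.5% | 23.5% | 76.5% | 23.5% | 76.5% |
| Overall | 28.1% | 71.9% | 45.0% | 55.0% | 21.5% | 78.5% | 28.0% | 72.0% |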

Token Consumption

Figure: retrieval information tokens

Acknowledgement

We gratefully acknowledge the use of the following open-source projects in our work:

  • nano-graphrag: a simple, easy-to-hack GraphRAG implementation

  • HiRAG: a novel hierarchy entity aggregation and optimized retrieval RAG method

📄 Citation

If you find LeanRAG useful, please cite our paper:

@inproceedings{zhang2026leanrag,
  title={{LeanRAG}: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval},
  author={Zhang, Yaoze and Wu, Rong and Cai, Pinlong and Wang, Xiaoman and Yan, Guohang and Mao, Song and Wang, Ding and Shi, Botian},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={40},
  number={41},
  pages={34862--34869},
  year={2026}
}
