Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)

Reddit r/LocalLLaMA News

Summary

A detailed technical query about building a local document RAG system covering storage, ingestion, query, and highlighting, seeking advice on vector databases, GraphRAG feasibility, and document highlighting implementations.

Hey folks, I’m working on designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:

Storage
- Upload PDF, DOCX, XLSX, CSV, tables
- All data stored locally (no cloud)

Document Ingestion
- Watch folder (e.g., Watchdog) → auto‑ingest on file add/modify/delete
- Nested folder structure → auto‑tagging
- Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG
- Version control on re‑upload

Query & Retrieval
- Restrict queries to a single client’s documents (no cross‑client leakage)
- Structured queries (e.g., “Show invoices > ₹1 lakh”)
- Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)
- Keyword fallback

Highlighting & Rendering
- Annotated PDF served to frontend
- XLSX → colored cell export
- Jump directly to highlighted page
- Multi‑document highlights in one response

Answer Generation
- Local LLM only
- Every claim cited with doc + page reference

My Questions

Parsing: I’m considering LlamaIndex LiteParse. → Should I store document IDs + chunk IDs for PDFs to enable highlighting?

Vector DB: Do I need one (e.g., Qdrant)? If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting? Would pgvector in Postgres be sufficient?

GraphRAGs: How effective are systems like Neo4j or Microsoft GraphRAG? Can they run locally/offline, or are they too computationally heavy? Is this GraphRAG pipeline a good starting point?

Highlighting UX: I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation. Any open‑source projects that already do this? I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.

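
On the question of storing doc IDs + chunk IDs alongside embeddings, here is a minimal sketch of the idea in plain Python, with no vector database at all. Everything in it is an assumption made for illustration: the `Chunk` fields, the `search` and `cite` helpers, the file names, and the three-dimensional toy embeddings. In a real setup the same fields would most likely live in whatever metadata the store offers (a Qdrant payload, or ordinary columns next to a pgvector column), and the client filter would be applied by the store rather than in a list comprehension.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Chunk:
    """One retrievable unit; field names are hypothetical, not from any library."""
    client_id: str                 # used to prevent cross-client leakage
    doc_id: str                    # which file the chunk came from
    chunk_id: str                  # stable ID so a hit can be traced back
    page: int                      # page to jump to in the viewer
    bbox: Tuple[float, float, float, float]  # region to highlight on that page
    text: str
    embedding: Tuple[float, ...]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(chunks: List[Chunk], query_embedding: Sequence[float],
           client_id: str, top_k: int = 3) -> List[Chunk]:
    """Filter to one client first, then rank the remaining chunks by similarity."""
    scoped = [c for c in chunks if c.client_id == client_id]
    ranked = sorted(scoped, key=lambda c: cosine(c.embedding, query_embedding),
                    reverse=True)
    return ranked[:top_k]


def cite(chunk: Chunk) -> str:
    """Format a doc + page citation for a retrieved chunk."""
    return f"[{chunk.doc_id}, p. {chunk.page}]"


# Toy data: embeddings and file names are made up for the example.
CHUNKS = [
    Chunk("client_a", "fy24_report.pdf", "c1", 12, (72, 500, 540, 520),
          "Gross profit rose in FY24.", (0.9, 0.1, 0.0)),
    Chunk("client_a", "fy23_report.pdf", "c7", 10, (72, 300, 540, 320),
          "Gross profit for FY23 is reported here.", (0.8, 0.2, 0.1)),
    Chunk("client_b", "other_client.pdf", "c2", 3, (72, 100, 540, 120),
          "Unrelated client data.", (0.9, 0.1, 0.0)),
]

if __name__ == "__main__":
    for hit in search(CHUNKS, (0.9, 0.1, 0.0), client_id="client_a"):
        print(hit.chunk_id, cite(hit), hit.bbox)
```

The point of the sketch is that highlighting only needs each retrieved hit to carry enough location data (document, page, region) to find its way back to the source; the embedding itself plays no part in that step.
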
TL;DR

Trying to build a local RAG system with:
- Storage + ingestion + tagging
- Query + retrieval + highlighting
- Local LLM answer generation with citations

Looking for advice on:
- Vector DB vs pgvector
- GraphRAG feasibility offline
- Best way to implement document highlighting + citation preview

Would love to hear from anyone who’s built something similar or explored these tools.
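
For the "exact sentence highlighted + citation" part, one possible starting point is to locate the cited sentence inside the extracted page text and keep its character offsets. The sketch below assumes the parser already yields per-page text as a dict of page number to string; the `Highlight` type, the `locate` helper, and the sample pages are hypothetical and made up for illustration. Turning character offsets into rectangles on a rendered PDF needs a PDF library and is not shown.

```python
from typing import Dict, NamedTuple, Optional


class Highlight(NamedTuple):
    """Where a cited sentence sits in the extracted text (hypothetical type)."""
    doc_id: str
    page: int
    start: int  # character offset of the sentence within the page text
    end: int    # offset one past the last character


def locate(doc_id: str, pages: Dict[int, str], sentence: str) -> Optional[Highlight]:
    """Return the first exact occurrence of `sentence`, or None if not found.

    Exact matching is an assumption; OCR'd or re-flowed text would likely
    need whitespace normalisation or fuzzy matching instead.
    """
    for page_no, page_text in pages.items():
        start = page_text.find(sentence)
        if start != -1:
            return Highlight(doc_id, page_no, start, start + len(sentence))
    return None


# Toy page text, invented for the example.
PAGES = {
    1: "Summary of accounts. Revenue grew during the year.",
    2: "Gross profit was higher in FY24. Operating costs fell.",
}

if __name__ == "__main__":
    hit = locate("fy24_report.pdf", PAGES, "Gross profit was higher in FY24.")
    if hit is not None:
        print(f"{hit.doc_id}, p. {hit.page}: chars {hit.start}-{hit.end}")
```

A frontend could use the page number to jump to the right page and the offsets (or a bounding box derived from them) to draw the highlight, with one `Highlight` per cited claim to cover multi‑document answers.
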