@wsl8297: If you have a bunch of PDFs, documents, project materials to feed to AI, Synthadoc is a direction worth looking at. GitHub: https://github.com/axoviq-ai/synthadoc… It compiles raw materials into a structured wiki at ingestion time, automatically...

X AI KOLs Timeline 05/23/26, 01:59 AM Tools

pdf-processing document-to-wiki knowledge-base open-source obsidian llm structured-wiki

Summary

Synthadoc is an open-source tool that compiles PDFs, documents, and other project materials into a structured local Markdown wiki, automatically establishing cross-references and detecting contradictions. It suits individuals and small teams who need offline knowledge management.

If you have a bunch of PDFs, documents, and project materials to feed to AI, Synthadoc is a direction worth looking at. GitHub: https://github.com/axoviq-ai/synthadoc… It compiles raw materials into a structured wiki at ingestion time, automatically building cross-references, detecting contradictions, marking orphan pages, and outputting pure Markdown files that can be used directly in Obsidian.

Use cases:
• Independent developers who need to organize personal research materials into a knowledge base
• Small teams that want to unify internal documents into a traceable wiki
• Companies with compliance sensitivity that cannot upload data to the cloud
• Knowledge management that needs offline access and tool-independent operation

Compared to RAG's query-time synthesis, it is more reliable: contradictions are not mixed together, and knowledge does not scatter into fragments.

Original Article

Cached at: 05/23/26, 04:02 AM

If you have a pile of PDFs, documents, or project files to feed into AI, Synthadoc is a direction well worth looking into. GitHub: https://github.com/axoviq-ai/synthadoc… It compiles raw source materials into a structured wiki at ingestion time, automatically building cross-references, detecting contradictions, flagging orphan pages, and outputting plain Markdown files that you can use directly in Obsidian.
Use cases:
• Independent developers who need to organize personal research into a knowledge base
• Small teams that want to unify internal documentation into a traceable wiki
• Compliance-sensitive enterprises that cannot put data in the cloud
• Knowledge management that requires offline access and no tool lock-in

More reliable than RAG’s query-time synthesis — contradictions don’t get blended together, and knowledge doesn’t scatter into fragments.

Watch the Synthadoc demo on YouTube (https://www.youtube.com/watch?v=rIGO6zi9XQE)

Who Is It For?
Inspiration and Vision
Problems Addressed
Why Synthadoc?
Architecture
What’s Included
Installation
Quick-Start Guide
Creating Your Own Wiki
Configuration
Command Reference by Use Case
Administrative Reference
Understanding Logs and the Audit Trail
Customization
Links

Who Is It For?

Synthadoc scales from a single researcher to a company-wide knowledge platform:

Team size	Typical use case
Solo / 1–2 people	Personal research wiki, freelance knowledge base, indie hacker documentation - run it free on Gemini Flash or a local Ollama model with zero ongoing cost
Small team (3–20)	Centralized internal knowledge base for startups and departments: it aggregates individual data sources into one wiki, flags contradictions at ingest (auto-resolving them above a confidence threshold, otherwise queuing them for human review), and grows with the team
Medium / enterprise	Compliance-sensitive knowledge bases that must stay local; per-department wikis on separate ports; audit trail for every ingest and cost event; hook system for CI/CD integration; OpenTelemetry for ops dashboards

No cloud account. No vendor lock-in. The wiki is plain Markdown — open it in any editor, back it up with git, sync it with any cloud drive.

Inspiration and Vision

“The LLM should be able to maintain a wiki for you.”
— Andrej Karpathy, LLM Wiki gist (https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)

Most knowledge-management tools retrieve and summarize at query time. Synthadoc inverts this: it compiles knowledge at ingest time. Every new source enriches and cross-links the entire corpus rather than just appending a new chunk. The wiki is the artifact: readable, editable, and browsable without any tool running.

Long-term alignment:

Direction	How Synthadoc moves there
Agent orchestration	Orchestrator dispatches parallel IngestAgent, QueryAgent, LintAgent sub-agents with cost guards and retry backoff
Sub-agent skills/plugins	A 3-tier lazy-load capability system lets you plug in custom skills and hooks, so extensions do not destabilize the core
LLM wiki vs. RAG	Pre-compiled structured knowledge beats query-time synthesis for contradiction detection, graph traversal, and offline access
CLI / HTTP	The CLI and RESTful endpoints expose one unified interface covering data ingestion, querying, automated linting, security auditing, and job orchestration
Local-first	All data stays on your machine; localhost-only network binding; no cloud dependency except the LLM API itself
Provider choice	LLM backends including free-tier Gemini and Groq, paid Anthropic/OpenAI/DeepSeek/MiniMax, local Ollama, and coding-tool CLI providers (Claude Code, Opencode) — no API key required if you already have a subscription

Problems Addressed

1. RAG conflates contradictions; Synthadoc surfaces them

When two sources disagree, vector search returns both and the LLM silently blends them. Synthadoc detects the conflict during ingest, flags the page with status: contradicted, preserves both claims with citations, and either auto-resolves (if confidence ≥ threshold) or queues the conflict for human review.
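The triage step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Synthadoc's actual code: the `Claim` class, the `triage_conflict` function, the 0.8 default threshold, and the idea that the resolver reports a single 0-to-1 confidence score are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    source: str  # citation for the claim, e.g. the source file name


def triage_conflict(a: Claim, b: Claim, resolution_confidence: float,
                    threshold: float = 0.8) -> dict:
    """Preserve both claims and decide who resolves the conflict.

    The page is always marked contradicted; only the follow-up action
    depends on the (hypothetical) confidence score.
    """
    action = "auto_resolve" if resolution_confidence >= threshold else "human_review"
    return {"status": "contradicted", "claims": [a, b], "action": action}
```

Both claims stay attached to the result in either branch, which mirrors the point that nothing is silently blended or dropped.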

2. Knowledge fragments; Synthadoc links it

RAG chunks are isolated. Synthadoc builds [[wikilinks]] between related pages during every ingest pass. The resulting graph is visible in Obsidian’s Graph view and queryable with Dataview.

3. Orphan knowledge has no address; Synthadoc finds it

Pages that exist but are referenced by nothing are surfaced by the lint system, with ready-to-paste index entries so you can quickly integrate them.
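As a rough illustration of what orphan detection over a folder of Markdown pages involves, here is a minimal sketch. It is not Synthadoc's lint implementation: the `find_orphans` function, the regex, and the choice to treat `index` as a root page that is never reported are assumptions made for the example.

```python
import re
from pathlib import Path

# Captures the link target, stopping before an "|alias" or "#heading" suffix.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphans(wiki_dir: str, roots: tuple[str, ...] = ("index",)) -> list[str]:
    """Return the names of pages that no other page links to."""
    pages = {p.stem: p.read_text(encoding="utf-8")
             for p in Path(wiki_dir).rglob("*.md")}
    linked: set[str] = set()
    for name, text in pages.items():
        for target in WIKILINK.findall(text):
            target = target.strip()
            if target != name:  # a self-link does not rescue a page
                linked.add(target)
    return sorted(n for n in pages if n not in linked and n not in roots)
```

Calling `find_orphans("path/to/wiki")` returns a sorted list of page names, which could then be turned into index entries by hand.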

4. LLM-compiled content can be overconfident; Synthadoc audits it

An LLM synthesising source documents naturally produces confident prose — but may overstate claims, omit caveats, or accept a source’s framing uncritically. The adversarial lint pass runs a concurrent second-LLM review of every page: it plays devil’s advocate to surface issues the primary model accepted too readily — contested estimates, unsupported superlatives, and claims that contradict well-established facts. Warnings are stored in page frontmatter and surfaced in both the CLI report and the Obsidian lint modal. The reviewer is calibrated to flag only high-confidence issues, producing a useful signal without noise. For the strongest signal, point the adversarial pass at a different model family: a distinct model is far more likely to challenge assumptions than the same model reviewing its own output.

Claim-Level Provenance

Every substantive claim in the wiki is annotated with ^[filename:L-L] — a citation pointing to the exact line range in the source file it came from. Click the citation chip in Obsidian to open a Source Viewer showing the highlighted passage; for PDF sources, Synthadoc resolves the PDF page number automatically via a pagemap sidecar. A global Provenance modal shows all citations across the wiki, sortable and filterable. Broken citations are caught by the lint system and logged in the audit trail. Run synthadoc audit citations to query citations from the CLI.
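To make the citation format concrete, the sketch below extracts `^[filename:L-L]` markers from a page and checks them against a source's line count. The exact syntax is an assumption: the example accepts both `^[report.txt:10-14]` and `^[report.txt:L10-L14]`, and the `Citation`, `extract_citations`, and `is_broken` names are hypothetical, not part of Synthadoc.

```python
import re
from typing import NamedTuple

# Assumed shape: ^[<filename>:<start>-<end>], with an optional "L" before each number.
CITATION = re.compile(r"\^\[([^\]:]+):L?(\d+)-L?(\d+)\]")


class Citation(NamedTuple):
    filename: str
    start: int
    end: int


def extract_citations(markdown: str) -> list[Citation]:
    """Return every citation marker found in a page, in document order."""
    return [Citation(f, int(s), int(e)) for f, s, e in CITATION.findall(markdown)]


def is_broken(c: Citation, source_line_count: int) -> bool:
    """A citation is broken if its range is inverted or falls outside the source."""
    return c.start < 1 or c.end < c.start or c.end > source_line_count
```

A lint pass along these lines would report any citation whose line range no longer exists in the source file, for instance after the source was edited and shortened.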

5. Re-synthesis is expensive; Synthadoc caches it

A 3-layer cache (embedding, LLM response, provider prompt cache) means repeated lint runs on unchanged pages cost near-zero tokens.
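The LLM-response layer of such a cache can be pictured as a lookup keyed by a hash of the model name and prompt, so an unchanged page produces an unchanged key and no new request. The sketch below is an in-memory illustration only; the `ResponseCache` class and its key scheme are assumptions, and the embedding and provider prompt-cache layers are not modelled.

```python
import hashlib
from typing import Callable


class ResponseCache:
    """In-memory response cache keyed by a hash of (model, prompt)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str,
                    call: Callable[[str], str]) -> str:
        k = self.key(model, prompt)
        if k in self._store:
            self.hits += 1  # unchanged input: no tokens spent
            return self._store[k]
        self.misses += 1
        self._store[k] = call(prompt)  # only a miss reaches the model
        return self._store[k]
```

Including the model name in the key means that switching models forces a fresh call rather than reusing another model's answer.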

6. Knowledge is locked in tools; Synthadoc escapes it

Every page is a plain Markdown file with YAML frontmatter. No proprietary format. Open the folder in any editor, put it in git, sync it with any cloud drive.
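Because pages are plain Markdown with YAML frontmatter, they can be read without Synthadoc running at all. The following sketch splits a page into metadata and body using only the standard library. It is deliberately simplified: it handles flat `key: value` pairs only (no nested YAML), the `read_frontmatter` name is hypothetical, and the field names in the usage note are illustrative apart from `status: contradicted`, which the README mentions.

```python
def read_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a page into (flat frontmatter dict, Markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at all
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # unterminated block: treat everything as body
```

For example, a page starting with `---`, `status: contradicted`, `---` yields `{"status": "contradicted"}` plus the remaining Markdown, which is enough to list flagged pages with a few lines of scripting. For real YAML (lists, nesting), a proper YAML parser would be the safer choice.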

7. Wiki structure decays as content grows; Synthadoc regenerates it

As the wiki accumulates pages, the index.md table of contents, domain scope (purpose.md), and LLM behaviour guidelines (AGENTS.md) can drift out of sync with actual content. The scaffold command regenerates all three from the current wiki state using the LLM, creating category-aware index entries, refreshed scope boundaries, and updated terminology guidelines, without touching pages already linked in the index. Run it once after initial install to get a rich scaffold, then schedule it weekly as the wiki grows.

Business values

Value	How
Faster onboarding	New team members query the wiki instead of digging through documents
Audit trail	Every ingest is recorded in `audit.db` with source hash, token count, and timestamp
Cost control	Configurable soft-warn and hard-gate thresholds; 3-layer cache reduces repeat spend
Compliance	Local-first — source documents and compiled knowledge never leave your machine
Extensibility	Hooks fire on every event; custom skills load without a server restart
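The soft-warn and hard-gate thresholds listed under cost control can be pictured with a small sketch. This is an assumption about the general pattern, not Synthadoc's implementation: the `CostGuard` class, the `BudgetExceeded` exception, and the use of US dollars as the unit are hypothetical.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a job would push spend past the hard gate."""


class CostGuard:
    def __init__(self, soft_warn_usd: float, hard_gate_usd: float) -> None:
        if soft_warn_usd > hard_gate_usd:
            raise ValueError("soft-warn threshold must not exceed the hard gate")
        self.soft_warn_usd = soft_warn_usd
        self.hard_gate_usd = hard_gate_usd
        self.spent = 0.0
        self.warnings: list[str] = []

    def charge(self, job_id: str, cost_usd: float) -> None:
        projected = self.spent + cost_usd
        if projected > self.hard_gate_usd:
            # Hard gate: refuse the job and leave recorded spend unchanged.
            raise BudgetExceeded(f"{job_id}: {projected:.2f} exceeds hard gate")
        self.spent = projected
        if projected > self.soft_warn_usd:
            # Soft warn: the job proceeds, but the overrun is recorded.
            self.warnings.append(f"{job_id}: spend {projected:.2f} above soft limit")
```

The design choice worth noting is that the soft threshold only records a warning while the hard threshold blocks the job before any spend is booked.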

Why Synthadoc?

Competitive advantages

Capability	Synthadoc	Typical RAG	NotebookLM	Notion AI
Ingest-time synthesis	Yes	No	Partial	No
Contradiction detection	Yes	No	No	No
Orphan page detection	Yes	No	No	No
Adversarial claim review	Yes (concurrent second-LLM pass — flags overstated claims and unsupported assertions per page)	No	No	No
Claim-level provenance	Yes (`^[file:L-L]` citations on every claim; Source Viewer in Obsidian; PDF page resolution; global provenance table; broken-citation lint)	No	No	No
Persistent wikilink graph	Yes	No	No	No
Local-first (no cloud data)	Yes	Varies	No	No
Custom skill plugins	Yes	Limited	No	No
Obsidian integration	Yes	No	No	No
Cost guard + audit trail	Yes (per-job token + cost log; claim citations DB; `audit citations` CLI; ingest/lint/citation event types; full audit history API)	No	No	No
Hook / CI integration	Yes (2 events)	No	No	No
Offline browsable artifact	Yes	No	No	No
Multi-wiki isolation	Yes	No	No	No
Web search → wiki pages	Yes	No	No	No
Multi-LLM support	Yes (Gemini, Groq, MiniMax, DeepSeek, Anthropic, OpenAI, Ollama)	No	No	No
Auto wiki overview page	Yes	No	No	No
Resumable job queue + retry	Yes	No	No	No
Query decomposition	Yes (parallel sub-queries)	No	No	No
Knowledge gap detection	Yes	No	No	No
Web search decomposition	Yes (parallel Tavily)	No	No	No
Semantic re-ranking (vector)	Yes (optional fastembed)	Varies	No	No
Scaffold automation	Yes	No	No	No
Coding tool as LLM provider	Yes (Claude Code, Opencode — no API key)	No	No	No
YouTube transcript ingest	Yes (standard + Shorts, no API key, timestamped)	No	No	No
Multilingual / CJK queries	Yes (Chinese, Japanese, Korean — no false gaps)	Limited	No	No
Query-scoped routing	Yes (ROUTING.md — branch-scoped BM25, query auto-selects branch)	No	No	No
Candidates staging	Yes (ingest to staging area, promote or discard)	No	No	No
Context packs	Yes (goal → sub-questions → token-budget evidence pack)	No	No	No

Key differentiators vs. RAG

RAG chunks documents and retrieves them at query time. Synthadoc compiles knowledge: every new source is synthesized into the existing wiki graph at ingest time.

Contradictions are caught, not blended. When two sources disagree, Synthadoc flags the page — RAG silently averages both claims.
Knowledge is linked, not scattered. [[wikilinks]] connect related pages into a navigable graph visible in Obsidian and queryable with Dataview.
The artifact outlives the tool. Close the server, open the wiki folder in any Markdown editor — the knowledge is all there, human-readable, no proprietary format.
Cost-efficient at scale. Two-step ingest with cached analysis means repeated ingest of similar sources costs near-zero tokens. Three cache layers stack for lint and query too.
Ingest is durable, not fragile. Every ingest request becomes a queued job with automatic retry and a persistent audit record. Batch a hundred documents and resume after a crash — no work is lost.
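The durability claim in the last point can be illustrated with a small persistent job queue. The sketch below uses SQLite from the standard library; the table layout, the function names, and the retry limit of 3 are assumptions for illustration, and a real worker would also wait between attempts (retry backoff), which is omitted here.

```python
import sqlite3
from typing import Callable


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the job table; pass a file path to survive restarts."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, source TEXT, "
        "status TEXT DEFAULT 'pending', attempts INTEGER DEFAULT 0)"
    )
    return db


def enqueue(db: sqlite3.Connection, source: str) -> int:
    cur = db.execute("INSERT INTO jobs (source) VALUES (?)", (source,))
    db.commit()
    return cur.lastrowid


def run_pending(db: sqlite3.Connection, ingest: Callable[[str], None],
                max_attempts: int = 3) -> None:
    """Run every pending job, retrying failures up to max_attempts."""
    rows = db.execute(
        "SELECT id, source, attempts FROM jobs WHERE status = 'pending'"
    ).fetchall()
    for job_id, source, attempts in rows:
        status = "pending"
        while status == "pending":
            attempts += 1
            try:
                ingest(source)
                status = "done"
            except Exception:
                if attempts >= max_attempts:
                    status = "failed"
            # Persist progress after every attempt so a crash loses nothing.
            db.execute("UPDATE jobs SET status = ?, attempts = ? WHERE id = ?",
                       (status, attempts, job_id))
            db.commit()
```

Because job state lives in the database rather than in memory, calling `run_pending` again after a crash picks up exactly the jobs still marked pending.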

Architecture

Synthadoc Architecture

For full architecture details, data models, API reference, and plugin development guide see docs/design.md.

What’s Included

See docs/design.md — Appendix A: Release Feature Index for a full feature list by version.

Installation

Prerequisites

Requirement	Version	Notes
Python	3.11+
Node.js	18+	Obsidian plugin build only
Git	any
LLM API key	—	At least one required — unless using Claude Code or Opencode (see below)
Tavily API key	—	Optional — web search feature only

LLM API key — at least one required (unless using Claude Code or Opencode — see the last two rows below):

Provider	Free tier	Vision	Get key
Gemini Flash	Yes — 15 RPM / 1M tokens/day, no credit card	Yes	aistudio.google.com (https://aistudio.google.com/app/apikey)
Groq	Yes — rate-limited	No	console.groq.com (https://console.groq.com/keys)
Ollama	Yes — runs locally, no key	Model-dependent	ollama.com (https://ollama.com)
MiniMax	No — pay-per-token	Yes	platform.minimax.io (https://platform.minimax.io/)
DeepSeek	No — pay-per-token (very cheap text rates)	No	platform.deepseek.com (https://platform.deepseek.com/api_keys)
Anthropic	No	Yes	console.anthropic.com (https://console.anthropic.com/)
OpenAI	No	Yes	platform.openai.com (https://platform.openai.com/api-keys)
Claude Code	Included with subscription — no API key	No	Set `provider = "claude-code"` in config.toml
Opencode	Included with subscription — no API key	No	Set `provider = "opencode"` in config.toml

Tavily API key (optional — enables web search): Get a free key at tavily.com (https://tavily.com). Without it, web search jobs will fail but all other features work normally.

Step 1 — Clone and install

```bash
git clone https://github.com/paul…
```

