@wsl8297: If you have a bunch of PDFs, documents, project materials to feed to AI, Synthadoc is a direction worth looking at. GitHub: https://github.com/axoviq-ai/synthadoc… It compiles raw materials into a structured wiki at ingestion time, automatically...
Summary
Synthadoc is an open-source tool that compiles PDFs, documents, and other project materials into a structured local Markdown wiki, automatically establishing cross-references and detecting contradictions. It suits individuals and small teams that need offline knowledge management.
Cached at: 05/23/26, 04:02 AM
If you have a pile of PDFs, documents, or project files to feed into AI, Synthadoc is a direction well worth looking into. GitHub: https://github.com/axoviq-ai/synthadoc… It compiles raw source materials into a structured wiki at ingestion time, automatically building cross-references, detecting contradictions, flagging orphan pages, and outputting plain Markdown files that you can use directly in Obsidian.
Use cases:
• Independent developers who need to organize personal research into a knowledge base
• Small teams that want to unify internal documentation into a traceable wiki
• Compliance-sensitive enterprises that cannot put data in the cloud
• Knowledge management that requires offline access and no tool lock-in
More reliable than RAG’s query-time synthesis — contradictions don’t get blended together, and knowledge doesn’t scatter into fragments.
Watch the demo on YouTube (https://www.youtube.com/watch?v=rIGO6zi9XQE)
Table of Contents
- Who Is It For?
- Inspiration and Vision
- Problems Addressed
- Why Synthadoc?
- Architecture
- What’s Included
- Installation
- Quick-Start Guide
- Creating Your Own Wiki
- Configuration
- Command Reference by Use Case
- Administrative Reference
- Understanding Logs and the Audit Trail
- Customization
- Links
Who Is It For?
Synthadoc scales from a single researcher to a company-wide knowledge platform:
| Team size | Typical use case |
|---|---|
| Solo / 1–2 people | Personal research wiki, freelance knowledge base, indie hacker documentation - run it free on Gemini Flash or a local Ollama model with zero ongoing cost |
| Small team (3–20) | Centralized internal knowledge base for startups and departments that merges each member's data sources into one wiki. Contradictions are flagged at ingest and either auto-resolved above a confidence threshold or queued for human review, so the wiki stays consistent as the team adds material |
| Medium / enterprise | Compliance-sensitive knowledge bases that must stay local; per-department wikis on separate ports; audit trail for every ingest and cost event; hook system for CI/CD integration; OpenTelemetry for ops dashboards |
No cloud account. No vendor lock-in. The wiki is plain Markdown — open it in any editor, back it up with git, sync it with any cloud drive.
Inspiration and Vision
“The LLM should be able to maintain a wiki for you.”
— Andrej Karpathy, LLM Wiki gist (https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
Most knowledge-management tools retrieve and summarize at query time. Synthadoc inverts this: it compiles knowledge at ingest time. Every new source enriches and cross-links the entire corpus rather than just appending a new chunk. The wiki is the artifact: readable, editable, and browsable without any tool running.
Long-term alignment:
| Direction | How Synthadoc moves there |
|---|---|
| Agent orchestration | Orchestrator dispatches parallel IngestAgent, QueryAgent, LintAgent sub-agents with cost guards and retry backoff |
| Sub-agent skills/plugins | A 3-tier lazy-load capability system lets you plug in custom skills and hooks without modifying the core |
| LLM wiki vs. RAG | Pre-compiled structured knowledge beats query-time synthesis for contradiction detection, graph traversal, and offline access |
| CLI / HTTP | The CLI and RESTful endpoints expose the same operations: ingestion, querying, linting, security auditing, and job orchestration |
| Local-first | All data stays on your machine; localhost-only network binding; no cloud dependency except the LLM API itself |
| Provider choice | LLM backends including free-tier Gemini and Groq, paid Anthropic/OpenAI/DeepSeek/MiniMax, local Ollama, and coding-tool CLI providers (Claude Code, Opencode) — no API key required if you already have a subscription |
Problems Addressed
1. RAG conflates contradictions; Synthadoc surfaces them
When two sources disagree, vector search returns both and the LLM silently blends them. Synthadoc detects the conflict during ingest, flags the page with status: contradicted, preserves both claims with citations, and either auto-resolves (if confidence ≥ threshold) or queues the conflict for human review.
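The decision described above can be sketched in a few lines. This is only an illustration, not Synthadoc's code: the `Claim` and `Page` classes, the `handle_conflict` function, the returned action strings, and the way a resolution confidence is supplied are all assumptions. Only the `contradicted` status value comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    citation: str  # illustrative citation string, e.g. "a.md:L3-L4"


@dataclass
class Page:
    title: str
    status: str = "ok"  # hypothetical default status
    claims: list[Claim] = field(default_factory=list)


def handle_conflict(page, existing, incoming, resolution_confidence,
                    threshold, review_queue):
    """Flag the page, keep both claims, then auto-resolve or queue."""
    page.claims = [existing, incoming]   # both claims survive with citations
    page.status = "contradicted"         # status value named in the text
    if resolution_confidence >= threshold:
        return "auto-resolved"           # hypothetical action label
    review_queue.append(page.title)      # a human decides later
    return "queued"                      # hypothetical action label
```

For example, with a threshold of 0.8, a resolution confidence of 0.4 leaves the page in the review queue, while 0.9 resolves it without human input. In both cases the two claims stay on the page.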
2. Knowledge fragments; Synthadoc links it
RAG chunks are isolated. Synthadoc builds [[wikilinks]] between related pages during every ingest pass. The resulting graph is visible in Obsidian’s Graph view and queryable with Dataview.
3. Orphan knowledge has no address; Synthadoc finds it
Pages that exist but are referenced by nothing are surfaced by the lint system, with ready-to-paste index entries so you can quickly integrate them.
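Orphan detection reduces to a graph question: which pages have no inbound [[wikilinks]]? A minimal sketch, assuming pages are held as a name-to-text mapping and that `index` is the root page. The function names and the index-entry format are illustrative, not taken from Synthadoc.

```python
import re

# Captures the link target, stopping before any "|alias" or "#heading" suffix.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphans(pages: dict[str, str], roots=("index",)) -> list[str]:
    """Return pages that no other page links to, excluding root pages."""
    linked = set()
    for name, body in pages.items():
        for target in WIKILINK.findall(body):
            target = target.strip()
            if target != name:  # a self-link does not count as inbound
                linked.add(target)
    return sorted(p for p in pages if p not in linked and p not in roots)


def index_entry(name: str) -> str:
    """Format a ready-to-paste index line (hypothetical format)."""
    return f"- [[{name}]]"
```

A page that links out but is never linked to still counts as an orphan, which matches the "referenced by nothing" definition above.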
4. LLM-compiled content can be overconfident; Synthadoc audits it
An LLM synthesising source documents naturally produces confident prose — but may overstate claims, omit caveats, or accept a source’s framing uncritically. The adversarial lint pass runs a concurrent second-LLM review of every page: it plays devil’s advocate to surface issues the primary model accepted too readily — contested estimates, unsupported superlatives, and claims that contradict well-established facts. Warnings are stored in page frontmatter and surfaced in both the CLI report and the Obsidian lint modal. The reviewer is calibrated to flag only high-confidence issues, producing a useful signal without noise. For the strongest signal, point the adversarial pass at a different model family: a distinct model is far more likely to challenge assumptions than the same model reviewing its own output.
Claim-Level Provenance
Every substantive claim in the wiki is annotated with ^[filename:L-L] — a citation pointing to the exact line range in the source file it came from. Click the citation chip in Obsidian to open a Source Viewer showing the highlighted passage; for PDF sources, Synthadoc resolves the PDF page number automatically via a pagemap sidecar. A global Provenance modal shows all citations across the wiki, sortable and filterable. Broken citations are caught by the lint system and logged in the audit trail. Run synthadoc audit citations to query citations from the CLI.
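To show what a broken-citation check might involve, here is a small parser for the citation markers. The exact marker syntax is an assumption: the sketch accepts both `^[file:L12-L18]` and `^[file:12-18]`, and the `Citation` type, `parse_citations`, and `is_broken` are hypothetical names.

```python
import re
from typing import NamedTuple

# Assumed syntax: ^[<source>:L<start>-L<end>], with the "L" prefixes optional.
CITATION = re.compile(r"\^\[([^\]:]+):L?(\d+)-L?(\d+)\]")


class Citation(NamedTuple):
    source: str
    start: int
    end: int


def parse_citations(text: str) -> list[Citation]:
    """Extract every citation marker found in a page body."""
    return [Citation(s, int(a), int(b)) for s, a, b in CITATION.findall(text)]


def is_broken(c: Citation, line_counts: dict[str, int]) -> bool:
    """A citation is broken if its source is missing or its range is invalid."""
    total = line_counts.get(c.source)
    return total is None or c.start < 1 or c.end < c.start or c.end > total
```

Here `line_counts` maps each known source file to its number of lines, so a citation pointing past the end of a file, or at a file that no longer exists, is reported as broken.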
5. Re-synthesis is expensive; Synthadoc caches it
A 3-layer cache (embedding, LLM response, provider prompt cache) means repeated lint runs on unchanged pages cost near-zero tokens.
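The middle layer, an LLM response cache, can be illustrated with a content-hash lookup: if the model and prompt are unchanged, the stored answer is returned and no tokens are spent. This is a generic sketch of the idea, not Synthadoc's cache. `ResponseCache` and its methods are hypothetical, and a real cache would persist to disk rather than a dict.

```python
import hashlib


class ResponseCache:
    """In-memory response cache keyed by a hash of model name and prompt."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call):
        """Return the cached response, or invoke call(prompt) and store it."""
        k = self.key(model, prompt)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = call(prompt)
        self._store[k] = result
        return result
```

Because the key includes the page text (through the prompt), editing a page changes the hash and forces a fresh call, while an unchanged page is served from the cache.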
6. Knowledge is locked in tools; Synthadoc escapes it
Every page is a plain Markdown file with YAML frontmatter. No proprietary format. Open the folder in any editor, put it in git, sync it with any cloud drive.
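Since pages are ordinary files, any script can read them. The sketch below splits a page into frontmatter and body using only the standard library. It handles flat `key: value` pairs only (real YAML needs a proper parser), and the `title` field in the example is an assumption; `status: contradicted` is the one field value mentioned in this README.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown page into (frontmatter dict, body).

    Only flat "key: value" lines are understood. Nested YAML, lists, and
    quoted values are out of scope for this sketch.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":  # closing delimiter found
            meta = {}
            for line in lines[1:i]:
                if ":" in line:
                    key, _, value = line.partition(":")
                    meta[key.strip()] = value.strip()
            return meta, "\n".join(lines[i + 1:])
    return {}, text  # no closing delimiter: treat everything as body
```

For instance, a page starting with `---`, `title: Pricing`, `status: contradicted`, `---` yields a dict with those two keys, and the remainder of the file is returned as the body.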
7. Wiki structure decays as content grows; Synthadoc regenerates it
As the wiki accumulates pages, the index.md table of contents, domain scope (purpose.md), and LLM behaviour guidelines (AGENTS.md) can drift out of sync with actual content. The scaffold command uses the LLM to regenerate all three from the current wiki state, producing category-aware index entries, refreshed scope boundaries, and updated terminology guidelines, without touching pages already linked in the index. Run it once after initial install to get a rich scaffold, then schedule it weekly as the wiki grows.
Business values
| Value | How |
|---|---|
| Faster onboarding | New team members query the wiki instead of digging through documents |
| Audit trail | Every ingest is recorded in audit.db with source hash, token count, and timestamp |
| Cost control | Configurable soft-warn and hard-gate thresholds; 3-layer cache reduces repeat spend |
| Compliance | Local-first — source documents and compiled knowledge never leave your machine |
| Extensibility | Hooks fire on every event; custom skills load without a server restart |
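The soft-warn and hard-gate thresholds in the cost-control row can be pictured as a two-level budget check. The sketch below is an assumption about how such a guard could behave, not Synthadoc's implementation. `CostGuard`, `BudgetExceeded`, and the USD-based accounting are hypothetical.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a job would push spend past the hard gate."""


class CostGuard:
    def __init__(self, soft_warn_usd: float, hard_gate_usd: float):
        if soft_warn_usd > hard_gate_usd:
            raise ValueError("soft-warn threshold must not exceed the hard gate")
        self.soft_warn_usd = soft_warn_usd
        self.hard_gate_usd = hard_gate_usd
        self.spent = 0.0
        self.warnings: list[str] = []  # job ids that crossed the soft threshold

    def charge(self, job_id: str, cost_usd: float) -> None:
        """Record a job's cost, warning softly or refusing outright."""
        projected = self.spent + cost_usd
        if projected > self.hard_gate_usd:
            # Hard gate: refuse the job and leave recorded spend untouched.
            raise BudgetExceeded(f"{job_id} would raise spend to {projected:.2f} USD")
        self.spent = projected
        if projected > self.soft_warn_usd:
            self.warnings.append(job_id)  # soft warn: job proceeds, but is noted
```

The design point is that the soft threshold never blocks work, while the hard gate rejects a job before any spend is recorded.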
Why Synthadoc?
Competitive advantages
| Capability | Synthadoc | Typical RAG | NotebookLM | Notion AI |
|---|---|---|---|---|
| Ingest-time synthesis | Yes | No | Partial | No |
| Contradiction detection | Yes | No | No | No |
| Orphan page detection | Yes | No | No | No |
| Adversarial claim review | Yes (concurrent second-LLM pass — flags overstated claims and unsupported assertions per page) | No | No | No |
| Claim-level provenance | Yes (^[file:L-L] citations on every claim; Source Viewer in Obsidian; PDF page resolution; global provenance table; broken-citation lint) | No | No | No |
| Persistent wikilink graph | Yes | No | No | No |
| Local-first (no cloud data) | Yes | Varies | No | No |
| Custom skill plugins | Yes | Limited | No | No |
| Obsidian integration | Yes | No | No | No |
| Cost guard + audit trail | Yes (per-job token + cost log; claim citations DB; audit citations CLI; ingest/lint/citation event types; full audit history API) | No | No | No |
| Hook / CI integration | Yes (2 events) | No | No | No |
| Offline browsable artifact | Yes | No | No | No |
| Multi-wiki isolation | Yes | No | No | No |
| Web search → wiki pages | Yes | No | No | No |
| Multi-LLM support | Yes (Gemini, Groq, MiniMax, DeepSeek, Anthropic, OpenAI, Ollama) | No | No | No |
| Auto wiki overview page | Yes | No | No | No |
| Resumable job queue + retry | Yes | No | No | No |
| Query decomposition | Yes (parallel sub-queries) | No | No | No |
| Knowledge gap detection | Yes | No | No | No |
| Web search decomposition | Yes (parallel Tavily) | No | No | No |
| Semantic re-ranking (vector) | Yes (optional fastembed) | Varies | No | No |
| Scaffold automation | Yes | No | No | No |
| Coding tool as LLM provider | Yes (Claude Code, Opencode — no API key) | No | No | No |
| YouTube transcript ingest | Yes (standard + Shorts, no API key, timestamped) | No | No | No |
| Multilingual / CJK queries | Yes (Chinese, Japanese, Korean — no false gaps) | Limited | No | No |
| Query-scoped routing | Yes (ROUTING.md — branch-scoped BM25, query auto-selects branch) | No | No | No |
| Candidates staging | Yes (ingest to staging area, promote or discard) | No | No | No |
| Context packs | Yes (goal → sub-questions → token-budget evidence pack) | No | No | No |
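The "token-budget evidence pack" in the last row suggests selecting the most relevant passages that fit a fixed budget. One plausible reading is a greedy selection by relevance score, sketched below. The function name, the `(score, text)` input shape, and counting tokens by whitespace splitting are all assumptions made for illustration.

```python
def build_pack(passages: list[tuple[float, str]], budget_tokens: int):
    """Greedily pick the highest-scoring passages that fit the token budget.

    Token cost is approximated by whitespace-separated word count, which is
    only a stand-in for a real tokenizer.
    """
    pack, used = [], 0
    for _score, text in sorted(passages, key=lambda p: p[0], reverse=True):
        cost = len(text.split())
        if used + cost <= budget_tokens:  # skip passages that would overflow
            pack.append(text)
            used += cost
    return pack, used
```

Greedy selection is not optimal in general (this is a knapsack problem), but it is simple and keeps the highest-ranked evidence first.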
Key differentiators vs. RAG
RAG chunks documents and retrieves them at query time. Synthadoc compiles knowledge: every new source is synthesized into the existing wiki graph at ingest time.
- Contradictions are caught, not blended. When two sources disagree, Synthadoc flags the page — RAG silently averages both claims.
- Knowledge is linked, not scattered. [[wikilinks]] connect related pages into a navigable graph visible in Obsidian and queryable with Dataview.
- The artifact outlives the tool. Close the server and open the wiki folder in any Markdown editor: the knowledge is all there, human-readable, with no proprietary format.
- Cost-efficient at scale. Two-step ingest with cached analysis means repeated ingest of similar sources costs near-zero tokens. Three cache layers stack for lint and query too.
- Ingest is durable, not fragile. Every ingest request becomes a queued job with automatic retry and a persistent audit record. Batch a hundred documents and resume after a crash — no work is lost.
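The durability claim in the last bullet rests on a familiar pattern: persist each job's state, and on restart pick up whatever is still pending. Below is a minimal sketch of that pattern using SQLite from the standard library. The table schema, status names, and function names are assumptions and are not taken from Synthadoc. Retry backoff delays are omitted for brevity.

```python
import sqlite3


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the job table; use a file path for real durability."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, source TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', "
        "attempts INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db


def enqueue(db: sqlite3.Connection, source: str) -> int:
    cur = db.execute("INSERT INTO jobs (source) VALUES (?)", (source,))
    db.commit()
    return cur.lastrowid


def run_pending(db: sqlite3.Connection, ingest, max_attempts: int = 3) -> None:
    """Run every pending job once; failures stay pending until attempts run out."""
    rows = db.execute(
        "SELECT id, source, attempts FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for job_id, source, attempts in rows:
        try:
            ingest(source)
            status = "done"
        except Exception:
            status = "failed" if attempts + 1 >= max_attempts else "pending"
        db.execute(
            "UPDATE jobs SET status = ?, attempts = ? WHERE id = ?",
            (status, attempts + 1, job_id),
        )
        db.commit()  # commit per job so a crash loses at most the current one


def statuses(db: sqlite3.Connection) -> dict[str, str]:
    return dict(db.execute("SELECT source, status FROM jobs").fetchall())
```

Calling `run_pending` again after a crash or a transient failure only touches jobs still marked pending, so completed work is never repeated.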
Architecture
Synthadoc Architecture
For full architecture details, data models, API reference, and plugin development guide see docs/design.md.
What’s Included
See docs/design.md — Appendix A: Release Feature Index for a full feature list by version.
Installation
Prerequisites
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.11+ | |
| Node.js | 18+ | Obsidian plugin build only |
| Git | any | |
| LLM API key | — | At least one required — unless using Claude Code or Opencode (see below) |
| Tavily API key | — | Optional — web search feature only |
LLM API key — at least one required (unless using Claude Code or Opencode — see the last two rows below):
| Provider | Free tier | Vision | Get key |
|---|---|---|---|
| Gemini Flash | Yes — 15 RPM / 1M tokens/day, no credit card | Yes | aistudio.google.com (https://aistudio.google.com/app/apikey) |
| Groq | Yes — rate-limited | No | console.groq.com (https://console.groq.com/keys) |
| Ollama | Yes — runs locally, no key | Model-dependent | ollama.com (https://ollama.com) |
| MiniMax | No — pay-per-token | Yes | platform.minimax.io (https://platform.minimax.io/) |
| DeepSeek | No — pay-per-token (very cheap text rates) | No | platform.deepseek.com (https://platform.deepseek.com/api_keys) |
| Anthropic | No | Yes | console.anthropic.com (https://console.anthropic.com/) |
| OpenAI | No | Yes | platform.openai.com (https://platform.openai.com/api-keys) |
| Claude Code | Included with subscription; no API key | No | Set provider = "claude-code" in config.toml |
| Opencode | Included with subscription; no API key | No | Set provider = "opencode" in config.toml |
Tavily API key (optional — enables web search): Get a free key at tavily.com (https://tavily.com). Without it, web search jobs will fail but all other features work normally.
Step 1 — Clone and install
```bash
git clone https://github.com/paul…
```