@wsl8297: If you have a bunch of PDFs, documents, project materials to feed to AI, Synthadoc is a direction worth looking at. GitHub: https://github.com/axoviq-ai/synthadoc… It compiles raw materials into a structured wiki at ingestion time, automatically...


Summary

Synthadoc is an open-source tool that compiles PDFs, documents, and other project materials into a structured local Markdown wiki, automatically establishing cross-references and detecting contradictions. It suits individuals and small teams that need offline knowledge management.

If you have a bunch of PDFs, documents, and project materials to feed to AI, Synthadoc is a direction worth looking at. GitHub: https://github.com/axoviq-ai/synthadoc… It compiles raw materials into a structured wiki at ingestion time, automatically building cross-references, detecting contradictions, marking orphan pages, and outputting pure Markdown files that can be used directly in Obsidian.
Use cases:
• Independent developers who need to organize personal research materials into a knowledge base
• Small teams that want to unify internal documents into a traceable wiki
• Companies with compliance sensitivity that cannot upload data to the cloud
• Knowledge management that needs offline access and tool-independent operation

Compared to RAG's query-time synthesis, it is more reliable: contradictions are not mixed together, and knowledge does not scatter into fragments.

Cached at: 05/23/26, 04:02 AM

If you have a pile of PDFs, documents, or project files to feed into AI, Synthadoc is a direction well worth looking into. GitHub: https://github.com/axoviq-ai/synthadoc… It compiles raw source materials into a structured wiki at ingestion time, automatically building cross-references, detecting contradictions, flagging orphan pages, and outputting plain Markdown files that you can use directly in Obsidian.
Use cases:
• Independent developers who need to organize personal research into a knowledge base
• Small teams that want to unify internal documentation into a traceable wiki
• Compliance-sensitive enterprises that cannot put data in the cloud
• Knowledge management that requires offline access and no tool lock-in

More reliable than RAG’s query-time synthesis — contradictions don’t get blended together, and knowledge doesn’t scatter into fragments.

Watch the Synthadoc demo on YouTube (https://www.youtube.com/watch?v=rIGO6zi9XQE)




Who Is It For?

Synthadoc scales from a single researcher to a company-wide knowledge platform:

| Team size | Typical use case |
| --- | --- |
| Solo / 1–2 people | Personal research wiki, freelance knowledge base, indie hacker documentation; runs free on Gemini Flash or a local Ollama model with zero ongoing cost |
| Small team (3–20) | Centralized internal knowledge base for startups and departments that aggregates individual data sources into one unified wiki. Contradictions are detected at ingest and either auto-resolved or queued for review, so the wiki stays consistent as the team grows |
| Medium / enterprise | Compliance-sensitive knowledge bases that must stay local; per-department wikis on separate ports; audit trail for every ingest and cost event; hook system for CI/CD integration; OpenTelemetry for ops dashboards |

No cloud account. No vendor lock-in. The wiki is plain Markdown — open it in any editor, back it up with git, sync it with any cloud drive.


Inspiration and Vision

“The LLM should be able to maintain a wiki for you.”
— Andrej Karpathy, LLM Wiki gist (https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)

Most knowledge-management tools retrieve and summarize at query time. Synthadoc inverts this: it compiles knowledge at ingest time. Every new source enriches and cross-links the entire corpus, not just appends a new chunk. The wiki is the artifact — readable, editable, and browsable without any tool running.

Long-term alignment:

| Direction | How Synthadoc moves there |
| --- | --- |
| Agent orchestration | Orchestrator dispatches parallel IngestAgent, QueryAgent, LintAgent sub-agents with cost guards and retry backoff |
| Sub-agent skills/plugins | A 3-tier lazy-load capability system lets you add custom skills and hooks as plug-ins without modifying the core |
| LLM wiki vs. RAG | Pre-compiled structured knowledge beats query-time synthesis for contradiction detection, graph traversal, and offline access |
| CLI / HTTP | The CLI and REST endpoints expose the same operations: ingestion, querying, linting, security auditing, and job orchestration |
| Local-first | All data stays on your machine; localhost-only network binding; no cloud dependency except the LLM API itself |
| Provider choice | LLM backends include free-tier Gemini and Groq, paid Anthropic/OpenAI/DeepSeek/MiniMax, local Ollama, and coding-tool CLI providers (Claude Code, Opencode), which need no API key if you already have a subscription |

Problems Addressed

1. RAG conflates contradictions; Synthadoc surfaces them

When two sources disagree, vector search returns both and the LLM silently blends them. Synthadoc detects the conflict during ingest, flags the page with status: contradicted, preserves both claims with citations, and either auto-resolves (if confidence ≥ threshold) or queues the conflict for human review.
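
The auto-resolve-or-queue decision can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Synthadoc's code: the `Conflict` and `TriageResult` types, the `triage` function, the `resolved` status name, and the 0.8 default threshold are all hypothetical. Only the `contradicted` status comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Conflict:
    """Two disagreeing claims found on one page (hypothetical shape)."""
    page: str
    claim_a: str
    claim_b: str
    confidence: float  # how sure the resolver is about which claim wins


@dataclass
class TriageResult:
    status: dict[str, str] = field(default_factory=dict)      # page -> status
    review_queue: list[Conflict] = field(default_factory=list)


def triage(conflicts: list[Conflict], threshold: float = 0.8) -> TriageResult:
    """Auto-resolve confident conflicts; queue the rest for human review."""
    result = TriageResult()
    for c in conflicts:
        if c.confidence >= threshold:
            result.status[c.page] = "resolved"      # assumed status name
        else:
            result.status[c.page] = "contradicted"  # status named in the text
            result.review_queue.append(c)
    return result
```

Both claims stay on the `Conflict` object in either branch, which mirrors the idea that neither side is discarded before a decision is recorded.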

2. Knowledge fragments; Synthadoc links it

RAG chunks are isolated. Synthadoc builds [[wikilinks]] between related pages during every ingest pass. The resulting graph is visible in Obsidian’s Graph view and queryable with Dataview.
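
Because the links are plain `[[wikilinks]]` in Markdown, the graph can be recovered with nothing more than a regular expression. The sketch below assumes Obsidian-style link syntax (`[[Target]]`, `[[Target|alias]]`, `[[Target#heading]]`); the function name is illustrative and is not part of Synthadoc.

```python
import re

# Matches [[Target]], [[Target|alias]] and [[Target#heading]]; captures Target only.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def extract_links(markdown: str) -> list[str]:
    """Return wikilink targets in order of first appearance, without duplicates."""
    seen: dict[str, None] = {}
    for match in WIKILINK.finditer(markdown):
        seen.setdefault(match.group(1).strip(), None)
    return list(seen)
```

Running `extract_links` over every page and storing the result per page gives an adjacency list that any graph tool can consume.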

3. Orphan knowledge has no address; Synthadoc finds it

Pages that exist but are referenced by nothing are surfaced by the lint system, with ready-to-paste index entries so you can quickly integrate them.
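
Orphan detection reduces to a set difference over the link graph. The following is a rough sketch, assuming the graph is already available as a mapping from page name to outgoing link targets; the `index` root page and the `- [[page]]` entry format are assumptions, since the text does not show what a ready-to-paste entry looks like.

```python
def find_orphans(links: dict[str, list[str]],
                 roots: frozenset[str] = frozenset({"index"})) -> list[str]:
    """Pages that no other page links to, excluding entry points such as the index."""
    referenced = {
        target
        for source, targets in links.items()
        for target in targets
        if target != source  # a page linking to itself does not count
    }
    return sorted(page for page in links if page not in referenced and page not in roots)


def index_entry(page: str) -> str:
    """A paste-ready index line for an orphan (format is an assumption)."""
    return f"- [[{page}]]"
```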

4. LLM-compiled content can be overconfident; Synthadoc audits it

An LLM synthesising source documents naturally produces confident prose — but may overstate claims, omit caveats, or accept a source’s framing uncritically. The adversarial lint pass runs a concurrent second-LLM review of every page: it plays devil’s advocate to surface issues the primary model accepted too readily — contested estimates, unsupported superlatives, and claims that contradict well-established facts. Warnings are stored in page frontmatter and surfaced in both the CLI report and the Obsidian lint modal. The reviewer is calibrated to flag only high-confidence issues, producing a useful signal without noise. For the strongest signal, point the adversarial pass at a different model family: a distinct model is far more likely to challenge assumptions than the same model reviewing its own output.
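
The concurrency pattern behind such a review pass might look like the sketch below. It is an illustration only: `adversarial_lint`, the `Reviewer` type, and the concurrency limit are hypothetical names, and `stub_reviewer` is a keyword check standing in for the call to a second model.

```python
import asyncio
from typing import Awaitable, Callable

Reviewer = Callable[[str], Awaitable[list[str]]]


async def adversarial_lint(pages: dict[str, str], review: Reviewer,
                           max_concurrency: int = 4) -> dict[str, list[str]]:
    """Review every page concurrently; return only pages that drew warnings."""
    limit = asyncio.Semaphore(max_concurrency)

    async def review_one(name: str, text: str) -> tuple[str, list[str]]:
        async with limit:  # cap the number of in-flight reviewer calls
            return name, await review(text)

    results = await asyncio.gather(*(review_one(n, t) for n, t in pages.items()))
    return {name: warnings for name, warnings in results if warnings}


async def stub_reviewer(text: str) -> list[str]:
    """Stand-in for the second model: flags a couple of superlatives."""
    await asyncio.sleep(0)
    words = text.lower().split()
    return [f"unsupported superlative: {w}" for w in ("best", "fastest") if w in words]
```

Swapping `stub_reviewer` for a client that calls a different model family is the design choice the paragraph above recommends.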

Claim-Level Provenance

Every substantive claim in the wiki is annotated with ^[filename:L-L] — a citation pointing to the exact line range in the source file it came from. Click the citation chip in Obsidian to open a Source Viewer showing the highlighted passage; for PDF sources, Synthadoc resolves the PDF page number automatically via a pagemap sidecar. A global Provenance modal shows all citations across the wiki, sortable and filterable. Broken citations are caught by the lint system and logged in the audit trail. Run synthadoc audit citations to query citations from the CLI.
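
A broken-citation check can be sketched as follows. The concrete citation shape is an assumption: the text gives only the pattern `^[filename:L-L]`, and this sketch reads it as `^[notes.md:L12-L18]`. The `Citation` type and both functions are illustrative, not Synthadoc's API.

```python
import re
from typing import NamedTuple

# Assumed concrete form of the ^[filename:L-L] pattern, e.g. ^[notes.md:L12-L18]
CITATION = re.compile(r"\^\[([^\]:]+):L(\d+)-L(\d+)\]")


class Citation(NamedTuple):
    filename: str
    start: int
    end: int


def parse_citations(text: str) -> list[Citation]:
    """Extract every citation marker from a page, in document order."""
    return [Citation(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in CITATION.finditer(text)]


def is_broken(citation: Citation, line_counts: dict[str, int]) -> bool:
    """Broken if the source is unknown, or the range is inverted or past the end of the file."""
    total = line_counts.get(citation.filename)
    if total is None:
        return True
    return not (1 <= citation.start <= citation.end <= total)
```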

5. Re-synthesis is expensive; Synthadoc caches it

A 3-layer cache (embedding, LLM response, provider prompt cache) means repeated lint runs on unchanged pages cost near-zero tokens.
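
The reason unchanged pages cost almost nothing is easiest to see in a content-addressed cache. The sketch below models only one layer (the LLM response cache) in memory; the class, its key scheme, and the `call` signature are assumptions made for illustration.

```python
import hashlib
from typing import Callable


class ResponseCache:
    """In-memory stand-in for an LLM response cache keyed by content hash."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str, page_text: str) -> str:
        digest = hashlib.sha256()
        for part in (model, prompt, page_text):
            digest.update(part.encode("utf-8"))
            digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
        return digest.hexdigest()

    def get_or_call(self, model: str, prompt: str, page_text: str,
                    call: Callable[[str, str], str]) -> str:
        """Return the cached response, or invoke `call` once and remember the result."""
        k = self.key(model, prompt, page_text)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = call(prompt, page_text)
        self._store[k] = result
        return result
```

Any edit to the page changes the hash, so only modified pages trigger a new model call.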

6. Knowledge is locked in tools; Synthadoc escapes it

Every page is a plain Markdown file with YAML frontmatter. No proprietary format. Open the folder in any editor, put it in git, sync it with any cloud drive.
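
To show how little tooling the format needs, here is a standard-library sketch that splits a page into frontmatter and body. It handles only flat `key: value` lines, which is an assumption about the pages; nested YAML would need a real YAML parser. The function name is illustrative.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown page into (flat frontmatter dict, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, text  # unterminated block: treat the whole file as body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])
```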

7. Wiki structure decays as content grows; Synthadoc regenerates it

As the wiki accumulates pages, the index.md table of contents, the domain scope (purpose.md), and the LLM behaviour guidelines (AGENTS.md) can drift out of sync with the actual content. The scaffold command regenerates all three from the current wiki state using the LLM, creating category-aware index entries, refreshed scope boundaries, and updated terminology guidelines, without touching pages already linked in the index. Run it once after initial install to get a rich scaffold, then schedule it weekly as the wiki grows.

Business values

| Value | How |
| --- | --- |
| Faster onboarding | New team members query the wiki instead of digging through documents |
| Audit trail | Every ingest recorded in audit.db with source hash, token count, and timestamp |
| Cost control | Configurable soft-warn and hard-gate thresholds; 3-layer cache reduces repeat spend |
| Compliance | Local-first: source documents and compiled knowledge never leave your machine |
| Extensibility | Hooks fire on every event; custom skills load without a server restart |
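
The soft-warn and hard-gate thresholds in the cost-control row can be pictured as a two-level budget check. This is a sketch under assumptions: the function, the `Verdict` names, and the use of US dollars are hypothetical, and the text does not say whether Synthadoc gates on projected or actual spend.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    WARN = "warn"    # soft threshold crossed: run the job but notify
    BLOCK = "block"  # hard gate crossed: refuse to run the job


def check_budget(spent_usd: float, next_job_usd: float,
                 soft_warn_usd: float, hard_gate_usd: float) -> Verdict:
    """Classify a job by the projected total spend after it runs."""
    if soft_warn_usd > hard_gate_usd:
        raise ValueError("soft-warn threshold must not exceed the hard gate")
    projected = spent_usd + next_job_usd
    if projected > hard_gate_usd:
        return Verdict.BLOCK
    if projected > soft_warn_usd:
        return Verdict.WARN
    return Verdict.OK
```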

Why Synthadoc?

Competitive advantages

| Capability | Synthadoc | Typical RAG | NotebookLM | Notion AI |
| --- | --- | --- | --- | --- |
| Ingest-time synthesis | Yes | No | Partial | No |
| Contradiction detection | Yes | No | No | No |
| Orphan page detection | Yes | No | No | No |
| Adversarial claim review | Yes (concurrent second-LLM pass; flags overstated claims and unsupported assertions per page) | No | No | No |
| Claim-level provenance | Yes (^[file:L-L] citations on every claim; Source Viewer in Obsidian; PDF page resolution; global provenance table; broken-citation lint) | No | No | No |
| Persistent wikilink graph | Yes | No | No | No |
| Local-first (no cloud data) | Yes | Varies | No | No |
| Custom skill plugins | Yes | Limited | No | No |
| Obsidian integration | Yes | No | No | No |
| Cost guard + audit trail | Yes (per-job token + cost log; claim citations DB; audit citations CLI; ingest/lint/citation event types; full audit history API) | No | No | No |
| Hook / CI integration | Yes (2 events) | No | No | No |
| Offline browsable artifact | Yes | No | No | No |
| Multi-wiki isolation | Yes | No | No | No |
| Web search → wiki pages | Yes | No | No | No |
| Multiple LLM support | Yes (Gemini, Groq, MiniMax, DeepSeek, Anthropic, OpenAI, Ollama) | No | No | No |
| Auto wiki overview page | Yes | No | No | No |
| Resumable job queue + retry | Yes | No | No | No |
| Query decomposition | Yes (parallel sub-queries) | No | No | No |
| Knowledge gap detection | Yes | No | No | No |
| Web search decomposition | Yes (parallel Tavily) | No | No | No |
| Semantic re-ranking (vector) | Yes (optional fastembed) | Varies | No | No |
| Scaffold automation | Yes | No | No | No |
| Coding tool as LLM provider | Yes (Claude Code, Opencode; no API key) | No | No | No |
| YouTube transcript ingest | Yes (standard + Shorts, no API key, timestamped) | No | No | No |
| Multilingual / CJK queries | Yes (Chinese, Japanese, Korean; no false gaps) | Limited | No | No |
| Query-scoped routing | Yes (ROUTING.md: branch-scoped BM25, query auto-selects branch) | No | No | No |
| Candidates staging | Yes (ingest to staging area, promote or discard) | No | No | No |
| Context packs | Yes (goal → sub-questions → token-budget evidence pack) | No | No | No |

Key differentiators vs. RAG

RAG chunks documents and retrieves them at query time. Synthadoc compiles knowledge: every new source is synthesized into the existing wiki graph at ingest time.

  • Contradictions are caught, not blended. When two sources disagree, Synthadoc flags the page — RAG silently averages both claims.
  • Knowledge is linked, not scattered. [[wikilinks]] connect related pages into a navigable graph visible in Obsidian and queryable with Dataview.
  • The artifact outlives the tool. Close the server, open the wiki folder in any Markdown editor — the knowledge is all there, human-readable, no proprietary format.
  • Cost-efficient at scale. Two-step ingest with cached analysis means repeated ingest of similar sources costs near-zero tokens. Three cache layers stack for lint and query too.
  • Ingest is durable, not fragile. Every ingest request becomes a queued job with automatic retry and a persistent audit record. Batch a hundred documents and resume after a crash — no work is lost.
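
The queue-and-retry behaviour in the last bullet can be sketched as a loop with exponential backoff. This is a simplified in-memory illustration with hypothetical names and defaults; a durable version would keep `pending` in a database so that it survives a crash, which this sketch does not attempt.

```python
import time
from collections import deque
from typing import Callable


def run_queue(jobs: list[str], handler: Callable[[str], None],
              max_attempts: int = 3, base_delay: float = 0.5,
              sleep: Callable[[float], None] = time.sleep) -> dict[str, str]:
    """Process jobs in order; retry failures with exponential backoff."""
    pending = deque((job, 1) for job in jobs)
    outcome: dict[str, str] = {}
    while pending:
        job, attempt = pending.popleft()
        try:
            handler(job)
            outcome[job] = "done"
        except Exception:
            if attempt >= max_attempts:
                outcome[job] = "failed"
            else:
                sleep(base_delay * 2 ** (attempt - 1))  # 0.5 s, then 1 s, then 2 s, ...
                pending.append((job, attempt + 1))      # re-queue behind other work
    return outcome
```

Passing `sleep` as a parameter keeps the function testable without real delays.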

Architecture

[Figure: Synthadoc architecture diagram]

For full architecture details, data models, API reference, and plugin development guide see docs/design.md.


What’s Included

See docs/design.md — Appendix A: Release Feature Index for a full feature list by version.


Installation

Prerequisites

| Requirement | Version | Notes |
| --- | --- | --- |
| Python | 3.11+ | |
| Node.js | 18+ | Obsidian plugin build only |
| Git | any | |
| LLM API key | | At least one required, unless using Claude Code or Opencode (see below) |
| Tavily API key | | Optional; web search feature only |

LLM API key — at least one required (unless using Claude Code or Opencode — see the last two rows below):

| Provider | Free tier | Vision | Get key |
| --- | --- | --- | --- |
| Gemini Flash | Yes (15 RPM / 1M tokens/day, no credit card) | Yes | aistudio.google.com (https://aistudio.google.com/app/apikey) |
| Groq | Yes (rate-limited) | No | console.groq.com (https://console.groq.com/keys) |
| Ollama | Yes (runs locally, no key) | Model-dependent | ollama.com (https://ollama.com) |
| MiniMax | No (pay-per-token) | Yes | platform.minimax.io (https://platform.minimax.io/) |
| DeepSeek | No (pay-per-token, very cheap text rates) | No | platform.deepseek.com (https://platform.deepseek.com/api_keys) |
| Anthropic | No | Yes | console.anthropic.com (https://console.anthropic.com/) |
| OpenAI | No | Yes | platform.openai.com (https://platform.openai.com/api-keys) |
| Claude Code | Included with subscription, no API key | No | Set provider = "claude-code" in config.toml |
| Opencode | Included with subscription, no API key | No | Set provider = "opencode" in config.toml |

Tavily API key (optional — enables web search): Get a free key at tavily.com (https://tavily.com). Without it, web search jobs will fail but all other features work normally.


Step 1 — Clone and install

```bash
git clone https://github.com/paul…
```

