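@wsl8297: If you have a pile of PDFs, documents, and project materials to feed to an AI, Synthadoc is a direction well worth a look. GitHub: https://github.com/axoviq-ai/synthadoc… It compiles raw materials into a structured wiki at ingest time, automatically…

X AI KOLs Timeline 2026/05/23 01:59 Tools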

pdf-processing document-to-wiki knowledge-base open-source obsidian llm structured-wiki

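Summary

Synthadoc is an open-source tool that compiles PDFs, documents, and other project materials into a structured local Markdown wiki, automatically building cross-references and detecting contradictions. It suits individuals or small teams doing offline knowledge management.

If you have a pile of PDFs, documents, and project materials to feed to an AI, Synthadoc is a direction well worth a look.

GitHub: https://github.com/axoviq-ai/synthadoc…

It compiles raw materials into a structured wiki at ingest time, automatically builds cross-references, detects contradictions, flags orphan pages, and outputs plain Markdown files that can be used directly in Obsidian.

Use cases:
• Independent developers who need to organize personal research materials into a knowledge base
• Small teams that want to unify internal documents into a traceable wiki
• Compliance-sensitive enterprises whose data cannot go to the cloud
• Knowledge management that must be accessible offline with no tool dependency

It is more reliable than RAG's query-time synthesis: contradictions are not blended together, and knowledge does not scatter into fragments.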

Cached at: 2026/05/23 04:02

axoviq-ai/synthadoc

Source: https://github.com/axoviq-ai/synthadoc

Synthadoc

      .-+###############+-.
    .##                   ##.
   ##    .----.   .----.    ##
  ##    /######\ /######\    ##
  ##    |######| |######|    ##
  ##    | [SD] | | wiki |    ##
  ##    |######| |######|    ##
  ##    \######/ \######/    ##
   ##    '----'   '----'    ##
    '##                   ##'
      '-+###############+-'

       S Y N T H A D O C
    Community Edition  v0.5.0
  ────────────────────────────────
  Domain-agnostic LLM wiki engine

Document version: v0.5.0

Built for solo users, small teams, and enterprises who need a domain-specific knowledge base that stays accurate as documents accumulate.

Synthadoc reads your raw source documents (PDFs, spreadsheets, PPTs, web pages, images, videos, Word files, TXTs) and uses an LLM to synthesize them into a persistent, structured wiki. Cross-references are built automatically, contradictions are detected and surfaced, orphan pages are flagged, and every answer cites its sources. Output is stored as local Markdown files that open directly in Obsidian or any other Markdown-based wiki tool.

▶ Watch the demo on YouTube

Who Is It For?
Inspiration and Vision
Problems Addressed
Why Synthadoc?
Architecture
What’s Included
Installation
Quick-Start Guide
Creating Your Own Wiki
Configuration
Command Reference by Use Case
Administrative Reference
Understanding Logs and the Audit Trail
Customization
Links

Who Is It For?

Synthadoc scales from a single researcher to a company-wide knowledge platform:

Team size	Typical use case
Solo / 1–2 people	Personal research wiki, freelance knowledge base, indie hacker documentation - run it free on Gemini Flash or a local Ollama model with zero ongoing cost
Small team (3–20)	Centralized internal knowledge base for startups and departments. Sources from individual team members are merged into one consistent wiki, contradictions are detected and then resolved or queued for review, and the wiki keeps growing as the team adds documents
Medium / enterprise	Compliance-sensitive knowledge bases that must stay local; per-department wikis on separate ports; audit trail for every ingest and cost event; hook system for CI/CD integration; OpenTelemetry for ops dashboards

No cloud account. No vendor lock-in. The wiki is plain Markdown — open it in any editor, back it up with git, sync it with any cloud drive.

Inspiration and Vision

“The LLM should be able to maintain a wiki for you.” — Andrej Karpathy, LLM Wiki gist

Most knowledge-management tools retrieve and summarize at query time. Synthadoc inverts this: it compiles knowledge at ingest time. Every new source enriches and cross-links the entire corpus rather than just appending a new chunk. The wiki is the artifact: readable, editable, and browsable without any tool running.

Long-term alignment:

Direction	How Synthadoc moves there
Agent orchestration	Orchestrator dispatches parallel IngestAgent, QueryAgent, LintAgent sub-agents with cost guards and retry backoff
Sub-agent skills/plugins	A 3-tier lazy-load capability system lets you plug in custom skills and hooks without modifying the core, so extensions do not compromise core stability
LLM wiki vs. RAG	Pre-compiled structured knowledge beats query-time synthesis for contradiction detection, graph traversal, and offline access
CLI / HTTP	The CLI and REST endpoints share one interface covering data ingestion, querying, automated linting, security auditing, and job orchestration
Local-first	All data stays on your machine; localhost-only network binding; no cloud dependency except the LLM API itself
Provider choice	LLM backends including free-tier Gemini and Groq, paid Anthropic/OpenAI/DeepSeek/MiniMax, local Ollama, and coding-tool CLI providers (Claude Code, Opencode) — no API key required if you already have a subscription

Problems Addressed

1. RAG conflates contradictions; Synthadoc surfaces them

When two sources disagree, vector search returns both and the LLM silently blends them. Synthadoc detects the conflict during ingest, flags the page with status: contradicted, preserves both claims with citations, and either auto-resolves (if confidence ≥ threshold) or queues the conflict for human review.
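The decision described above can be sketched in a few lines. This is only an illustration of the rule, not Synthadoc's code: the `Conflict` class, the `route_conflict` function, and the 0.85 threshold are hypothetical. Only the `status: contradicted` flag and the "auto-resolve at or above the threshold, otherwise queue for review" rule come from the text.

```python
from dataclasses import dataclass

AUTO_RESOLVE_THRESHOLD = 0.85  # hypothetical value; the text does not state the real one


@dataclass
class Conflict:
    page: str
    claim_a: str       # first claim, kept with its citation
    claim_b: str       # conflicting claim, kept with its citation
    confidence: float  # how sure the resolver is about which claim wins


def route_conflict(conflict: Conflict, threshold: float = AUTO_RESOLVE_THRESHOLD) -> dict:
    """Flag the page, keep both claims, then auto-resolve or queue for review."""
    action = "auto_resolve" if conflict.confidence >= threshold else "human_review"
    return {
        "page": conflict.page,
        "status": "contradicted",
        "claims": [conflict.claim_a, conflict.claim_b],
        "action": action,
    }
```

Both claims stay in the result in either branch, which is the difference from query-time blending: nothing is discarded before a person or a high-confidence resolution decides.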

2. Knowledge fragments; Synthadoc links it

RAG chunks are isolated. Synthadoc builds [[wikilinks]] between related pages during every ingest pass. The resulting graph is visible in Obsidian’s Graph view and queryable with Dataview.

3. Orphan knowledge has no address; Synthadoc finds it

Pages that exist but are referenced by nothing are surfaced by the lint system, with ready-to-paste index entries so you can quickly integrate them.
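Because pages are plain Markdown linked with [[wikilinks]], orphan detection reduces to a graph question: which pages have no inbound links? The sketch below shows the idea under stated assumptions: `find_orphans` is a hypothetical helper, pages are assumed to be `*.md` files in one folder, and link targets are assumed to match file names without the extension. Synthadoc's own lint check may work differently.

```python
import re
from pathlib import Path

# Captures the target of [[target]], [[target|alias]] and [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphans(wiki_dir: str, roots: tuple[str, ...] = ("index",)) -> list[str]:
    """Return page names (file stems) that no other page links to."""
    pages = {p.stem: p.read_text(encoding="utf-8") for p in Path(wiki_dir).glob("*.md")}
    linked: set[str] = set()
    for name, text in pages.items():
        for target in WIKILINK.findall(text):
            target = target.strip()
            if target != name:  # a page linking to itself does not count
                linked.add(target)
    # Root pages such as the index are entry points, so they are never orphans.
    return sorted(name for name in pages if name not in linked and name not in roots)
```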

4. LLM-compiled content can be overconfident; Synthadoc audits it

An LLM synthesising source documents naturally produces confident prose — but may overstate claims, omit caveats, or accept a source’s framing uncritically. The adversarial lint pass runs a concurrent second-LLM review of every page: it plays devil’s advocate to surface issues the primary model accepted too readily — contested estimates, unsupported superlatives, and claims that contradict well-established facts. Warnings are stored in page frontmatter and surfaced in both the CLI report and the Obsidian lint modal. The reviewer is calibrated to flag only high-confidence issues, producing a useful signal without noise. For the strongest signal, point the adversarial pass at a different model family: a distinct model is far more likely to challenge assumptions than the same model reviewing its own output.

Claim-Level Provenance

Every substantive claim in the wiki is annotated with ^[filename:L-L] — a citation pointing to the exact line range in the source file it came from. Click the citation chip in Obsidian to open a Source Viewer showing the highlighted passage; for PDF sources, Synthadoc resolves the PDF page number automatically via a pagemap sidecar. A global Provenance modal shows all citations across the wiki, sortable and filterable. Broken citations are caught by the lint system and logged in the audit trail. Run synthadoc audit citations to query citations from the CLI.
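To make the citation format concrete, here is a small parser sketch. It assumes a citation looks like `^[turing.pdf:L12-L18]`; the text only gives the shape `^[filename:L-L]`, so the exact digits-and-prefix layout is a guess and the regex accepts the range with or without the `L` prefixes. The `Citation` type, both function names, and the definition of "broken" (inverted range or range past the end of the source) are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed concrete form: ^[turing.pdf:L12-L18]; the "L" prefixes are optional here.
CITATION = re.compile(r"\^\[([^\]:]+):L?(\d+)-L?(\d+)\]")


class Citation(NamedTuple):
    source: str
    start: int
    end: int


def extract_citations(markdown: str) -> list[Citation]:
    """Find every claim citation in a page body."""
    return [Citation(s, int(a), int(b)) for s, a, b in CITATION.findall(markdown)]


def is_broken(c: Citation, source_line_count: int) -> bool:
    """Treat a citation as broken if its range is inverted or outside the source."""
    return c.start < 1 or c.end < c.start or c.end > source_line_count
```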

5. Re-synthesis is expensive; Synthadoc caches it

A 3-layer cache (embedding, LLM response, provider prompt cache) means repeated lint runs on unchanged pages cost near-zero tokens.
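Why unchanged pages cost almost nothing can be seen with a toy version of one layer, the LLM response cache. The sketch assumes the cache key is a hash of the prompt plus the page content; `ResponseCache` and `call_llm` are hypothetical names, and Synthadoc's actual keying scheme is not described here.

```python
import hashlib


class ResponseCache:
    """Toy LLM-response cache keyed by a hash of the prompt and page content."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.misses = 0  # number of times the (expensive) LLM call actually ran

    @staticmethod
    def _key(prompt: str, content: str) -> str:
        # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
        return hashlib.sha256(f"{prompt}\x00{content}".encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt: str, content: str, call_llm) -> str:
        key = self._key(prompt, content)
        if key not in self._store:
            self.misses += 1
            self._store[key] = call_llm(prompt, content)
        return self._store[key]
```

Running the same lint prompt twice over an unchanged page produces the same key, so the second run is a lookup. Editing the page changes the hash and forces a fresh call for that page only.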

6. Knowledge is locked in tools; Synthadoc escapes it

Every page is a plain Markdown file with YAML frontmatter. No proprietary format. Open the folder in any editor, put it in git, sync it with any cloud drive.
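Because the format is open, a page can be read with a few lines of code and no Synthadoc install. The sketch below is a minimal illustration: `split_frontmatter` is a hypothetical helper that handles only flat `key: value` lines, and the sample page is invented (the `title` field is an assumption; `status: contradicted` is the flag mentioned earlier).

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown page into (frontmatter dict, body).

    Handles only flat `key: value` lines; use a real YAML parser for nested data.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat the whole file as body


SAMPLE_PAGE = """---
title: Moore's Law
status: contradicted
---
Transistor counts double roughly every two years.
"""

meta, body = split_frontmatter(SAMPLE_PAGE)
```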

7. Wiki structure decays as content grows; Synthadoc regenerates it

As the wiki accumulates pages, the index.md table of contents, domain scope (purpose.md), and LLM behaviour guidelines (AGENTS.md) can drift out of sync with actual content. The scaffold command regenerates all three from the current wiki state using the LLM, creating category-aware index entries, refreshed scope boundaries, and updated terminology guidelines, without touching pages already linked in the index. Run it once after initial install to get a rich scaffold, then schedule it weekly as the wiki grows.

Business values

Value	How
Faster onboarding	New team members query the wiki instead of digging through documents
Audit trail	Every ingest recorded in `audit.db` with source hash, token count, and timestamp
Cost control	Configurable soft-warn and hard-gate thresholds; 3-layer cache reduces repeat spend
Compliance	Local-first — source documents and compiled knowledge never leave your machine
Extensibility	Hooks fire on every event; custom skills load without a server restart

Why Synthadoc?

Competitive advantages

Capability	Synthadoc	Typical RAG	NotebookLM	Notion AI
Ingest-time synthesis	Yes	No	Partial	No
Contradiction detection	Yes	No	No	No
Orphan page detection	Yes	No	No	No
Adversarial claim review	Yes (concurrent second-LLM pass — flags overstated claims and unsupported assertions per page)	No	No	No
Claim-level provenance	Yes (`^[file:L-L]` citations on every claim; Source Viewer in Obsidian; PDF page resolution; global provenance table; broken-citation lint)	No	No	No
Persistent wikilink graph	Yes	No	No	No
Local-first (no cloud data)	Yes	Varies	No	No
Custom skill plugins	Yes	Limited	No	No
Obsidian integration	Yes	No	No	No
Cost guard + audit trail	Yes (per-job token + cost log; claim citations DB; `audit citations` CLI; ingest/lint/citation event types; full audit history API)	No	No	No
Hook / CI integration	Yes (2 events)	No	No	No
Offline browsable artifact	Yes	No	No	No
Multi-wiki isolation	Yes	No	No	No
Web search → wiki pages	Yes	No	No	No
Multiple LLM providers	Yes (Gemini, Groq, MiniMax, DeepSeek, Anthropic, OpenAI, Ollama)	No	No	No
Auto wiki overview page	Yes	No	No	No
Resumable job queue + retry	Yes	No	No	No
Query decomposition	Yes (parallel sub-queries)	No	No	No
Knowledge gap detection	Yes	No	No	No
Web search decomposition	Yes (parallel Tavily)	No	No	No
Semantic re-ranking (vector)	Yes (optional fastembed)	Varies	No	No
Scaffold automation	Yes	No	No	No
Coding tool as LLM provider	Yes (Claude Code, Opencode — no API key)	No	No	No
YouTube transcript ingest	Yes (standard + Shorts, no API key, timestamped)	No	No	No
Multilingual / CJK queries	Yes (Chinese, Japanese, Korean — no false gaps)	Limited	No	No
Query-scoped routing	Yes (ROUTING.md — branch-scoped BM25, query auto-selects branch)	No	No	No
Candidates staging	Yes (ingest to staging area, promote or discard)	No	No	No
Context packs	Yes (goal → sub-questions → token-budget evidence pack)	No	No	No

Key differentiators vs. RAG

RAG chunks documents and retrieves them at query time. Synthadoc compiles knowledge: every new source is synthesized into the existing wiki graph at ingest time.

Contradictions are caught, not blended. When two sources disagree, Synthadoc flags the page — RAG silently averages both claims.
Knowledge is linked, not scattered. [[wikilinks]] connect related pages into a navigable graph visible in Obsidian and queryable with Dataview.
The artifact outlives the tool. Close the server, open the wiki folder in any Markdown editor — the knowledge is all there, human-readable, no proprietary format.
Cost-efficient at scale. Two-step ingest with cached analysis means repeated ingest of similar sources costs near-zero tokens. Three cache layers stack for lint and query too.
Ingest is durable, not fragile. Every ingest request becomes a queued job with automatic retry and a persistent audit record. Batch a hundred documents and resume after a crash — no work is lost.

Architecture

Synthadoc Architecture

For full architecture details, data models, API reference, and plugin development guide see docs/design.md.

What’s Included

See docs/design.md — Appendix A: Release Feature Index for a full feature list by version.

Installation

Prerequisites

Requirement	Version	Notes
Python	3.11+
Node.js	18+	Obsidian plugin build only
Git	any
LLM API key	—	At least one required — unless using Claude Code or Opencode (see below)
Tavily API key	—	Optional — web search feature only

LLM API key — at least one required (unless using Claude Code or Opencode — see the last two rows below):

Provider	Free tier	Vision	Get key
Gemini Flash	Yes — 15 RPM / 1M tokens/day, no credit card	Yes	aistudio.google.com
Groq	Yes — rate-limited	No	console.groq.com
Ollama	Yes — runs locally, no key	Model-dependent	ollama.com
MiniMax	No — pay-per-token	Yes	platform.minimax.io
DeepSeek	No — pay-per-token (very cheap text rates)	No	platform.deepseek.com
Anthropic	No	Yes	console.anthropic.com
OpenAI	No	Yes	platform.openai.com
Claude Code	Included with subscription, no API key	No	Set `provider = "claude-code"` in config.toml
Opencode	Included with subscription, no API key	No	Set `provider = "opencode"` in config.toml

Tavily API key (optional — enables web search): Get a free key at tavily.com. Without it, web search jobs will fail but all other features work normally.

Step 1 — Clone and install

git clone https://github.com/paulmchen/synthadoc.git
cd synthadoc
pip3 install -e ".[dev]"

If you already have Synthadoc wikis installed, upgrade the Obsidian plugin in all registered wikis to keep them in sync:

synthadoc plugin upgrade

Step 2 — Run the Python test suite

Validate that the Python engine builds and all tests pass before proceeding:

pytest --ignore=tests/performance/ -q

Expected: all tests pass, 0 failures. If any fail, check the error output before continuing.

Performance benchmarks (optional — Linux/macOS, measures SLOs):

pytest tests/performance/ -v --benchmark-disable

Step 3 — Test the Obsidian plugin

The pre-built main.js is committed to the repo — you do not need to rebuild it unless you modify the plugin source code. To run the plugin unit tests:

cd obsidian-plugin
npm install
npm test         # runs Vitest unit tests

If you modify src/main.ts, rebuild the bundle before installing:

npm run build    # produces main.js

Step 4 — Set your API keys

At least one LLM API key is required — unless you use Claude Code or Opencode as your provider, in which case no separate API key is needed (see Coding tool CLI providers).

Synthadoc defaults to Gemini Flash as the LLM provider — it’s free, requires no credit card, and offers 1 million tokens per day. Get a key at aistudio.google.com/app/apikey (click “Create API key”).

Web search uses Tavily (TAVILY_API_KEY) — optional, only needed for synthadoc ingest "search for: …" jobs.

# macOS / Linux — add to ~/.bashrc or ~/.zshrc to persist
export GEMINI_API_KEY=AIza…          # default — free tier, 1M tokens/day
export GROQ_API_KEY=gsk_…            # alternative free tier — 100K tokens/day
export ANTHROPIC_API_KEY=sk-ant-…    # paid — highest quality
export MINIMAX_API_KEY=…             # paid, pay-per-token (supports image input)
export DEEPSEEK_API_KEY=…            # paid, pay-per-token (text only, no image support)
export TAVILY_API_KEY=tvly-…         # web search (optional)

# Windows cmd — current session only
set GEMINI_API_KEY=AIza…
set GROQ_API_KEY=gsk_…
set ANTHROPIC_API_KEY=sk-ant-…
set MINIMAX_API_KEY=…
set DEEPSEEK_API_KEY=…
set TAVILY_API_KEY=tvly-…

# Windows cmd — permanent (open a new cmd window after running)
setx GEMINI_API_KEY AIza…
setx GROQ_API_KEY gsk_…
setx ANTHROPIC_API_KEY sk-ant-…
setx MINIMAX_API_KEY …
setx DEEPSEEK_API_KEY sk-…
setx TAVILY_API_KEY tvly-…

To switch provider, edit [agents] in <wiki-root>/.synthadoc/config.toml and restart synthadoc serve. See Appendix — Switching LLM providers for step-by-step instructions.

Step 5 — Verify

synthadoc --version

Step 6 — Install a demo wiki, then start the engine

A wiki is a self-contained, structured knowledge base — a folder of Markdown pages linked by topic, maintained and cross-referenced automatically by Synthadoc. Think of it as a living document that grows smarter with every source you feed it: each ingest pass adds new pages, updates existing ones, and flags contradictions. For your own work, you can build and grow a domain-specific wiki — whether that’s market research, a technical knowledge base, or a team handbook — and query it in plain English or other languages at any time.

A wiki must be installed before the engine can serve it. The fastest way to get started is the History of Computing demo, which ships with 13 pre-built pages and sample source files — no LLM API key required to browse it.

Install the demo wiki:

# Linux / macOS
synthadoc install history-of-computing --target ~/wikis --demo

# Windows (cmd.exe)
synthadoc install history-of-computing --target %USERPROFILE%\wikis --demo

Then start the engine:

# Foreground — keeps the terminal; logs stream to the console
synthadoc serve -w history-of-computing

# Background — releases the terminal; logs go to the wiki log file
synthadoc serve -w history-of-computing --background

The server binds to http://127.0.0.1:7070 by default (port is set in <wiki-root>/.synthadoc/config.toml). The server is localhost-only — it never binds to an external network interface. Leave it running while you work — the Obsidian plugin, CLI ingest commands, and query commands all talk to it.

To stop a background server:

# Linux / macOS
kill <PID>

# Windows (cmd)
taskkill /PID <PID> /F

The PID is printed when the background server starts and saved to <wiki-root>/.synthadoc/server.pid.

Quick-Start Guide

The History of Computing demo includes 13 pre-built pages, raw source files covering clean-merge, contradiction, and orphan scenarios, and a full walkthrough of key Synthadoc features.

Full step-by-step walkthrough: docs/user-quick-start-guide.md

The guide covers:

Verify the demo server started (banner, health check)
Install Dataview in Obsidian
Install the Synthadoc plugin and open the vault
Review wiki structure and key files (index, purpose, AGENTS.md, dashboard)
Query the pre-built wiki — including knowledge gap detection
Batch ingest all demo source files
Resolve a contradiction
Fix an orphan page
Run the adversarial lint pass — flag overstated claims across all pages
Web search ingestion with automatic decomposition
Ingest a YouTube video
Enrich the wiki with scaffold (regenerate/update index, purpose, AGENTS.md)
Audit features (token cost, history, events)
Schedule recurring operations
Set up query-scoped routing with ROUTING.md
Stage and review candidate pages before promoting them
Build a context pack for grounded LLM prompts
Verify claim provenance — source-line citations, broken citation audit, global provenance table

Creating Your Own Wiki

Unlike the demo (which ships with pre-built pages), your own wiki starts from a domain description and grows as you feed it sources. Three commands are all you need to get started:

synthadoc install market-condition-canada --target ~/wikis --domain "Market conditions and trends in Canada"
synthadoc use market-condition-canada   # set as the default wiki — no -w needed from here on
synthadoc serve

--domain is a free-text description of the subject area — the LLM uses it to generate four domain-aware starter files via scaffold:

File	Purpose
`wiki/index.md`	Table of contents: domain-relevant categories with `[[wikilinks]]`
`wiki/purpose.md`	Scope declaration — tells the ingest agent what belongs and what to ignore
`AGENTS.md`	LLM behaviour guidelines — tone, terminology, and synthesis style
`wiki/dashboard.md`	Live Dataview dashboard — orphan pages, contradictions, page count

Then open the wiki folder in Obsidian as a new vault, install the Dataview community plugin, and copy the Synthadoc plugin files with one command (requires the wiki to be registered via synthadoc install first):

synthadoc plugin install market-condition-canada

This automatically writes the correct server URL into the plugin’s data.json — no manual configuration in Obsidian settings is needed.

The Quick-Start Guide covers the full Obsidian setup in detail — see docs/user-quick-start-guide.md.

Recommended growth loop:

1. Seed with web searches — pull in real content for the topics you care about:

synthadoc ingest "search for: Economy, employment and labour market analysis in Toronto GTA"
synthadoc ingest "search for: Bank of Canada interest rate outlook 2025"
synthadoc jobs list   # watch progress

Each search fans out into up to 20 parallel URL ingest jobs. Query decomposition and web search decomposition (see below) make broad topics yield much richer results than a single search.

2. Review candidates (optional quality gate): enable staging before large ingest batches so that pages below your confidence threshold wait for review instead of entering the BM25 search index immediately:

synthadoc staging policy threshold   # pages below high confidence → wiki/candidates/
synthadoc candidates list            # see what's waiting
synthadoc candidates promote early-internet-history   # approve individually
synthadoc candidates promote --all   # or approve everything at once
synthadoc candidates discard punch-card-era           # discard pages that don't belong

Skip this step if you trust all your sources — staging policy off is the default.

3. Lint and query — check for contradictions, flag overstated claims, verify citations, and confirm the wiki answers your questions:

synthadoc lint run                          # full lint: structural checks + adversarial pass (default)
synthadoc lint run --no-adversarial         # structural only — skip the adversarial LLM review
synthadoc lint report                       # view all issues including citation violations (Check 5)
synthadoc audit citations --broken          # list claim citations that failed validation
synthadoc query "What are the current employment trends in the Toronto GTA?"

4. Re-run scaffold — after pages accumulate, scaffold regenerates a richer index that reflects actual content. Pages already linked in index.md are never overwritten:

synthadoc scaffold

5. Set up routing — once the wiki has ~100+ pages across distinct topic areas, routing narrows each query to the relevant branch, cutting latency and reducing noise in synthesis:

synthadoc routing init   # generate ROUTING.md from current index.md (one-time)

From this point, queries automatically scope to the 1–2 most relevant topic branches. New pages created by ingest are auto-slotted into ROUTING.md — no manual maintenance needed. See Appendix H in the Quick-Start Guide for latency benchmarks across corpus sizes.

6. Build a context pack — assemble cited wiki excerpts within a token budget for use in an external agent prompt:

synthadoc context build "Toronto GTA real estate market" --tokens 4000

Returns ranked page excerpts with relevance scores, confidence levels, and source paths — no synthesis. The POST /context/build REST endpoint and MCP tool call make this callable from any agent pipeline. See docs/design.md — Context packs for the knowledge backend pattern.
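The idea of a token-budget evidence pack can be illustrated with a greedy selection: rank the excerpts, then keep each one that still fits. This is a sketch of the general technique, not Synthadoc's algorithm. The `Excerpt` class, the `build_pack` function, and the four-characters-per-token estimate are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Excerpt:
    path: str
    text: str
    score: float  # relevance score; higher is better


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def build_pack(excerpts: list[Excerpt], budget: int) -> list[Excerpt]:
    """Greedily keep the highest-scoring excerpts that still fit the budget."""
    pack, used = [], 0
    for ex in sorted(excerpts, key=lambda e: e.score, reverse=True):
        cost = estimate_tokens(ex.text)
        if used + cost <= budget:
            pack.append(ex)
            used += cost
    return pack
```

A lower-ranked excerpt can still make it into the pack when a higher-ranked one is too large, which keeps the budget well used at the cost of strict rank order.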

7. Schedule recurring updates — keep the wiki fresh and the routing table clean automatically:

synthadoc schedule add --op "ingest --batch raw_sources/" --cron "0 2 * * *"
synthadoc schedule add --op "lint run"      --cron "0 3 * * 0"
synthadoc schedule add --op "scaffold"      --cron "0 4 * * 0"
synthadoc schedule add --op "routing clean" --cron "0 5 * * 0"

Run order matters: lint first (removes dead wikilinks), scaffold next (regenerates index), routing clean last (prunes ROUTING.md entries for deleted pages).
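The cron strings above use the standard five fields: minute, hour, day of month, month, day of week (0 is Sunday). The helper below is a hypothetical reading aid that turns the simple patterns used in this README into plain English. It only handles a fixed minute and hour with `*` for day of month and month.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def describe_cron(expr: str) -> str:
    """Describe a cron expression with fixed minute/hour and '*' day-of-month/month."""
    minute, hour, dom, month, dow = expr.split()
    if dom != "*" or month != "*":
        raise ValueError("this sketch only handles '*' for day-of-month and month")
    when = f"{int(hour):02d}:{int(minute):02d}"
    if dow == "*":
        return f"every day at {when}"
    return f"every {DAYS[int(dow) % 7]} at {when}"  # 7 is also accepted as Sunday


for expr in ("0 2 * * *", "0 3 * * 0", "0 4 * * 0", "0 5 * * 0"):
    print(expr, "->", describe_cron(expr))
```

Read this way, the schedule above runs ingest nightly at 02:00, then lint, scaffold, and routing clean one hour apart on Sunday, which matches the required run order.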

How decomposition works

Both query and web search ingest automatically split complex inputs into focused parallel sub-tasks — a compound question becomes multiple BM25 retrievals merged before synthesis; a broad search topic becomes multiple focused Tavily keyword searches whose results are merged and deduplicated. Both fall back gracefully if the LLM decomposition call fails.

See docs/design.md — Query decomposition and web search decomposition for the full design.
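The fan-out, merge, and fallback pattern can be sketched as follows. This is an illustration under stated assumptions: `decompose` is a naive stand-in for the LLM decomposition call, `search` is any callable that returns a list of results for one sub-query (BM25 retrieval or a web search in the real system), and both names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(question: str) -> list[str]:
    """Stand-in for the LLM decomposition call: split on ' and '."""
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p for p in parts if p]


def run_decomposed(question: str, search) -> list[str]:
    """Run one search per sub-query in parallel, then merge and deduplicate."""
    try:
        sub_queries = decompose(question)
    except Exception:
        sub_queries = []
    if not sub_queries:
        sub_queries = [question]  # graceful fallback: search the original input
    with ThreadPoolExecutor(max_workers=len(sub_queries)) as pool:
        result_lists = list(pool.map(search, sub_queries))  # keeps sub-query order
    merged, seen = [], set()
    for results in result_lists:
        for item in results:
            if item not in seen:  # drop results already returned by another sub-query
                seen.add(item)
                merged.append(item)
    return merged
```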

Semantic re-ranking (vector search)

BM25 keyword search is the default. Optional vector re-ranking (BAAI/bge-small-en-v1.5 cosine similarity) improves recall on conceptually related queries — enable it by installing fastembed and setting [search] vector = true in config. The ~130 MB model is downloaded once; BM25 stays active as fallback.

See docs/design.md — Semantic re-ranking for configuration options and performance notes.
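Cosine-similarity re-ranking itself is simple once embeddings exist. The sketch below uses tiny hand-written vectors in place of real BAAI/bge-small-en-v1.5 embeddings and does not call fastembed. `cosine` and `rerank` are hypothetical helper names showing the general technique.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query_vec: list[float], candidates: dict[str, list[float]]) -> list[str]:
    """Order keyword-search candidates by similarity to the query embedding."""
    return sorted(candidates, key=lambda page: cosine(query_vec, candidates[page]), reverse=True)
```

In this arrangement the keyword search still selects the candidate set, and the vectors only reorder it, which is why BM25 can remain as the fallback when embeddings are unavailable.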

Knowledge gap workflow

When a query returns thin or empty results, the wiki doesn’t yet cover the topic. Fill the gap with a targeted web search ingest, wait for jobs, then re-query. Each ingest cycle makes the wiki denser — future queries need the web less.

See docs/design.md — Knowledge gap workflow for the full pattern.

See docs/design.md for a full description of how ingest, contradiction detection, and orphan tracking work under the hood.

Configuration

You do not need to configure anything to run the demo. The demo wiki ships with its own settings, and sensible built-in defaults cover everything else. Set your API key env var, run synthadoc serve, and go.

For the full configuration reference — layer precedence, global vs. per-project config, all keys and defaults — see Appendix E — Configuration in the Quick-Start Guide, or docs/design.md — Configuration for the complete technical reference.

Command Reference by Use Case

Setting up a wiki

# Create a new empty wiki (LLM scaffold runs automatically if API key is set)
synthadoc install my-wiki --target ~/wikis --domain "Machine Learning"

# Port is auto-assigned (checks all existing wikis to avoid conflicts, even when stopped)
synthadoc install my-wiki --target ~/wikis --domain "Machine Learning"

# Or pin a specific port manually
synthadoc install my-wiki --target ~/wikis --domain "Machine Learning" --port 7071

# Install the demo (includes pre-built pages and raw sources — no LLM call needed)
synthadoc install history-of-computing --target ~/wikis --demo

# List available demo templates
synthadoc demo list

# Install the Obsidian plugin directly into the active Obsidian vault
synthadoc plugin install history-of-computing

Switching the active wiki

# Set a wiki as the default so -w is not required for any subsequent command
synthadoc use my-wiki

# Check which wiki is currently active
synthadoc use

# Clear the saved default (revert to requiring -w on every command)
synthadoc use --clear

Refreshing wiki scaffold

After install, you can re-run the LLM scaffold at any time to regenerate domain-specific content (index categories, AGENTS.md guidelines, purpose.md scope). Pages already linked in index.md are protected and preserved.

# Regenerate scaffold for an existing wiki
synthadoc scaffold -w my-wiki

# Schedule weekly refresh (runs every Sunday at 4 AM)
synthadoc schedule add --op "scaffold" --cron "0 4 * * 0" -w my-wiki

config.toml and dashboard.md are never modified by scaffold.

Running the server

# Start HTTP API + job worker (foreground — terminal stays attached)
synthadoc serve -w my-wiki

# Detach to background — banner shown, then shell is released
# All logs go to <wiki>/.synthadoc/logs/synthadoc.log
synthadoc serve -w my-wiki --background

# Custom port
synthadoc serve -w my-wiki --port 7071

# Verbose debug logging to console
synthadoc serve -w my-wiki --verbose

Ingesting sources

# Single file or URL
synthadoc ingest report.pdf -w my-wiki
synthadoc ingest https://example.com/article -w my-wiki

# Entire folder (parallel, up to max_parallel_ingest at a time)
synthadoc ingest --batch raw_sources/ -w my-wiki

# Manifest file — ingest a curated list of sources in one shot.
# sources.txt: one entry per line; each line is either an absolute file path
# (PDF, DOCX, PPTX, MD, …) or a URL. Blank lines and # comments are ignored.
# Each entry becomes a separate job in the queue, processed sequentially.
#
# Example sources.txt:
#   /home/user/docs/research-paper.pdf
#   /home/user/slides/keynote.pptx
#   https://en.wikipedia.org/wiki/Alan_Turing
#   # this line is ignored
synthadoc ingest --file sources.txt -w my-wiki

# Force re-ingest (bypass deduplication and cache)
synthadoc ingest --force report.pdf -w my-wiki

# Web search — triggers a Tavily search, then ingests each result URL as a child job.
# Prefix the query with any recognised intent: "search for:", "find on the web:",
# "look up:", or "web search:"  (prefix is stripped before the search is sent)
# Requires TAVILY_API_KEY to be set.
#
# Note: web search content is NOT saved to raw_sources/. The flow is direct:
#   query → Tavily → URLs → each URL fetched → wiki pages written
# raw_sources/ is for user-provided local files (PDF, DOCX, PPTX, etc.) only.
# The wiki pages themselves are the persistent output of a web search.
synthadoc ingest "search for: Bank of Canada interest rate decisions 2024" -w my-wiki
synthadoc ingest "find on the web: unemployment trends Ontario Q1 2025" -w my-wiki

# Limit how many URLs are enqueued (default 20, overrides [web_search] max_results)
synthadoc ingest "search for: quantum computing basics" --max-results 5 -w my-wiki

# Multiple web searches at once via a manifest file
# web-searches.txt:
#   search for: Bank of Canada interest rate decisions 2024
#   find on the web: unemployment trends Ontario Q1 2025
#   look up: Toronto housing market affordability index
synthadoc ingest --file web-searches.txt -w my-wiki

# YouTube video — transcript extracted automatically, no API key needed.
# The video must have captions (auto-generated or manual).
# Check: open the video on YouTube → ... → Show transcript.
synthadoc ingest "https://www.youtube.com/watch?v=O5nskjZ_GoI" -w my-wiki
synthadoc ingest "https://youtu.be/O5nskjZ_GoI" -w my-wiki

# YouTube URLs returned by web search are also routed automatically:
# if Tavily returns a YouTube URL, the transcript is ingested instead of the page HTML.
synthadoc ingest "search for: history of computing lecture" -w my-wiki

Each YouTube wiki page opens with an executive summary — what the video is about, the main topics covered, and the key takeaway — followed by the full timestamped transcript for precise citation.
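The manifest rules from the ingest commands above (one entry per line, blank lines and # comments ignored, web-search entries marked by an intent prefix that is stripped) can be sketched as a small classifier. `parse_manifest` is a hypothetical helper written for illustration; matching prefixes case-insensitively is an assumption, and Synthadoc's own parser may differ.

```python
INTENT_PREFIXES = ("search for:", "find on the web:", "look up:", "web search:")


def parse_manifest(text: str) -> list[tuple[str, str]]:
    """Classify each manifest line as ('search' | 'url' | 'file', value)."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        lowered = line.lower()
        prefix = next((p for p in INTENT_PREFIXES if lowered.startswith(p)), None)
        if prefix:
            entries.append(("search", line[len(prefix):].strip()))  # prefix is stripped
        elif lowered.startswith(("http://", "https://")):
            entries.append(("url", line))
        else:
            entries.append(("file", line))
    return entries
```

A quick pass like this over a manifest before running `synthadoc ingest --file` shows how many web searches it contains, which matters because each search can fan out into many URL jobs.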

Querying

# Ask a question — answer cites wiki pages
synthadoc query "What is Moore's Law?" -w my-wiki

# Save the answer as a new wiki page
synthadoc query "What is Moore's Law?" --save -w my-wiki

Linting

# Run a full lint pass (enqueues job)
synthadoc lint run -w my-wiki

# Only contradictions
synthadoc lint run --scope contradictions -w my-wiki

# Auto-apply high-confidence resolutions
synthadoc lint run --auto-resolve -w my-wiki

# Skip adversarial review (structural checks only; also clears existing warnings)
synthadoc lint run --no-adversarial -w my-wiki

# Instant report (reads wiki files directly, no server needed)
synthadoc lint report -w my-wiki

Monitoring jobs

# List all jobs (oldest first by default)
synthadoc jobs list -w my-wiki

# Sort column — created_at (default) | status | operation
synthadoc jobs list --sort created_at -w my-wiki   # oldest first (default)
synthadoc jobs list --sort status -w my-wiki        # alphabetical by status
synthadoc jobs list --sort operation -w my-wiki     # alphabetical by operation type

# Sort direction — asc (default) | desc
synthadoc jobs list --order desc -w my-wiki                     # newest first
synthadoc jobs list --sort status --order desc -w my-wiki       # status Z→A

# Filter by status — pending | in_progress | completed | failed | skipped | dead | cancelled
synthadoc jobs list --status pending -w my-wiki
synthadoc jobs list --status failed -w my-wiki
synthadoc jobs list --status dead -w my-wiki

# Combine sort, order, and status freely
synthadoc jobs list --status failed --sort created_at --order desc -w my-wiki

# Single job detail
synthadoc jobs status <job-id> -w my-wiki

# Retry a dead job
synthadoc jobs retry <job-id> -w my-wiki

# Cancel all pending jobs at once (e.g. after a bad batch ingest)
synthadoc jobs cancel -w my-wiki        # prompts for confirmation
synthadoc jobs cancel --yes -w my-wiki  # skip confirmation

# Remove old records
synthadoc jobs purge --older-than 30 -w my-wiki

Inspecting ingest results

# Preview how a source will be analysed without writing pages
synthadoc ingest report.pdf --analyse-only -w my-wiki
# → {"entities": [...], "tags": [...], "summary": "..."}

Audit trail

# Ingest history: timestamp, source file, wiki page, tokens, cost
synthadoc audit history -w my-wiki            # last 50 records
synthadoc audit history -n 100 -w my-wiki     # last 100 records
synthadoc audit history --json -w my-wiki     # raw JSON for scripting

# Token usage: totals + daily breakdown (cost was always $0.0000 in v0.1; per-model cost tracking is live from v0.2.0)
synthadoc audit cost -w my-wiki               # last 30 days
synthadoc audit cost --days 7 -w my-wiki      # last 7 days

# Audit events: contradictions found, auto-resolutions, cost gate triggers
synthadoc audit events -w my-wiki             # last 100 events
synthadoc audit events --json -w my-wiki      # raw JSON for scripting

# Claim citations: source-line provenance for every annotated claim
synthadoc audit citations -w my-wiki                    # all citations (last 50)
synthadoc audit citations --page alan-turing -w my-wiki # citations for one page
synthadoc audit citations --source turing.pdf -w my-wiki # citations from one source
synthadoc audit citations --broken -w my-wiki           # validation failures only
synthadoc audit citations --json -w my-wiki             # raw JSON for scripting

Scheduling recurring jobs

# Register a nightly ingest
synthadoc schedule add --op "ingest --batch raw_sources/" --cron "0 2 * * *" -w my-wiki

# Weekly lint
synthadoc schedule add --op "lint" --cron "0 3 * * 0" -w my-wiki

# List scheduled jobs
synthadoc schedule list -w my-wiki

# Remove a scheduled job
synthadoc schedule remove <id> -w my-wiki

Routing

ROUTING.md maps wiki branches to page slugs so queries and ingest jobs are scoped to the relevant section of the wiki. Create it once from your existing index.md, then let Synthadoc maintain it automatically as new pages are added.

# Bootstrap ROUTING.md from current index.md branch structure (run once)
synthadoc routing init -w my-wiki

# Report dangling slugs (pages listed in ROUTING.md that no longer exist)
synthadoc routing validate -w my-wiki

# Auto-remove dangling slugs from ROUTING.md
synthadoc routing clean -w my-wiki
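
The validate and clean steps above amount to a set-membership check. The sketch below is illustrative only and is not Synthadoc's implementation. The on-disk format of ROUTING.md is not shown in this section, so the routing table is modelled as a plain dict from branch name to page slugs, and the names `find_dangling` and `clean_routing` are hypothetical.

```python
def find_dangling(routing: dict[str, list[str]], existing_pages: set[str]) -> dict[str, list[str]]:
    """Return, per branch, the slugs that no longer have a matching page."""
    dangling = {}
    for branch, slugs in routing.items():
        missing = [slug for slug in slugs if slug not in existing_pages]
        if missing:
            dangling[branch] = missing
    return dangling


def clean_routing(routing: dict[str, list[str]], existing_pages: set[str]) -> dict[str, list[str]]:
    """Drop dangling slugs while preserving branch order and slug order."""
    return {
        branch: [slug for slug in slugs if slug in existing_pages]
        for branch, slugs in routing.items()
    }
```

Keeping the report step separate from the clean step mirrors the two commands: one can inspect what would be removed before anything is rewritten.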

Candidates staging

When staging is enabled, ingest writes new pages to wiki/candidates/ for human review instead of the main wiki. Useful when you want to approve AI-generated pages before they become canonical.

# Show current staging policy
synthadoc staging policy -w my-wiki

# Route all new pages to staging (review everything)
synthadoc staging policy all -w my-wiki

# Only stage pages below a confidence threshold (auto-promote high-confidence)
synthadoc staging policy threshold --min-confidence high -w my-wiki

# Turn staging off (pages go directly to wiki/)
synthadoc staging policy off -w my-wiki

# List candidate pages awaiting review
synthadoc candidates list -w my-wiki

# Promote a specific page (moves it from candidates/ to wiki/)
synthadoc candidates promote my-page-slug -w my-wiki

# Promote all candidates at once
synthadoc candidates promote --all -w my-wiki

# Discard a specific candidate
synthadoc candidates discard my-page-slug -w my-wiki

# Discard all candidates
synthadoc candidates discard --all -w my-wiki
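
The three staging policies can be read as a small decision function. This is a sketch under stated assumptions, not the engine's code: the function name `destination` is hypothetical, and the confidence scale `low < medium < high` is assumed (only `high` appears in the commands above).

```python
# Assumption: an ordered confidence scale. Only "high" is named in the docs.
CONFIDENCE_ORDER = {"low": 0, "medium": 1, "high": 2}


def destination(policy: str, page_confidence: str, min_confidence: str = "high") -> str:
    """Return the directory a newly ingested page would be written to."""
    if policy == "off":
        return "wiki/"  # staging disabled: pages go directly to the wiki
    if policy == "all":
        return "wiki/candidates/"  # review everything
    if policy == "threshold":
        below = CONFIDENCE_ORDER[page_confidence] < CONFIDENCE_ORDER[min_confidence]
        # Pages below the threshold are staged; the rest are auto-promoted.
        return "wiki/candidates/" if below else "wiki/"
    raise ValueError(f"unknown staging policy: {policy}")
```

Under this reading, `threshold --min-confidence high` stages everything except pages the engine rates as high confidence.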

Context packs

A context pack decomposes a goal into sub-questions, runs parallel BM25 searches, and packs the highest-scoring excerpts into a single cited Markdown document within a token budget.

Typical use cases:

Paste into an external LLM chat (Claude.ai, ChatGPT) as grounded context before asking a question
Save next to a document you are writing as a cited research brief
Pipe into another CLI tool that reads Markdown

# Print to terminal — inspect, copy, or pipe
synthadoc context build "How did transistors change computing?" -w my-wiki

# Copy to clipboard and paste into an LLM chat (macOS)
synthadoc context build "early computing pioneers" -w my-wiki | pbcopy

# Custom token budget (default 4000)
synthadoc context build "Early programming languages" --tokens 8000 -w my-wiki

# Save next to a document you are writing
synthadoc context build "Rise of microprocessors" --output ~/drafts/computing-brief.md -w my-wiki
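
The packing step described at the start of this section can be pictured as a greedy selection under a token budget. The sketch below is illustrative only and is not Synthadoc's implementation: the `Excerpt` type, the `pack_excerpts` name, and the whitespace-based token estimate are assumptions, and the relevance scores are taken as given inputs rather than computed with BM25.

```python
from dataclasses import dataclass


@dataclass
class Excerpt:
    page: str     # slug of the wiki page the excerpt came from
    text: str     # excerpt body
    score: float  # relevance score, e.g. produced by a BM25 search


def estimate_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    # A real tokenizer would give different counts.
    return len(text.split())


def pack_excerpts(excerpts: list[Excerpt], budget: int = 4000) -> str:
    """Greedily keep the highest-scoring excerpts that fit within `budget` tokens."""
    chosen = []
    used = 0
    for ex in sorted(excerpts, key=lambda e: e.score, reverse=True):
        cost = estimate_tokens(ex.text)
        if used + cost > budget:
            continue  # would overflow the budget; a smaller excerpt may still fit
        chosen.append(ex)
        used += cost
    # Each excerpt keeps a pointer to its source page so the pack stays cited.
    return "\n\n".join(f"{ex.text}\n(source: {ex.page})" for ex in chosen)
```

Greedy selection is not optimal for every input, but it is simple and favours the most relevant excerpts first, which matches the "highest-scoring excerpts" wording above.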

Removing a wiki

Stop the server for that wiki before uninstalling — the serve process must not be running when the directory is deleted.

# Stop the background server (PID is in <wiki-root>/.synthadoc/server.pid)
kill $(cat ~/wikis/my-wiki/.synthadoc/server.pid)          # Linux / macOS
taskkill /PID <pid> /F                                      # Windows

# Then uninstall — two-step confirmation required, no --yes escape
synthadoc uninstall my-wiki

For Obsidian plugin commands see Appendix A — Obsidian Plugin Command Reference in the Quick-Start Guide.

Administrative Reference

Health and status

# Wiki statistics: pages, queue depth, cache hit rate
synthadoc status -w my-wiki

# Liveness probe (useful in scripts and monitoring)
# Port is per-wiki — check [server] port in <wiki-root>/.synthadoc/config.toml
# Default is 7070; each additional wiki uses its own port (7071, 7072, …)
curl http://127.0.0.1:7070/health

Expected status output:

Wiki:         /home/user/wikis/my-wiki
Pages:        34
Jobs pending: 0
Jobs total:   12
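
For scripting, the status output above can be turned into a dictionary. This is a minimal sketch that assumes the `Key: value` layout shown here stays stable; `parse_status` is a hypothetical helper, not part of Synthadoc.

```python
def parse_status(output: str) -> dict[str, str]:
    """Parse `Key:   value` lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields
```

Splitting on the first colon only keeps values that themselves contain colons (for example a Windows path) intact. A script could then poll until `int(parse_status(text)["Jobs pending"])` reaches zero.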

Logs

Synthadoc writes three log artefacts per wiki:

| File | Location | Format | Use |
|---|---|---|---|
| `log.md` | `<wiki-root>/log.md` | Human-readable Markdown | Read inside Obsidian; shows every ingest, contradiction, lint event |
| `synthadoc.log` | `<wiki-root>/.synthadoc/logs/` | JSON lines (rotating) | Structured debug/ops log; grep or pipe to jq |
| `audit.db` | `<wiki-root>/.synthadoc/audit.db` | SQLite (append-only) | Source hashes, cost records, job history |

Tailing the JSON log:

# Tail and pretty-print with jq
tail -f .synthadoc/logs/synthadoc.log | jq .

# Filter to errors only
tail -f .synthadoc/logs/synthadoc.log | jq 'select(.level == "ERROR")'

# Filter to a specific job
# job_id is present only on records logged in job context (ingest/lint workers)
tail -f .synthadoc/logs/synthadoc.log | jq 'select(.job_id == "abc123")'
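
Where jq is not available, the same filters can be approximated in Python. This is a sketch, not a shipped utility: it relies only on the `level` and `job_id` fields used in the jq filters above, and `filter_records` is a hypothetical name.

```python
import json


def filter_records(lines, level=None, job_id=None):
    """Yield parsed JSON-lines records matching the given level and/or job_id."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial lines, e.g. one caught mid-write
        if not isinstance(record, dict):
            continue
        if level is not None and record.get("level") != level:
            continue
        # job_id is absent on records logged outside a job context
        if job_id is not None and record.get("job_id") != job_id:
            continue
        yield record
```

Any iterable of lines works as input, so an open file handle on `.synthadoc/logs/synthadoc.log` can be passed directly, for example `filter_records(f, level="ERROR")`.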

Log rotation: When synthadoc.log reaches max_file_mb, it is renamed to synthadoc.log.1; the previous .1 becomes .2; files beyond backup_count are deleted. Total disk ≈ max_file_mb × (backup_count + 1).
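
The rotation rule and the disk estimate can be checked with a small simulation. This is a sketch of the scheme as described above, not the engine's code; the function names are hypothetical and the numbers in the usage note are example values, not Synthadoc defaults.

```python
def max_log_disk_mb(max_file_mb: int, backup_count: int) -> int:
    """Upper bound on disk use: the active file plus `backup_count` backups."""
    return max_file_mb * (backup_count + 1)


def rotate(existing: set[str], backup_count: int, base: str = "synthadoc.log") -> set[str]:
    """Return the file names present after one rotation of `base`."""
    after = {base}  # a fresh active log is started after rotation
    if base in existing and backup_count >= 1:
        after.add(f"{base}.1")  # the active file becomes .1
    for n in range(1, backup_count):
        if f"{base}.{n}" in existing:
            after.add(f"{base}.{n + 1}")  # .1 becomes .2, and so on
    # Files numbered beyond backup_count are never re-added, i.e. they are deleted.
    return after
```

For example, with `max_file_mb = 10` and `backup_count = 5`, `max_log_disk_mb(10, 5)` gives 60 MB: one active file and five backups of at most 10 MB each.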

Changing log level: Edit [logs] level in .synthadoc/config.toml and restart synthadoc serve for the change to take effect. Alternatively, pass --verbose to get DEBUG for one session without editing config.

Audit trail

synthadoc audit history -w my-wiki          # table: timestamp, source file, wiki page, tokens, cost
synthadoc audit history -n 100 -w my-wiki   # last 100 records (default 50)
synthadoc audit history --json -w my-wiki   # raw JSON for scripting

synthadoc audit cost -w my-wiki             # total tokens + daily breakdown, last 30 days
synthadoc audit cost --days 7 -w my-wiki    # weekly view
synthadoc audit cost --json -w my-wiki      # {total_tokens, total_cost_usd, daily: [...]}

synthadoc audit events -w my-wiki           # table: timestamp, job_id, event type, metadata
synthadoc audit events --json -w my-wiki    # raw JSON

synthadoc audit citations -w my-wiki                     # all claim citations (last 50)
synthadoc audit citations --page alan-turing -w my-wiki  # citations for one page
synthadoc audit citations --source turing.pdf -w my-wiki # citations from one source file
synthadoc audit citations --broken -w my-wiki            # validation failures only
synthadoc audit citations --json -w my-wiki              # raw JSON for scripting

Note: Per-model cost tracking is live from v0.2.0 — pricing tables cover all 7 API providers. Token counts and USD cost are recorded for every ingest and query operation in audit.db.
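
The `--json` output of `audit cost` lends itself to simple derived metrics. The sketch below assumes only the top-level keys named in the comment above (`total_tokens`, `total_cost_usd`); the structure of the `daily` entries is not specified in this section, so it is left untouched. `cost_per_1k_tokens` is a hypothetical helper.

```python
import json


def cost_per_1k_tokens(raw_json: str) -> float:
    """Compute the blended USD cost per 1,000 tokens from an audit cost report."""
    report = json.loads(raw_json)
    tokens = report["total_tokens"]
    if tokens == 0:
        return 0.0  # avoid division by zero on an empty reporting window
    return report["total_cost_usd"] / tokens * 1000
```

The input string would typically be the captured standard output of `synthadoc audit cost --json -w my-wiki`.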

Cache management

# Remove all cached LLM responses
# Output: "Cache cleared: N entries removed."
synthadoc cache clear -w my-wiki

Cache invalidation happens automatically when:

A source file’s SHA-256 hash changes (content changed)
CACHE_VERSION is bumped in core/cache.py (after prompt template edits)
--force is passed to ingest
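
One way to picture how these three rules interact is a cache key derived from the content hash and a version number. How Synthadoc actually composes its keys is not shown here, so treat this as a hedged sketch: `cache_key` and `lookup` are hypothetical, and the `CACHE_VERSION` value is a stand-in for the real constant in core/cache.py.

```python
import hashlib

CACHE_VERSION = 1  # stand-in value; the real constant lives in core/cache.py


def cache_key(source_bytes: bytes, cache_version: int = CACHE_VERSION) -> str:
    """Derive a key that changes when the content or the cache version changes."""
    content_hash = hashlib.sha256(source_bytes).hexdigest()
    return f"v{cache_version}:{content_hash}"


def lookup(cache: dict, source_bytes: bytes, force: bool = False,
           cache_version: int = CACHE_VERSION):
    """Return a cached response, or None when a fresh LLM call is needed."""
    if force:
        return None  # --force bypasses the cache entirely
    return cache.get(cache_key(source_bytes, cache_version))
```

With a key of this shape no explicit invalidation pass is needed: an edited file or a bumped version simply produces a key that has no entry yet.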

OpenTelemetry integration

By default, traces and metrics are written to <wiki-root>/.synthadoc/logs/traces.jsonl. To send to any OTLP backend (Jaeger, Grafana Tempo, Honeycomb, Datadog):

# ~/.synthadoc/config.toml
[observability]
exporter      = "otlp"
otlp_endpoint = "http://localhost:4317"

Debugging

# Start server with DEBUG console logging
synthadoc serve -w my-wiki --verbose

# Check for configuration problems
synthadoc status -w my-wiki     # prints pre-flight warnings

# View recent job failures
synthadoc jobs list --status failed -w my-wiki
synthadoc jobs status <job-id> -w my-wiki    # shows error message + traceback

# Force a re-ingest to rule out cache issues
synthadoc ingest --force problem.pdf -w my-wiki

Understanding Logs and the Audit Trail

Synthadoc writes three log artefacts per wiki: log.md (human-readable Markdown, open in Obsidian), synthadoc.log (JSON lines, rotate-by-size, grep with jq), and audit.db (append-only SQLite — source hashes, cost records, job history).

For the full field reference, log levels, rotation config, OTel integration, and audit query examples see docs/design.md — Logs and Audit Trail.

Customization

Custom skills (new file formats)

Subclass BaseSkill (Apache-2.0 — no AGPL obligation on your skill code), drop the file in <wiki-root>/skills/ or ~/.synthadoc/skills/, and Synthadoc hot-loads it on the next ingest. Skills can match by file extension or intent prefix (supports any Unicode text, including Chinese/Japanese/Arabic prefixes).

Custom LLM providers

Subclass LLMProvider from synthadoc/providers/base.py (Apache-2.0) and place it in ~/.synthadoc/providers/ or the wiki providers/ directory.

Hooks

Hooks are shell commands (written in any language) that fire on the on_ingest_complete and on_lint_complete events. Each hook receives a JSON context on stdin. Set blocking = true to gate the operation on the hook’s exit code.
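
A hook written in Python might look like the sketch below. The schema of the JSON context is not documented in this section, so the `contradictions` key is purely hypothetical and stands in for whatever field a real hook would inspect. The convention that a non-zero exit code signals failure is likewise an assumption.

```python
import json


def run_hook(stdin_text: str) -> int:
    """Return the exit code for a hook, given the JSON context it received on stdin."""
    try:
        context = json.loads(stdin_text)
    except json.JSONDecodeError:
        return 2  # malformed input
    if not isinstance(context, dict):
        return 2
    # "contradictions" is a hypothetical key used only for illustration.
    if context.get("contradictions", 0) > 0:
        return 1  # signal failure so a blocking hook can gate the operation
    return 0
```

A real hook script would finish with `sys.exit(run_hook(sys.stdin.read()))`. Keeping the decision logic in a pure function makes it easy to test without running the engine.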

Cache

Synthadoc uses three cache layers: embedding, LLM response, and provider prompt cache. Cached entries are invalidated automatically when a source file changes (detected by its SHA-256 hash). Force a fresh call with --force, or wipe all cached responses with synthadoc cache clear -w my-wiki.

Per-wiki AGENTS.md

Edit <wiki-root>/AGENTS.md to give the LLM domain-specific instructions: terminology, page naming conventions, and what to cross-reference. This file is the highest-priority instruction source for every agent run against this wiki.

For full examples, API signatures, and intent-dispatch config see docs/design.md — Customization.

