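@GitHub_Daily: You want to feed web page content to an AI, but the fetch comes back full of navigation bars, ads, and garbled text, wasting most of the context window while the AI still cannot make sense of it. So I found PullMD, an open-source project that extracts the content of any web page and converts it into a clean Markdown file. Just provide the page link; it automatically identifies the page type and extracts, layer by layer…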

X AI KOLs Timeline · Tools

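Summary

PullMD is an open-source URL-to-Markdown service that automatically extracts a page's main content, strips navigation, ads, and other clutter, and supports a headless browser plus several interfaces (web UI, REST API, MCP), so AI tools and users can get clean web text.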

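You want to feed web page content to an AI, but the fetch comes back full of navigation bars, ads, and garbled text, wasting most of the context window while the AI still cannot make sense of it.

So I found PullMD, an open-source project that extracts the content of any web page and converts it into a clean Markdown file.

Just provide the page link: it automatically identifies the page type, extracts the main content layer by layer, and starts a headless browser on its own when it meets a JavaScript-rendered page. The output is clean and tidy.

GitHub: http://github.com/AeternaLabsHQ/pullmd…

It offers three ways to use it (web interface, REST API, and MCP server), so tools such as Claude Code and Codex can connect directly.

If you often need an AI to read web pages, or want to host a clean read-it-later service for yourself, it is worth a try.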

Cached at: 2026/05/25 10:49


AeternaLabsHQ/pullmd

Source: https://github.com/AeternaLabsHQ/pullmd

PullMD


Self-hosted URL-to-Markdown service for humans and AI agents.

[Screenshot: PullMD web interface]

PullMD takes any web URL and returns clean, readable Markdown — no navigation, no ads, no boilerplate. It auto-detects Reddit threads (with full comment trees), uses Cloudflare’s native Markdown when available, runs Mozilla Readability + Trafilatura on static HTML, and as a last resort renders JavaScript-heavy pages via headless Chromium (Playwright sidecar) before extracting.

It ships as:

  • a PWA frontend with raw/rendered view toggle, dark/paper themes, history, archive, share links
  • a REST API at GET /api?url=…
  • an MCP server at POST /mcp (Streamable-HTTP transport, stateless)
  • a Claude Code skill as a downloadable zip

Every conversion gets an 8-hex share id that works as a stable live-endpoint: GET /s/:id returns the cached markdown and re-fetches from the source if older than one hour. Use the share id as a fixed URL that always returns fresh content — useful for subreddit feeds and similar.
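As a small illustration of using a share id as a fixed URL, the sketch below builds the /s/:id address for an instance. It assumes share ids are exactly eight lowercase hex characters (the text only says "8-hex"), and the helper name share_url is hypothetical, not part of PullMD.

```python
import re

# Assumption: a share id is exactly 8 lowercase hexadecimal characters.
SHARE_ID_RE = re.compile(r"[0-9a-f]{8}")


def share_url(base_url: str, share_id: str) -> str:
    """Build the live-endpoint URL (GET /s/:id) for a share id."""
    if not SHARE_ID_RE.fullmatch(share_id):
        raise ValueError(f"not an 8-hex share id: {share_id!r}")
    return f"{base_url.rstrip('/')}/s/{share_id}"


if __name__ == "__main__":
    # The id below is made up for illustration.
    print(share_url("http://localhost:3000", "a1b2c3d4"))
```

Validating the id before building the URL keeps a typo from turning into a request against some other path on the instance.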


Quick start

Pre-built multi-arch images (linux/amd64, linux/arm64) live on Docker Hub. Drop the compose file somewhere and run:

mkdir pullmd && cd pullmd
curl -O https://raw.githubusercontent.com/AeternaLabsHQ/pullmd/main/docker-compose.yml
docker compose up -d
# → http://localhost:3000

That’s it. No .env needed: every variable has a sensible default and PullMD listens on port 3000. Add a .env next to the compose file to override anything (see Configuration).

docker-compose.yml (zero-config)

services:
  pullmd:
    image: aeternalabshq/pullmd:latest
    container_name: pullmd
    restart: unless-stopped
    ports:
      - "${PORT:-3000}:3000"
    environment:
      - PUBLIC_URL=${PUBLIC_URL:-http://localhost:${PORT:-3000}}
      - TRAFILATURA_URL=http://trafilatura:8001/extract
      - PLAYWRIGHT_URL=http://playwright:8002/render
      - REDDIT_CLIENT_ID=${REDDIT_CLIENT_ID:-}
      - REDDIT_CLIENT_SECRET=${REDDIT_CLIENT_SECRET:-}
      - REDDIT_USER_AGENT=${REDDIT_USER_AGENT:-}
    volumes:
      - ./data:/data
    networks:
      - pullmd-internal
    depends_on:
      - trafilatura
      - playwright

  trafilatura:
    image: aeternalabshq/pullmd-trafilatura:latest
    container_name: pullmd-trafilatura
    restart: unless-stopped
    networks:
      - pullmd-internal

  playwright:
    image: aeternalabshq/pullmd-playwright:latest
    container_name: pullmd-playwright
    restart: unless-stopped
    networks:
      - pullmd-internal

networks:
  pullmd-internal:
    driver: bridge

Note: the Playwright sidecar adds ~3.7 GB to your image cache (Chromium + Firefox + WebKit binaries from the official Playwright base image). It’s optional — leave PLAYWRIGHT_URL unset and the playwright service block off, and PullMD silently degrades to static extraction with a fallback note in the metadata.

Mirror on GHCR: ghcr.io/aeternalabshq/{pullmd,pullmd-trafilatura,pullmd-playwright}. Replace the image: lines if you prefer GitHub’s registry.

Behind Traefik

For deployments behind Traefik with TLS, use docker-compose.traefik.yml instead. Same images, but with Traefik labels and the proxy external network. Set HOST_DOMAIN in .env:

curl -O https://raw.githubusercontent.com/AeternaLabsHQ/pullmd/main/docker-compose.traefik.yml
echo "HOST_DOMAIN=pullmd.example.com" > .env
docker compose -f docker-compose.traefik.yml up -d

Local development (no Docker)

git clone https://github.com/AeternaLabsHQ/pullmd.git
cd pullmd
npm install
npm start             # http://localhost:3000
npm test              # node --test

Configuration

All variables go in .env (copy from .env.example):

| Variable | Required | Purpose |
| --- | --- | --- |
| HOST_DOMAIN | yes | Public hostname without scheme. Used by Traefik routing and as fallback for PUBLIC_URL. |
| PUBLIC_URL | no | Full public origin embedded in /help and the skill zip. Defaults to https://${HOST_DOMAIN}. |
| TRAFILATURA_URL | no | URL of the Trafilatura sidecar’s /extract endpoint. Unset → skip Trafilatura, Readability only. |
| PLAYWRIGHT_URL | no | URL of the Playwright sidecar’s /render endpoint. Unset → skip Playwright fallback for JS pages. |
| REDDIT_CLIENT_ID | no | OAuth credentials for Reddit. Without them, PullMD uses the public JSON API (lower rate limit). |
| REDDIT_CLIENT_SECRET | no | See REDDIT_CLIENT_ID. |
| REDDIT_USER_AGENT | no | Reddit requires a unique UA. Default: PullMD/1.0 (URL-to-Markdown service). |
| DISABLE_PUBLIC_HISTORY | no | When true, hides the global recent-conversions list and archive (/api/history + /api/archive return 403, frontend hides the section). /s/:id share links keep working. Default: false. |
| PULLMD_USER_AGENT | no | Pin a single outbound User-Agent for every web fetch. Disables rotation. Useful for CI or when one specific UA is known to work. |
| PULLMD_UA_FEED_URL | no | URL of a JSON feed of current real-world UAs. Default: WinFuture23/real-world-user-agents. Set to an empty string to disable live refresh and rely on the built-in seed pool. |
| PULLMD_AUTH_MODE | no | disabled (default) / single-admin / multi-user. See “Authentication” below. |
| PULLMD_ADMIN_EMAIL | required when AUTH_MODE != disabled, on first startup | Bootstrap email for the first admin user. |
| PULLMD_ADMIN_PASSWORD | required when AUTH_MODE != disabled, on first startup | Bootstrap password (min 8 chars). |
| PULLMD_AUTH_TOKEN | no | Legacy bearer token compat (single-admin mode only, deprecated). |

PUBLIC_URL matters for self-hosting: the help page and downloadable skill embed it as the canonical endpoint. Set it correctly and your users get a copy-paste setup that points at your instance.

PullMD rotates its outbound User-Agent for the web fetch path from a pool of current desktop browsers, refreshed every 48 hours from a live feed of real-world UAs maintained by @WinFuture23. A built-in seed pool ensures rotation works even when the feed is unreachable. Set PULLMD_USER_AGENT to pin a single UA, or PULLMD_UA_FEED_URL to point at your own feed. The Reddit path keeps its dedicated REDDIT_USER_AGENT because Reddit’s API expects a stable, identifying UA.
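The rotation behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not PullMD's implementation: the class name UserAgentPool, the fetch_feed callable, and the idea that the feed yields a plain list of UA strings are all hypothetical. Only the 48-hour refresh, the seed-pool fallback, and the pinning option come from the text.

```python
import random
import time

REFRESH_INTERVAL_S = 48 * 3600  # refresh interval stated in the README


class UserAgentPool:
    """Hypothetical sketch: rotate UAs from a pool refreshed from a live feed."""

    def __init__(self, seed_pool, fetch_feed=None, pinned=None,
                 clock=time.time, rng=None):
        self._pool = list(seed_pool)    # built-in seed pool, used until a refresh succeeds
        self._fetch_feed = fetch_feed   # assumed: callable returning a list of UA strings
        self._pinned = pinned           # plays the role of PULLMD_USER_AGENT
        self._clock = clock
        self._rng = rng or random.Random()
        self._last_attempt = None

    def _maybe_refresh(self):
        if self._fetch_feed is None:    # live refresh disabled
            return
        now = self._clock()
        if self._last_attempt is not None and now - self._last_attempt < REFRESH_INTERVAL_S:
            return
        self._last_attempt = now        # failed attempts also wait for the next interval
        try:
            fresh = [ua for ua in self._fetch_feed() if ua]
        except Exception:
            return                      # feed unreachable: keep the current pool
        if fresh:
            self._pool = fresh

    def pick(self):
        if self._pinned:                # pinning disables rotation entirely
            return self._pinned
        self._maybe_refresh()
        return self._rng.choice(self._pool)
```

Injecting the clock and the feed function keeps the sketch testable without network access or real waiting.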

DISABLE_PUBLIC_HISTORY=true is the privacy switch for shared instances (multi-tenant VPS, office deployments). Conversions still get cached and assigned share IDs; users just can’t see what other users have fetched. Anyone with a known /s/:id link still gets their markdown back. Use this as a stopgap until per-user scoping lands.


Authentication (v2.0+)

Pulling v2.x: Use the explicit :2 tag (or :2.0, :2.0.0). The :latest tag remains on v1.x for backward compatibility until v2.x has stabilized in real-world deployments.

services:
  pullmd:
    image: aeternalabshq/pullmd:2

PullMD ships with three auth modes. Pick one with PULLMD_AUTH_MODE:

| Mode | Behavior |
| --- | --- |
| disabled | Default. No auth, everything open. Existing v1.x behavior. |
| single-admin | One user, credentials from env vars. No self-signup. For homelab. |
| multi-user | Self-signup at /signup, login at /login, per-user data isolation. |

In single-admin and multi-user modes, PULLMD_ADMIN_EMAIL + PULLMD_ADMIN_PASSWORD bootstrap the first admin user on first startup. After that, changing these env vars does not change the password — use the admin CLI:

docker compose exec pullmd node scripts/admin.js reset-password <email>

Auth boundary

| Endpoint | Auth required (when mode != disabled) |
| --- | --- |
| /, /help, static assets, /web-reader.zip | no |
| /login, /signup, /api/me (auth surface) | no |
| /s/:id (share links) | no |
| /api, /api/stream | yes |
| /mcp | yes |
| /api/history, /api/archive | yes |
| /api/cache/:id, DELETE /api/cache | yes |
| /api/stats, /api/storage, /api/config (aggregate) | no |

Authentication paths

  1. Session cookies: POST /login sets pullmd_session (HttpOnly, SameSite=Lax, Secure over HTTPS, 7-day TTL with sliding expiry). The PWA uses this automatically.
  2. API keys: generate at /settings, send via Authorization: Bearer pmd_<32-char-base62>. Stored as SHA-256 hashes; only shown once at creation.
  3. Legacy PULLMD_AUTH_TOKEN: deprecated. single-admin mode only. Maps to admin user. Kept for migration compatibility, removed in v3.0.
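To make the API-key path concrete, here is a minimal sketch of how a key in the pmd_<32-char-base62> format could be generated, stored as a SHA-256 hash, and checked against an Authorization header. The function names are hypothetical and the storage is just an in-memory set; PullMD's actual code may differ.

```python
import hashlib
import hmac
import secrets
import string

BASE62 = string.digits + string.ascii_letters  # 62 characters


def generate_api_key() -> str:
    """Create a key shaped like pmd_<32-char-base62> (shown to the user once)."""
    return "pmd_" + "".join(secrets.choice(BASE62) for _ in range(32))


def hash_api_key(key: str) -> str:
    """Only this SHA-256 hex digest is stored, never the key itself."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def verify_bearer(header_value: str, stored_hashes) -> bool:
    """Check an 'Authorization: Bearer pmd_...' header value against stored hashes."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Bearer" or not token.startswith("pmd_"):
        return False
    candidate = hash_api_key(token)
    # compare_digest avoids leaking match position through timing.
    return any(hmac.compare_digest(candidate, h) for h in stored_hashes)
```

Because only the hash is kept, a leaked database does not reveal usable keys, which is also why a key can be displayed only at creation time.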

Migration from v1.x

See MIGRATION.md for the full upgrade checklist. The TL;DR: leave PULLMD_AUTH_MODE unset and v2.0 behaves exactly like v1.x.

OAuth 2.1 (claude.ai Web Connector)

PullMD ships with a full OAuth 2.1 Authorization Code flow so the claude.ai web app’s Custom Connector feature can authenticate users against your PullMD instance. All endpoints needed by the spec are implemented: Dynamic Client Registration (RFC 7591), PKCE-S256 (RFC 7636), Authorization Server Metadata (RFC 8414), Protected Resource Metadata (RFC 9728), and Token Revocation (RFC 7009).
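For readers unfamiliar with PKCE-S256, the client-side step defined in RFC 7636 can be sketched in a few lines of Python: the code challenge is the unpadded base64url encoding of the SHA-256 digest of the code verifier. The function names are illustrative and not part of PullMD.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes: int = 32) -> str:
    """Random verifier; 32 random bytes encode to 43 base64url characters."""
    return base64.urlsafe_b64encode(secrets.token_bytes(n_bytes)).rstrip(b"=").decode("ascii")


def s256_challenge(verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The client sends the challenge with the authorization request and the verifier with the token request; the server recomputes the challenge and compares the two.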

Setup:

  1. Set PULLMD_AUTH_MODE to single-admin or multi-user (OAuth requires Phase-1 auth).
  2. Set OAUTH_JWT_SECRET to a 32+ character random string (openssl rand -hex 32).
  3. Set PUBLIC_URL to your instance’s public origin (e.g. https://pullmd.example.com).
  4. In claude.ai → Settings → Connectors → Add custom connector, point it at https://pullmd.example.com/mcp — claude.ai discovers everything else automatically via the well-known endpoints.
  5. The first time the user clicks the connector, they’ll be redirected to PullMD’s /login, then to a consent screen, then back to claude.ai.

Tokens:

  • Access tokens are JWTs (HS256), TTL 1 hour, audience-bound to your /mcp URL.
  • Refresh tokens are opaque (pmd_rt_…), TTL 30 days, rotated on every refresh, with reuse-detection that invalidates the entire refresh chain on replay.
  • Revoke a token via POST /oauth/revoke (RFC 7009).
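Refresh-token rotation with reuse detection can be pictured with the in-memory sketch below. It is an assumption-laden illustration, not PullMD's code: the class and method names are hypothetical, the part of the token after the pmd_rt_ prefix is an arbitrary random string, and a real server would persist this state.

```python
import secrets
import time

REFRESH_TTL_S = 30 * 24 * 3600  # 30 days, as stated above


class RefreshTokenStore:
    """Hypothetical sketch: each refresh issues a new token; a replay kills the chain."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._tokens = {}            # token -> {"chain", "used", "expires"}
        self._revoked_chains = set()

    def issue(self, chain_id=None):
        chain_id = chain_id or secrets.token_hex(8)
        token = "pmd_rt_" + secrets.token_urlsafe(32)  # suffix format is an assumption
        self._tokens[token] = {
            "chain": chain_id,
            "used": False,
            "expires": self._clock() + REFRESH_TTL_S,
        }
        return token

    def rotate(self, token):
        """Return a new refresh token, or None if the old one is rejected."""
        rec = self._tokens.get(token)
        if rec is None or rec["chain"] in self._revoked_chains:
            return None
        if rec["used"]:
            # Replay of an already-rotated token: invalidate the whole chain.
            self._revoked_chains.add(rec["chain"])
            return None
        if self._clock() >= rec["expires"]:
            return None
        rec["used"] = True
        return self.issue(rec["chain"])
```

Revoking the whole chain on replay means a stolen refresh token stops working as soon as either the attacker or the legitimate client uses it a second time.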

Scope: Currently a single mcp:full scope (URL conversion + history read). Granular scopes are tracked for a future minor release.

Issues #6 and #10 track this work and close on the v2.1.0 release.


AI-agent integration

Three install paths. Once your instance is running, ${PULLMD_URL}/help shows the same boxes with your URL pre-filled. Replace ${PULLMD_URL} below with your hostname (e.g. https://pullmd.example.com).

1. Universal prompt

Drop into any chat agent (ChatGPT, Claude, Gemini, …):

When you need to read a web page, fetch via PullMD instead of raw HTML:

  GET ${PULLMD_URL}/api?url=<URL>

Returns clean Markdown (text/markdown). Optional query params:

  comments=false        skip Reddit comments
  comment_depth=N       comment nesting depth (default 3)
  frontmatter=true      prepend YAML metadata block
  format=text           strip Markdown, return plain text
  nocache=true          bypass the 1h cache and refetch
  render=force|skip     override the auto Playwright fallback
  lang=de|en            language for the comments section header

Response headers worth checking:
  X-Source       reddit | cloudflare | readability | playwright
  X-Quality      0.0-1.0 extraction confidence
  X-Share-Id     8-hex permalink, openable as /s/<id>

Reddit URLs are auto-detected (incl. redd.it short links and /s/ shares).
Use this whenever you would otherwise fetch raw HTML — the markdown is
much cleaner and saves significant context window space.

2. Claude Code skill

web-reader.zip is auto-built with your URL embedded:

curl -O ${PULLMD_URL}/web-reader.zip
mkdir -p ~/.claude/skills
unzip web-reader.zip -d ~/.claude/skills/
# Restart Claude Code; the skill activates on web-reading requests.

3. MCP server

Remote MCP server at ${PULLMD_URL}/mcp (Streamable-HTTP transport, stateless). Three tools: read_url, get_share, list_recent. Server-side updates reach every client automatically — no local install needed.

Claude Code (CLI):

claude mcp add --transport http pullmd ${PULLMD_URL}/mcp

Claude Desktop / Cursor / other MCP hosts — JSON config:

{
  "mcpServers": {
    "pullmd": {
      "type": "http",
      "url": "${PULLMD_URL}/mcp"
    }
  }
}

Once registered, the three tools surface natively in the agent — no prompt instructions needed, the LLM picks them up via their schema descriptions.
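For debugging without an MCP host, a tool call can also be assembled by hand. The sketch below builds the JSON-RPC 2.0 body for the MCP tools/call method and the headers the Streamable-HTTP transport expects. The argument name url for read_url is an assumption (the tool schemas are not reproduced here), and whether an initialize handshake must precede the call depends on the server.

```python
import json
from typing import Optional


def tools_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 'tools/call' request as used by MCP."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


def mcp_headers(api_key: Optional[str] = None) -> dict:
    """Headers for POST /mcp; the bearer key is only needed when auth is enabled."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


if __name__ == "__main__":
    # Assumed argument name "url"; check the tool schema on your instance.
    print(tools_call("read_url", {"url": "https://example.com"}))
```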

MCP client compatibility (updated for v2.0)

| Client | Bearer (Authorization: Bearer pmd_...) | OAuth | Notes |
| --- | --- | --- | --- |
| Claude Code CLI | | | Recommended. Generate a key at /settings. |
| Cursor | | | Same as CLI. |
| Claude Desktop | | (#6) | UI lacks header field. Phase 2 OAuth. |
| claude.ai (web) | | (#6) | Web requires OAuth. Phase 2. |

For Phase 1, Claude Desktop / claude.ai users still need the OAuth/proxy workaround documented in #10. Phase 2 (#6) layers OAuth on top of this user system.

Claude Desktop limitation

The Claude Desktop “Add custom connector” UI accepts URL + OAuth Client ID/Secret but no custom-header field. Additionally, claude_desktop_config.json entries with "type": "http" are silently rewritten to {} after Desktop launches (current Desktop only honors stdio servers in that file).

Until OAuth support lands (see #6), the practical workaround for Claude Desktop users is a reverse proxy that accepts the auth token as either a bearer header (for CLI) or as a URL path prefix (for Desktop, which has no header field).

Caddy workaround for Claude Desktop

Contributed by @WinFuture23:

@bearer header Authorization "Bearer {$AUTH_TOKEN}"
handle @bearer {
    reverse_proxy pullmd:3000
}

@token_path path /{$AUTH_TOKEN}/* /{$AUTH_TOKEN}
handle @token_path {
    uri strip_prefix /{$AUTH_TOKEN}
    reverse_proxy pullmd:3000
}

Then in Claude Desktop’s connector dialog, use the URL with the token path prefix: https://your-instance.com/<TOKEN>/mcp. CLI clients keep using the Authorization header as normal.

This is a stopgap pattern; native OAuth (Phase 2) will remove the need for it.


API

| Endpoint | Returns |
| --- | --- |
| GET /api?url=… | Markdown (or JSON / plain text via format=). |
| GET /api/stream?url=… | Server-Sent Events stream of extraction-stage status, ending in a result event. Used by the PWA. |
| GET /s/:id | Cached Markdown by share id; refreshes from source if > 1 h old. |
| GET /api/history | Recent conversions (JSON). |
| GET /api/archive | Paginated full archive. |
| GET /api/storage | Cache size / hit-rate stats. |
| GET /api/stats | Extraction telemetry (sources, quality, latency). |
| POST /mcp | Streamable-HTTP MCP endpoint (3 tools: read_url, get_share, list_recent). |
| GET /web-reader.zip | Claude Code skill bundle, with this instance’s URL baked in. |
| GET /help | Bilingual user/agent setup guide. |

/api parameters

| Param | Default | Notes |
| --- | --- | --- |
| url | | Required. |
| comments | true | Include Reddit comments. Ignored for non-Reddit URLs. |
| comment_depth | 3 | Max nesting depth (1–10). |
| comment_limit | 15 | Max top-level comments. |
| frontmatter | false | Prepend YAML metadata. |
| format | md | text strips Markdown; json returns structured response. |
| nocache | false | Bypass the 1-hour cache. |
| render | auto | force → always render via Playwright. skip → never render. Bypasses cache. |
| lang | de | Comments-section header language (de or en). |

Response headers

  • X-Source: reddit · cloudflare · readability · readability-fallback · trafilatura · playwright
  • X-Quality: 0.0–1.0 extraction confidence
  • X-Share-Id: the 8-hex permalink id
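Putting the /api parameters and response headers together, a small standard-library client might look like the sketch below. The helper names build_api_url and fetch_markdown are hypothetical; the parameter and header names are the ones documented above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_url(base_url: str, target_url: str, **params) -> str:
    """Build GET /api?url=... with optional query params (booleans become true/false)."""
    query = {"url": target_url}
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[name] = str(value)
    return f"{base_url.rstrip('/')}/api?{urlencode(query)}"


def fetch_markdown(base_url: str, target_url: str, api_key=None, **params):
    """Fetch Markdown plus the three documented response headers."""
    req = Request(build_api_url(base_url, target_url, **params))
    if api_key:  # only needed when the instance runs with auth enabled
        req.add_header("Authorization", f"Bearer {api_key}")
    with urlopen(req, timeout=60) as resp:
        body = resp.read().decode("utf-8")
        meta = {h: resp.headers.get(h) for h in ("X-Source", "X-Quality", "X-Share-Id")}
    return body, meta
```

A call such as fetch_markdown("http://localhost:3000", "https://example.com", frontmatter=True) would return the Markdown together with the source, quality score, and share id, which lets a caller decide whether to retry with render=force.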

Cache & TTLs

  • /api?url=… re-fetches from source if the cache row is older than 1 hour.
  • /s/:id does the same on-demand refresh, so share links double as live endpoints.
  • Cache rows are pruned 90 days after the last write. /s/:id hits keep the row alive (since they trigger refresh + write); read-only access does not extend the TTL.
  • If the source is unreachable on refresh, the last good snapshot is served — share links keep working even when the original URL dies.
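The refresh and fallback rules above can be summarized in a short sketch. It is a simplified model under assumptions: the names CacheRow, serve, and is_prunable are hypothetical, timestamps are plain seconds, and the real cache lives in SQLite.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FRESH_FOR_S = 60 * 60              # rows older than 1 hour are re-fetched
PRUNE_AFTER_S = 90 * 24 * 60 * 60  # rows are pruned 90 days after the last write


@dataclass
class CacheRow:
    markdown: str
    written_at: float  # time of the last successful write


def serve(row: Optional[CacheRow], now: float, refetch: Callable[[], str]) -> CacheRow:
    """Return the row to serve: fresh cache, refreshed copy, or last good snapshot."""
    if row is not None and now - row.written_at <= FRESH_FOR_S:
        return row  # still fresh: a pure read, no write
    try:
        return CacheRow(markdown=refetch(), written_at=now)  # refresh counts as a write
    except Exception:
        if row is not None:
            return row  # source unreachable: serve the stale snapshot unchanged
        raise


def is_prunable(row: CacheRow, now: float) -> bool:
    """True once 90 days have passed since the last write."""
    return now - row.written_at > PRUNE_AFTER_S
```

Because only a successful refresh updates written_at, a link whose source has died keeps serving its snapshot but is no longer kept alive past the 90-day window in this model.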

Architecture

  • server.js — Express app factory (createApp) with dependency injection for tests. Exposes /api and /api/stream (SSE).
  • lib/reddit.js — Reddit URL normalization, redirect resolution, post + comment extraction.
  • lib/web.js — Orchestrator: Cloudflare-Markdown short-circuit, then static Readability + Trafilatura with pickBest, then optional Playwright re-render + re-extract on body-soup / low-quality output.
  • lib/render-decision.js — Predicate that decides when to fall back to Playwright (readability-fellback + thin, body-soup signature, or quality < 0.5; plus force / skip overrides).
  • lib/playwright-client.js — HTTP client for the Playwright sidecar with AbortSignal propagation for SSE-disconnect cancellation.
  • lib/scoring.js — Quality scoring used to pick between extractors and as a render-trigger heuristic.
  • lib/cache.js — SQLite cache (better-sqlite3) with 90-day TTL and 8-hex share ids.
  • lib/mcp.js — Stateless MCP server registering the three tools.
  • lib/distrib.js — Public-URL substitution in /help and /web-reader.zip.
  • trafilatura-sidecar/ — Python sidecar (FastAPI) wrapping Trafilatura.
  • playwright-sidecar/ — Python sidecar (FastAPI + Playwright + Chromium) for JS-rendered pages.
  • public/ — PWA frontend (vanilla JS, dark/paper themes, service worker, EventSource client for /api/stream).
  • skill/web-reader/ — Claude Code skill source (templated with __PULLMD_URL__).
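The render decision described for lib/render-decision.js can be approximated with the Python sketch below. It is a guess at the shape of the predicate, not a port: the function name, the thin_threshold of 500 characters, and the precomputed body_soup flag are assumptions; only the three triggers, the 0.5 quality cut-off, and the force / skip overrides come from the description above.

```python
def should_render(mode: str, source: str, text_length: int,
                  body_soup: bool, quality: float,
                  thin_threshold: int = 500) -> bool:
    """Decide whether to re-render a page with Playwright before extracting again."""
    if mode == "force":   # render=force always renders
        return True
    if mode == "skip":    # render=skip never renders
        return False
    # Readability had to fall back and produced very little text.
    if source == "readability-fallback" and text_length < thin_threshold:
        return True
    # Output looks like an undifferentiated dump of the page body.
    if body_soup:
        return True
    # Otherwise render only when extraction confidence is low.
    return quality < 0.5
```

Checking the overrides first keeps render=skip authoritative even when every heuristic says the static result is poor.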

License

GNU AGPL v3 — Copyright © 2026 Aeterna Labs.

PullMD is free software: you can redistribute it and modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, version 3 or later. If you run a modified version as a network service, you must make your modifications available to its users.
