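@Jolyne_AI: AI Reads Books, a Python script that can automatically read PDF books. Drop in a PDF and run it to parse the content page by page, capture the key knowledge points, and automatically generate a clearly structured Markdown summary. GitHub: https://github.com…
Summary
A Python script that automatically parses the content of PDF books, extracts key knowledge points, and generates a Markdown summary, with the aim of making reading and knowledge organization more efficient.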
Cached: 2026/06/29 02:21
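AI Reads Books is a Python script that can automatically read PDF books.
Drop in a PDF and run it to parse the content page by page, capture the key knowledge points, and automatically generate a clearly structured Markdown summary.
GitHub: https://github.com/echohive42/AI-reads-books-page-by-page…
Website: https://echohive.ai
Leave the repetitive work of page-turning, excerpting, and organizing to AI so you can focus on what matters: distill key points faster, and understand and absorb the content more easily. A genuinely useful reading aid.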
echohive42/AI-reads-books-page-by-page
Source: https://github.com/echohive42/AI-reads-books-page-by-page
📚 AI reads books: Page-by-Page PDF Knowledge Extractor & Summarizer
The read_books.py script performs an intelligent page-by-page analysis of PDF books, methodically extracting knowledge points and generating progressive summaries at specified intervals. It processes each page individually, allowing for detailed content understanding while maintaining the contextual flow of the book. Below is a detailed explanation of how the script works:
Features
- 📚 Automated PDF book analysis and knowledge extraction
- 🤖 AI-powered content understanding and summarization
- 📊 Interval-based progress summaries
- 💾 Persistent knowledge base storage
- 📝 Markdown-formatted summaries
- 🎨 Color-coded terminal output for better visibility
- 🔄 Resume capability with existing knowledge base
- ⚙️ Configurable analysis intervals and test modes
- 🚫 Smart content filtering (skips TOC, index pages, etc.)
- 📂 Organized directory structure for outputs
❤️ Go deeper: Get Amplified + weekly 1000x LAB meetings
This repo is one small doorway into the larger AI-building practice I share with patrons.
- ❤️ Support me on Patreon to get the full project collection, source code, explanations, and ongoing AI-building material.
- 🎥 Get Amplified is my 55-video-post series for thinking fast, building faster, and speed-running your creativity. Learn to use Codex, Claude Code, Cursor, and other AI tools effectively and creatively. Get amplified by becoming amplifiable, with more chapters on the way in quick succession.
- 🧠 1000x LAB for Architect+ tiers is the patron meeting archive: 82 focused 1000x meetings so far, with a new one added every week. These sessions go behind the scenes on real builds, agent workflows, creative tooling, and the decisions that turn experiments into finished, usable systems.
- 🤝 Higher memberships also include 1-on-1 meetings for more direct guidance on your AI builds, workflows, and creative direction.
- 🚀 Patrons get the deeper context around projects like this: source code, walkthroughs, implementation notes, and a steady stream of examples for turning AI ideas into working products.
How to Use
- Setup
    # Clone the repository
    git clone [repository-url]
    cd [repository-name]
    # Install requirements
    pip install -r requirements.txt
- Configure
  - Place your PDF file in the project root directory
  - Open read_books.py and update the PDF_NAME constant with your PDF filename
  - (Optional) Adjust other constants like ANALYSIS_INTERVAL or TEST_PAGES
- Run
    python read_books.py
- Output: The script will generate:
  - book_analysis/knowledge_bases/: JSON files containing extracted knowledge
  - book_analysis/summaries/: Markdown files with interval and final summaries
  - book_analysis/pdfs/: Copy of your PDF file
- Customization Options
  - Set ANALYSIS_INTERVAL = None to skip interval summaries
  - Set TEST_PAGES = None to process entire book
  - Adjust MODEL and ANALYSIS_MODEL for different AI models
Configuration Constants
- PDF_NAME: The name of the PDF file to be analyzed.
- BASE_DIR: The base directory for the analysis.
- PDF_DIR: Directory where the PDF file is stored.
- KNOWLEDGE_DIR: Directory where the knowledge base will be saved.
- SUMMARIES_DIR: Directory where the summaries will be saved.
- PDF_PATH: Full path to the PDF file.
- OUTPUT_PATH: Path to the knowledge base JSON file.
- ANALYSIS_INTERVAL: Number of pages after which an interval analysis is generated. Set to None to skip interval analyses.
- MODEL: The model used for processing pages.
- ANALYSIS_MODEL: The model used for generating analyses.
- TEST_PAGES: Number of pages to process for testing. Set to None to process the entire book.
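As a rough illustration of how these constants could relate to one another, here is a minimal sketch built on pathlib. The directory names come from the output layout listed earlier. The PDF filename, the knowledge base filename pattern, the interval and page counts, and the model names are all placeholder assumptions, not the script's actual defaults.

```python
from pathlib import Path

# All concrete values here are illustrative assumptions.
PDF_NAME = "example_book.pdf"            # hypothetical filename

BASE_DIR = Path("book_analysis")
PDF_DIR = BASE_DIR / "pdfs"
KNOWLEDGE_DIR = BASE_DIR / "knowledge_bases"
SUMMARIES_DIR = BASE_DIR / "summaries"

PDF_PATH = PDF_DIR / PDF_NAME
# Assumed naming pattern for the knowledge base file; the real script may differ.
OUTPUT_PATH = KNOWLEDGE_DIR / f"{Path(PDF_NAME).stem}_knowledge.json"

ANALYSIS_INTERVAL = 20                   # pages per interval summary, or None to skip
MODEL = "placeholder-model"              # placeholder, not a real model name
ANALYSIS_MODEL = "placeholder-model"     # placeholder, not a real model name
TEST_PAGES = 10                          # or None to process the whole book

print(PDF_PATH)
print(OUTPUT_PATH)
```

Deriving every path from BASE_DIR keeps the output tree in one place, so moving the analysis folder would only require changing a single constant.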
Classes and Functions
PageContent Class
A Pydantic model that represents the structure of the response from the OpenAI API for page content analysis. It has two fields:
- has_content: A boolean indicating if the page has relevant content.
- knowledge: A list of knowledge points extracted from the page.
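To make the shape of this response concrete, the sketch below uses a standard-library dataclass as a stand-in for the Pydantic model, so it runs without extra dependencies. The parse_page_content helper and the sample JSON replies are hypothetical and only illustrate how the two fields might be filled.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PageContent:
    """Stdlib stand-in for the Pydantic model described above."""
    has_content: bool
    knowledge: list[str] = field(default_factory=list)


def parse_page_content(raw_json: str) -> PageContent:
    """Hypothetical helper: build a PageContent from a JSON reply."""
    data = json.loads(raw_json)
    return PageContent(
        has_content=bool(data["has_content"]),
        knowledge=[str(item) for item in data.get("knowledge", [])],
    )


# Made-up replies: one for a table-of-contents page, one for a content page.
toc_page = parse_page_content('{"has_content": false, "knowledge": []}')
real_page = parse_page_content(
    '{"has_content": true, "knowledge": ["Example knowledge point"]}'
)
print(toc_page)
print(real_page)
```

The has_content flag is what would let a caller skip pages such as tables of contents or indexes without adding anything to the knowledge base.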
load_or_create_knowledge_base() -> Dict[str, Any]
Loads the existing knowledge base from the JSON file if it exists. If not, it returns an empty dictionary.
save_knowledge_base(knowledge_base: list[str])
Saves the knowledge base to a JSON file. It prints a message indicating the number of items saved.
process_page(client: OpenAI, page_text: str, current_knowledge: list[str], page_num: int) -> list[str]
Processes a single page of the PDF. It sends the page text to the OpenAI API for analysis and updates the knowledge base with the extracted knowledge points. It also saves the updated knowledge base to a JSON file.
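The control flow of this step can be sketched without any network access by passing in a stand-in analyzer. In the sketch below, the analyze callable, its return shape, the zero-based page_num, and the toy fake_analyze function are all assumptions for illustration. The real function calls the OpenAI API and also persists the knowledge base, which is omitted here.

```python
from typing import Callable

# Hypothetical analyzer signature: page text in, (has_content, points) out.
# In the real script this role is played by a call to the OpenAI API.
Analyzer = Callable[[str], tuple[bool, list[str]]]


def process_page_sketch(
    analyze: Analyzer,
    page_text: str,
    current_knowledge: list[str],
    page_num: int,
) -> list[str]:
    """Return the knowledge base extended with this page's points."""
    has_content, points = analyze(page_text)
    if not has_content:
        print(f"Page {page_num + 1}: skipped (no relevant content)")
        return current_knowledge
    print(f"Page {page_num + 1}: added {len(points)} point(s)")
    return current_knowledge + points


def fake_analyze(text: str) -> tuple[bool, list[str]]:
    """Toy analyzer: treats pages starting with 'Contents' as empty."""
    if text.startswith("Contents"):
        return False, []
    return True, [text.split(".")[0]]


kb: list[str] = []
kb = process_page_sketch(fake_analyze, "Contents 1 2 3", kb, 0)
kb = process_page_sketch(
    fake_analyze, "Spaced repetition aids recall. More text.", kb, 1
)
print(kb)
```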
load_existing_knowledge() -> list[str]
Loads the existing knowledge base from the JSON file if it exists. If not, it returns an empty list.
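The load and save helpers described above amount to a JSON round trip, which is what makes resuming possible. The following is a minimal sketch under the assumption that the file stores the points under a "knowledge" key. That layout and the function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_knowledge(path: Path, knowledge_base: list[str]) -> None:
    """Write the knowledge base to disk and report how many items were saved."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The {"knowledge": [...]} wrapper is an assumed file layout.
    payload = json.dumps({"knowledge": knowledge_base}, indent=2)
    path.write_text(payload, encoding="utf-8")
    print(f"Saved {len(knowledge_base)} item(s) to {path.name}")


def load_knowledge(path: Path) -> list[str]:
    """Return saved knowledge points, or an empty list if no file exists."""
    if not path.exists():
        return []
    data = json.loads(path.read_text(encoding="utf-8"))
    return data.get("knowledge", [])


with tempfile.TemporaryDirectory() as tmp:
    kb_path = Path(tmp) / "knowledge_bases" / "demo.json"
    first_load = load_knowledge(kb_path)       # file missing, so empty list
    save_knowledge(kb_path, ["point A", "point B"])
    second_load = load_knowledge(kb_path)      # picks up the saved points
```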
analyze_knowledge_base(client: OpenAI, knowledge_base: list[str]) -> str
Generates a comprehensive summary of the entire knowledge base using the OpenAI API. It returns the summary in markdown format.
setup_directories()
Sets up the necessary directories for the analysis. It clears any previously generated files and ensures the PDF file is in the correct location.
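A minimal sketch of the directory setup might look like the following. It creates the three output folders listed under Output and copies the PDF from the project root if it is not already in place. The function name and parameters are hypothetical, and the clearing of previously generated files is deliberately left out.

```python
import shutil
import tempfile
from pathlib import Path


def setup_directories_sketch(root: Path, pdf_name: str) -> Path:
    """Create the output tree and make sure the PDF sits in pdfs/."""
    base = root / "book_analysis"
    for sub in ("pdfs", "knowledge_bases", "summaries"):
        (base / sub).mkdir(parents=True, exist_ok=True)

    source = root / pdf_name
    target = base / "pdfs" / pdf_name
    if not target.exists() and source.exists():
        shutil.copy2(source, target)   # keep the original in the project root
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "demo.pdf").write_bytes(b"%PDF-1.4 placeholder")
    copied = setup_directories_sketch(root, "demo.pdf")
    pdf_was_copied = copied.exists()
    created = sorted(p.name for p in (root / "book_analysis").iterdir())
```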
save_summary(summary: str, is_final: bool = False)
Saves the generated summary to a markdown file. It creates a file with a proper naming convention based on whether it is a final or interval summary.
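The exact naming convention is not spelled out here, so the sketch below assumes one possible scheme: a prefix of "interval" or "final" followed by a running number. The function name, the summaries_dir parameter, and the filename pattern are all hypothetical.

```python
import tempfile
from pathlib import Path


def save_summary_sketch(
    summaries_dir: Path, summary: str, is_final: bool = False
) -> Path:
    """Write a Markdown summary using an assumed prefix-plus-number name."""
    summaries_dir.mkdir(parents=True, exist_ok=True)
    prefix = "final" if is_final else "interval"
    # Number the file by counting existing summaries of the same kind.
    index = len(list(summaries_dir.glob(f"{prefix}_*.md"))) + 1
    path = summaries_dir / f"{prefix}_{index:03d}.md"
    path.write_text(summary, encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "summaries"
    names = [
        save_summary_sketch(out, "# Pages 1-20").name,
        save_summary_sketch(out, "# Pages 21-40").name,
        save_summary_sketch(out, "# Whole book", is_final=True).name,
    ]
    print(names)
```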
print_instructions()
Prints instructions for using the script. It explains the configuration options and how to run the script.
main()
The main function that orchestrates the entire process. It sets up directories, loads the knowledge base, processes each page of the PDF, generates interval and final summaries, and saves them.
How It Works
- Setup: The script sets up the necessary directories and ensures the PDF file is in the correct location.
- Load Knowledge Base: It loads the existing knowledge base if it exists.
- Process Pages: It processes each page of the PDF, extracting knowledge points and updating the knowledge base.
- Generate Summaries: It generates interval summaries based on the ANALYSIS_INTERVAL and a final summary after processing all pages.
- Save Results: It saves the knowledge base and summaries to their respective files.
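The steps above can be sketched as a single loop. This is a simplified illustration, not the script's actual main function: the extract and summarize callables stand in for the model calls, PDF reading and file saving are omitted, and summarizing the whole accumulated knowledge base at each interval is an assumption.

```python
from typing import Callable, Optional


def run_pipeline_sketch(
    pages: list[str],
    extract: Callable[[str], list[str]],
    summarize: Callable[[list[str]], str],
    analysis_interval: Optional[int] = None,
    test_pages: Optional[int] = None,
) -> tuple[list[str], list[str], str]:
    """Return (knowledge_base, interval_summaries, final_summary)."""
    if test_pages is not None:
        pages = pages[:test_pages]          # test mode: only the first N pages

    knowledge: list[str] = []
    interval_summaries: list[str] = []
    for page_num, text in enumerate(pages, start=1):
        knowledge.extend(extract(text))
        # A None interval disables interval summaries entirely.
        if analysis_interval and page_num % analysis_interval == 0:
            interval_summaries.append(summarize(knowledge))

    final_summary = summarize(knowledge)
    return knowledge, interval_summaries, final_summary


def toy_extract(text: str) -> list[str]:
    """Toy extractor: one upper-cased point per non-empty page."""
    return [text.upper()] if text else []


def toy_summarize(points: list[str]) -> str:
    """Toy summarizer: just reports how many points it was given."""
    return f"{len(points)} point(s)"


kb, intervals, final = run_pipeline_sketch(
    ["a", "", "b", "c", "d"],
    toy_extract,
    toy_summarize,
    analysis_interval=2,
    test_pages=4,
)
print(kb, intervals, final)
```

With test_pages=4 the fifth page is never read, and with analysis_interval=2 a summary is produced after pages 2 and 4, followed by one final summary.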
Running the Script
- Place your PDF in the same directory as the script.
- Update the PDF_NAME constant with your PDF filename.
- Run the script. It will process the book, extract knowledge points, and generate summaries.
Example Usage