@Jolyne_AI: A Python script that can automatically read PDF books: AI Reads Books. Drop a PDF in, run it to parse content page by page, capture key knowledge points, and automatically generate a well-structured Markdown summary. GitHub: https://github.com…
Summary
A Python script that can automatically parse PDF book content, extract key knowledge points, and generate Markdown-format summaries, aiming to improve reading and knowledge organization efficiency.
Cached at: 06/29/26, 02:21 AM
A Python script that automatically reads PDF books: AI Reads Books.
Drop a PDF in, and it will parse content page by page, extract key knowledge points, and automatically generate well-structured Markdown summaries.
GitHub: https://github.com/echohive42/AI-reads-books-page-by-page…
Website: https://echohive.ai
Leave the repetitive work of flipping pages, extracting, and organizing to the AI so you can focus on the key points: insights surface faster and the content is easier to understand and digest. A genuinely useful reading assistant.
echohive42/AI-reads-books-page-by-page
Source: https://github.com/echohive42/AI-reads-books-page-by-page
📚 AI reads books: Page-by-Page PDF Knowledge Extractor & Summarizer
The read_books.py script performs an intelligent page-by-page analysis of PDF books, methodically extracting knowledge points and generating progressive summaries at specified intervals. It processes each page individually, allowing for detailed content understanding while maintaining the contextual flow of the book. Below is a detailed explanation of how the script works:
Features
- 📚 Automated PDF book analysis and knowledge extraction
- 🤖 AI-powered content understanding and summarization
- 📊 Interval-based progress summaries
- 💾 Persistent knowledge base storage
- 📝 Markdown-formatted summaries
- 🎨 Color-coded terminal output for better visibility
- 🔄 Resume capability with existing knowledge base
- ⚙️ Configurable analysis intervals and test modes
- 🚫 Smart content filtering (skips TOC, index pages, etc.)
- 📂 Organized directory structure for outputs
❤️ Go deeper: Get Amplified + weekly 1000x LAB meetings
This repo is one small doorway into the larger AI-building practice I share with patrons.
- ❤️ Support me on Patreon (https://www.patreon.com/c/echohive42/membership) to get the full project collection, source code, explanations, and ongoing AI-building material.
- 🎥 Get Amplified (https://www.patreon.com/collection/1845761) is my 55-video-post series for thinking fast, building faster, and speed-running your creativity. Learn to use Codex, Claude Code, Cursor, and other AI tools effectively and creatively. Get amplified by becoming amplifiable, with more chapters on the way in quick succession.
- 🧠 1000x LAB for Architect+ tiers (https://www.patreon.com/collection/759209) is the patron meeting archive: 82 focused 1000x meetings so far, with a new one added every week. These sessions go behind the scenes on real builds, agent workflows, creative tooling, and the decisions that turn experiments into finished, usable systems.
- 🤝 Higher memberships also include 1-on-1 meetings for more direct guidance on your AI builds, workflows, and creative direction.
- 🚀 Patrons get the deeper context around projects like this: source code, walkthroughs, implementation notes, and a steady stream of examples for turning AI ideas into working products.
How to Use
- Setup
  ```bash
  # Clone the repository
  git clone [repository-url]
  cd [repository-name]

  # Install requirements
  pip install -r requirements.txt
  ```
- Configure
  - Place your PDF file in the project root directory
  - Open read_books.py and update the PDF_NAME constant with your PDF filename
  - (Optional) Adjust other constants like ANALYSIS_INTERVAL or TEST_PAGES
- Run
  ```bash
  python read_books.py
  ```
- Output: The script will generate:
  - book_analysis/knowledge_bases/: JSON files containing extracted knowledge
  - book_analysis/summaries/: Markdown files with interval and final summaries
  - book_analysis/pdfs/: Copy of your PDF file
- Customization Options
  - Set ANALYSIS_INTERVAL = None to skip interval summaries
  - Set TEST_PAGES = None to process the entire book
  - Adjust MODEL and ANALYSIS_MODEL for different AI models
Configuration Constants
- PDF_NAME: The name of the PDF file to be analyzed.
- BASE_DIR: The base directory for the analysis.
- PDF_DIR: Directory where the PDF file is stored.
- KNOWLEDGE_DIR: Directory where the knowledge base will be saved.
- SUMMARIES_DIR: Directory where the summaries will be saved.
- PDF_PATH: Full path to the PDF file.
- OUTPUT_PATH: Path to the knowledge base JSON file.
- ANALYSIS_INTERVAL: Number of pages after which an interval analysis is generated. Set to None to skip interval analyses.
- MODEL: The model used for processing pages.
- ANALYSIS_MODEL: The model used for generating analyses.
- TEST_PAGES: Number of pages to process for testing. Set to None to process the entire book.
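As a rough illustration of how these constants might fit together, here is a minimal sketch. The directory names follow the output layout listed under How to Use. The PDF filename, the knowledge base filename scheme, the numeric values, and the model names are placeholders chosen for this example, not the repository's defaults.

```python
from pathlib import Path

# All values below are illustrative placeholders, not the repository's defaults.
PDF_NAME = "example_book.pdf"  # hypothetical filename

BASE_DIR = Path("book_analysis")
PDF_DIR = BASE_DIR / "pdfs"
KNOWLEDGE_DIR = BASE_DIR / "knowledge_bases"
SUMMARIES_DIR = BASE_DIR / "summaries"

PDF_PATH = PDF_DIR / PDF_NAME
# Assumed naming scheme for the knowledge base file.
OUTPUT_PATH = KNOWLEDGE_DIR / f"{Path(PDF_NAME).stem}_knowledge.json"

ANALYSIS_INTERVAL = 20  # pages between interval summaries; None disables them
TEST_PAGES = 60         # pages to process in a test run; None means the whole book

MODEL = "your-page-model"              # placeholder, substitute a real model name
ANALYSIS_MODEL = "your-summary-model"  # placeholder, substitute a real model name
```

Deriving every path from BASE_DIR and PDF_NAME means that switching to another book only requires changing one constant.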
Classes and Functions
PageContent Class
A Pydantic model that represents the structure of the response from the OpenAI API for page content analysis. It has two fields:
- has_content: A boolean indicating if the page has relevant content.
- knowledge: A list of knowledge points extracted from the page.
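To make the two-field shape concrete, the sketch below uses a standard-library dataclass as a dependency-free stand-in. The script itself uses a Pydantic model, so this is not its actual class. The parse_page_response helper and the sample JSON strings are hypothetical and exist only for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PageContent:
    """Stand-in for the Pydantic model: same two fields, no validation library."""
    has_content: bool
    knowledge: list[str] = field(default_factory=list)


def parse_page_response(raw_json: str) -> PageContent:
    """Hypothetical helper: turn a JSON response string into a PageContent."""
    data = json.loads(raw_json)
    return PageContent(
        has_content=bool(data["has_content"]),
        knowledge=[str(point) for point in data.get("knowledge", [])],
    )


# A page with nothing worth keeping (for example a table of contents).
toc_page = parse_page_response('{"has_content": false, "knowledge": []}')

# A page that yielded two knowledge points.
body_page = parse_page_response(
    '{"has_content": true, "knowledge": ["Point A", "Point B"]}'
)
```

The has_content flag lets the caller skip a page cheaply instead of inspecting the knowledge list.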
load_or_create_knowledge_base() -> Dict[str, Any]
Loads the existing knowledge base from the JSON file if it exists. If not, it returns an empty dictionary.
save_knowledge_base(knowledge_base: list[str])
Saves the knowledge base to a JSON file. It prints a message indicating the number of items saved.
process_page(client: OpenAI, page_text: str, current_knowledge: list[str], page_num: int) -> list[str]
Processes a single page of the PDF. It sends the page text to the OpenAI API for analysis and updates the knowledge base with the extracted knowledge points. It also saves the updated knowledge base to a JSON file.
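The control flow of this step can be sketched without a network call. In the example below the OpenAI client is replaced by an injected analyze callable, and fake_analyze is a hypothetical stub whose rules are invented for the demo. Saving to disk is left out to keep the sketch short.

```python
from typing import Callable

# Stand-in for the structured API response: (has_content, knowledge points).
PageAnalysis = tuple[bool, list[str]]


def process_page(
    analyze: Callable[[str], PageAnalysis],
    page_text: str,
    current_knowledge: list[str],
    page_num: int,
) -> list[str]:
    """Analyze one page and return the (possibly extended) knowledge list."""
    has_content, points = analyze(page_text)
    if not has_content:
        print(f"Page {page_num + 1}: skipped (no relevant content)")
        return current_knowledge
    print(f"Page {page_num + 1}: added {len(points)} knowledge point(s)")
    return current_knowledge + points


def fake_analyze(page_text: str) -> PageAnalysis:
    """Hypothetical stub standing in for the model call."""
    if page_text.lower().startswith("table of contents"):
        return False, []
    # Pretend the first sentence is the page's key point.
    return True, [page_text.split(".")[0].strip()]


knowledge: list[str] = []
knowledge = process_page(fake_analyze, "Table of Contents\n1 Intro", knowledge, 0)
knowledge = process_page(
    fake_analyze, "Spaced repetition improves recall. More text.", knowledge, 1
)
```

Passing the analyzer in as a parameter is a choice made for this sketch; it lets the page logic be exercised without API credentials.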
load_existing_knowledge() -> list[str]
Loads the existing knowledge base from the JSON file if it exists. If not, it returns an empty list.
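The save and load helpers described above amount to a JSON round trip, which is what makes resuming possible. The sketch below assumes a {"knowledge": [...]} file layout and adds an explicit output_path parameter. Both are assumptions made for this example, since the script takes its path from the OUTPUT_PATH constant and its on-disk layout is not described here.

```python
import json
import tempfile
from pathlib import Path


def save_knowledge_base(knowledge_base: list[str], output_path: Path) -> None:
    """Write the knowledge list to JSON (the file layout is an assumption)."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", encoding="utf-8") as f:
        json.dump({"knowledge": knowledge_base}, f, indent=2, ensure_ascii=False)
    print(f"Saved knowledge base ({len(knowledge_base)} items)")


def load_existing_knowledge(output_path: Path) -> list[str]:
    """Return the stored knowledge list, or an empty list if no file exists."""
    if not output_path.exists():
        return []
    with output_path.open("r", encoding="utf-8") as f:
        return json.load(f).get("knowledge", [])


# Round trip in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "knowledge_bases" / "demo.json"
    missing = load_existing_knowledge(path)   # no file yet
    save_knowledge_base(["First point", "Second point"], path)
    restored = load_existing_knowledge(path)  # same items come back
```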
analyze_knowledge_base(client: OpenAI, knowledge_base: list[str]) -> str
Generates a comprehensive summary of the entire knowledge base using the OpenAI API. It returns the summary in markdown format.
setup_directories()
Sets up the necessary directories for the analysis. It clears any previously generated files and ensures the PDF file is in the correct location.
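A minimal sketch of this kind of setup step follows. The subdirectory names come from the output layout listed under How to Use. Which directories get cleared, and the function's parameters, are assumptions made for this example, so the real script may behave differently, particularly around resuming from an existing knowledge base.

```python
import shutil
import tempfile
from pathlib import Path


def setup_directories(base_dir: Path, source_pdf: Path) -> Path:
    """Recreate the output directories and copy the PDF into place."""
    # Assumption: both generated-output directories are wiped on each run.
    for directory in (base_dir / "knowledge_bases", base_dir / "summaries"):
        if directory.exists():
            shutil.rmtree(directory)
        directory.mkdir(parents=True)

    pdf_dir = base_dir / "pdfs"
    pdf_dir.mkdir(parents=True, exist_ok=True)
    target = pdf_dir / source_pdf.name
    if not target.exists():
        shutil.copy2(source_pdf, target)
    return target


# Demo in a temporary directory with a placeholder file standing in for a PDF.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pdf = root / "demo.pdf"
    pdf.write_bytes(b"%PDF-1.4 placeholder")

    stale = root / "book_analysis" / "summaries" / "old.md"
    stale.parent.mkdir(parents=True)
    stale.write_text("old summary", encoding="utf-8")

    copied = setup_directories(root / "book_analysis", pdf)
    copied_ok = copied.exists()
    stale_removed = not stale.exists()
    dirs_created = sorted(p.name for p in (root / "book_analysis").iterdir())
```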
save_summary(summary: str, is_final: bool = False)
Saves the generated summary to a markdown file. It creates a file with a proper naming convention based on whether it is a final or interval summary.
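The exact naming convention is not spelled out here, so the sketch below invents one for illustration: book stem, summary kind, a running number, and a timestamp. The extra parameters (summaries_dir, pdf_name, number, now) are also assumptions added to keep the example self-contained.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def summary_filename(pdf_name: str, is_final: bool, number: int, now: datetime) -> str:
    """Build a filename under a hypothetical naming scheme."""
    kind = "final" if is_final else "interval"
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return f"{Path(pdf_name).stem}_{kind}_{number:03d}_{stamp}.md"


def save_summary(
    summary: str,
    summaries_dir: Path,
    pdf_name: str,
    is_final: bool = False,
    number: int = 1,
    now: Optional[datetime] = None,
) -> Path:
    """Write the Markdown summary and return the path that was written."""
    summaries_dir.mkdir(parents=True, exist_ok=True)
    name = summary_filename(pdf_name, is_final, number, now or datetime.now())
    path = summaries_dir / name
    path.write_text(summary, encoding="utf-8")
    return path


fixed_time = datetime(2025, 1, 2, 3, 4, 5)
with tempfile.TemporaryDirectory() as tmp:
    written = save_summary(
        "# Interval summary\n", Path(tmp), "book.pdf", number=2, now=fixed_time
    )
    written_name = written.name
    written_text = written.read_text(encoding="utf-8")
```

Putting the kind and a timestamp in the name keeps interval and final summaries from overwriting each other across runs.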
print_instructions()
Prints instructions for using the script. It explains the configuration options and how to run the script.
main()
The main function that orchestrates the entire process. It sets up directories, loads the knowledge base, processes each page of the PDF, generates interval and final summaries, and saves them.
How It Works
- Setup: The script sets up the necessary directories and ensures the PDF file is in the correct location.
- Load Knowledge Base: It loads the existing knowledge base if it exists.
- Process Pages: It processes each page of the PDF, extracting knowledge points and updating the knowledge base.
- Generate Summaries: It generates interval summaries based on the ANALYSIS_INTERVAL and a final summary after processing all pages.
- Save Results: It saves the knowledge base and summaries to their respective files.
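The steps above can be sketched as a single loop. In this example the pages are plain strings, and extract and summarize are injected stand-ins for PDF text extraction plus the model calls. The run_pipeline function and its parameters are hypothetical and only illustrate the interval logic.

```python
from typing import Callable, Optional


def run_pipeline(
    pages: list[str],
    extract: Callable[[str], list[str]],
    summarize: Callable[[list[str]], str],
    analysis_interval: Optional[int],
    test_pages: Optional[int],
) -> tuple[list[str], list[str]]:
    """Process pages in order, emitting interval summaries and a final one."""
    knowledge: list[str] = []
    summaries: list[str] = []

    # test_pages=None means the whole book is processed.
    limit = len(pages) if test_pages is None else min(test_pages, len(pages))

    for page_num in range(limit):
        knowledge.extend(extract(pages[page_num]))
        pages_done = page_num + 1
        # analysis_interval=None disables interval summaries entirely.
        if analysis_interval and pages_done % analysis_interval == 0:
            summaries.append(f"interval@{pages_done}: {summarize(knowledge)}")

    summaries.append(f"final: {summarize(knowledge)}")
    return knowledge, summaries


demo_pages = [f"Fact {i}" for i in range(1, 6)]
knowledge, summaries = run_pipeline(
    demo_pages,
    extract=lambda text: [text],                     # stub: one point per page
    summarize=lambda items: f"{len(items)} points",  # stub: count the points
    analysis_interval=2,
    test_pages=None,
)
```

Each interval summary here is built from all knowledge gathered so far rather than only the latest batch of pages, which is one reading of "progressive summaries"; the script may do this differently.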
Running the Script
- Place your PDF in the same directory as the script.
- Update the PDF_NAME constant with your PDF filename.
- Run the script. It will process the book, extract knowledge points, and generate summaries.
Example Usage