@CycleDecoded: Stop paying a fortune for crappy crawler software and paying the IQ tax! This open-source tool is insane — it directly exposes all social media platforms' data. The game for traffic matrix and data monetization players is over! Meet MediaCrawler, a GitHub project with over 54,000 stars. In plain language…

X AI KOLs Timeline 06/30/26, 09:42 AM Tools

open-source web-scraping social-media data-mining python automation

Summary

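MediaCrawler is an open-source, multi-platform social media crawler with over 54,000 stars on GitHub. It collects data from seven platforms (Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu). The post highlights multi-account support, an IP proxy pool, resumable crawls, and AI agent integration; the project README lists multi-account support, resumable crawls, and AI agent skills under the separate MediaCrawlerPro release.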

Stop spending a fortune on crappy crawler software and paying the IQ tax! A recent open-source tool is so outrageous that it pulls data straight from every major social media platform. Anyone in the traffic matrix and data monetization game is in real danger!

This is MediaCrawler, an open-source project on GitHub with over 54,000 stars. In plain terms, it's a universal "data pump" for all platforms. Whether it's Xiaohongshu's viral posts, Douyin's short videos, or hot comments from Bilibili, Weibo, Kuaishou, Zhihu, and Tieba, you can run everything locally with one click. It is open source and free, no tricks!

Its latest iterations come with several killer features that anyone in Web3, data analysis, or side-hustle traffic will instantly appreciate:

- **Cross-platform data domination**: Full coverage of 7 major content platforms. Batch search for trending content and grab competitor account info. It's a total game-changer!
- **Multi-account + IP proxy pool**: Environment isolation to avoid triggering risk control, easily bypass platform restrictions, and quietly grab data!
- **Breakpoint resume**: Network disconnection ruining your work? Not anymore. Pause and resume anytime!
- **AI Agent automation integration**: Seamlessly connected with Cursor and Claude, letting AI do the work for you automatically!

⚠️ **Warning**: This tool is incredibly powerful. Use it for side-hustle trend discovery and competitive analysis. Those who know, know. Don't cross the line!

GitHub repo: https://github.com/NanmiCoder/MediaCrawler…

Cached at: 06/30/26, 03:45 PM

Stop wasting money on overpriced scraping tools — that’s just paying the idiot tax. There’s a new open-source tool that’s absolutely insane: it rips the data from every major social platform, putting the entire traffic matrix and data monetization game at serious risk. Meet MediaCrawler, the superstar open-source project on GitHub with over 54,000 stars. In plain English, it’s a cross-platform data pump. Whether it’s Xiaohongshu’s viral posts, Douyin’s short videos, or the hot comments from Bilibili, Weibo, Kuaishou, Zhihu, or Tieba — one click and it runs locally, completely free and open-source with no tricks.

Its latest iteration packs some killer features that anyone in Web3, data analysis, or side-hustle traffic generation will immediately appreciate:

Universal data harvesting: Covers 7 major content platforms. Batch-finding trending posts and stalking competitor accounts is a total cheat code.
Multi‑account + IP proxy pool: Environment isolation beats anti‑scraping measures. Silently siphon data without detection.
Resume capability: Network drops mid‑crawl? Never again. Pause and resume anytime.
AI Agent integration: Already seamlessly connected with Cursor, Claude, etc. Let the AI do the work for you.

Warning: This tool is powerful. Use it for side-project trend analysis or competitor research. If you know, you know — don’t cross the line.

GitHub repo: https://github.com/NanmiCoder/MediaCrawler…

NanmiCoder/MediaCrawler

Source: https://github.com/NanmiCoder/MediaCrawler

🔥 MediaCrawler - Self-Media Platform Crawler 🕷️

GitHub Stars (https://github.com/NanmiCoder/MediaCrawler/stargazers) GitHub Forks (https://github.com/NanmiCoder/MediaCrawler/network/members) GitHub Issues (https://github.com/NanmiCoder/MediaCrawler/issues) GitHub Pull Requests (https://github.com/NanmiCoder/MediaCrawler/pulls) License (https://github.com/NanmiCoder/MediaCrawler/blob/main/LICENSE)

Disclaimer:

Please use this repository for learning purposes only ⚠️⚠️⚠️⚠️. For cases of illegal web scraping, see (https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China)

All content in this repository is for learning and reference only. Commercial use is prohibited. No individual or organization may use the content for illegal purposes or to infringe upon others’ legitimate rights. The scraping techniques covered are for study and research only and must not be used for large-scale scraping of any platform or for other illegal activities. The repository assumes no responsibility for any legal consequences arising from the use of its content. By using this repository, you agree to all terms and conditions of this disclaimer.

📖 Project Introduction

A powerful multi-platform self-media data collection tool that supports scraping public information from Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, Zhihu, and other mainstream platforms.

🔧 Technical Principles

Core Technology: Based on the Playwright (https://playwright.dev/) browser automation framework — logs in and saves the login state.
No JS reverse engineering: Uses the browser context with preserved login state to obtain signature parameters via JS expressions.
Advantages: No need to reverse-engineer complex encryption algorithms, significantly lowering the technical barrier.
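
The approach described above can be sketched roughly as follows. This is an illustration, not the project's actual code: the profile directory `browser_data`, the start URL you pass in, and the page-side function name `window.sign` are all placeholders, and the sketch assumes the `playwright` package and a Chromium build are installed.

```python
from pathlib import Path

# Hypothetical name of a signing function exposed by the page's own JavaScript.
SIGN_FUNCTION = "window.sign"


def build_sign_expression(function_name: str = SIGN_FUNCTION) -> str:
    """Return a JS arrow function that forwards (url, payload) to the page's signer."""
    return f"([url, payload]) => {function_name}(url, payload)"


def fetch_signature(start_url: str, api_path: str, payload: dict,
                    profile_dir: Path = Path("browser_data")) -> object:
    """Open a browser with a persistent profile and let the page sign a request."""
    # Imported lazily so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # A persistent context keeps cookies and local storage on disk,
        # so a login performed once is reused on later runs.
        context = p.chromium.launch_persistent_context(str(profile_dir), headless=False)
        page = context.new_page()
        page.goto(start_url)
        # The page's own code computes the signature; nothing is reverse-engineered.
        signature = page.evaluate(build_sign_expression(), [api_path, payload])
        context.close()
        return signature
```

The point of the design is that the signing logic stays inside the browser: the Python side only passes arguments in and reads the result back.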

✨ Feature Overview

Platform	Keyword Search	Scrape by Post ID	Sub‑comments	Scrape Creator Profile	Login State Cache	IP Proxy Pool	Generate Comment Word Cloud
Xiaohongshu	✅	✅	✅	✅	✅	✅	✅
Douyin	✅	✅	✅	✅	✅	✅	✅
Kuaishou	✅	✅	✅	✅	✅	✅	✅
Bilibili	✅	✅	✅	✅	✅	✅	✅
Weibo	✅	✅	✅	✅	✅	✅	✅
Tieba	✅	✅	✅	✅	✅	✅	✅
Zhihu	✅	✅	✅	✅	✅	✅	✅

🚀 MediaCrawlerPro – Major Release!

Open source is not easy — please subscribe to support us.

Focus on learning mature project architecture design. It’s not just about scraping; the code design of the Pro version is worth deep study!

MediaCrawlerPro (https://github.com/MediaCrawlerPro)

🎯 Core Feature Upgrades

✅ Self-Media Content Extraction Agent (new feature)
✅ Resume Capability (key feature)
✅ Multi‑account + IP Proxy Pool (key feature)
✅ Removed Playwright dependency – simpler to use
✅ Full Linux environment support

🏗️ Architecture Design Optimizations

✅ Code refactored & optimized – more readable and maintainable (decoupled JS signature logic)
✅ Enterprise‑grade code quality – suitable for building large‑scale scraping projects
✅ Perfect architecture design – high extensibility, great for learning source code

🎁 Additional Features

✅ Self-Media Video Downloader (Desktop) – ideal for learning full‑stack development
✅ Multi‑platform Home Feed recommendation (HomeFeed)
✅ AI Agent Skill support – One‑click install with OpenClaw (https://openclaw.ai/) 🦞 / Claude Code / Cursor. Let the Agent automatically scrape data.
Comment Analysis AI Agent – under development 🚀🚀

Click to view: MediaCrawlerPro Project Homepage (https://github.com/MediaCrawlerPro)


🚀 Quick Start

💡 If this project helps you, please give it a ⭐ Star!

📋 Prerequisites

🚀 Install uv (recommended)

Before proceeding, make sure you have uv installed:

Installation: uv official installation guide
Verify: Run uv --version in your terminal. A version number means success.
Why uv: The fastest Python package manager — speedy and accurate dependency resolution.

🟢 Node.js Installation

The project depends on Node.js. Download from the official site:

Download: https://nodejs.org/en/download/
Version requirement: >= 16.0.0
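
As a convenience, the two prerequisites above could be checked from a short script. This is a minimal sketch: it assumes `uv` and `node` are looked up on `PATH` and that both print a dotted version number for `--version`. The helper names are made up for this example.

```python
import re
import shutil
import subprocess
from typing import List, Optional, Tuple

MIN_NODE = (16, 0, 0)  # the Node.js version requirement stated above


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first dotted version number, e.g. 'v20.11.1' -> (20, 11, 1)."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def tool_version(command: str) -> Optional[Tuple[int, ...]]:
    """Return the version reported by `<command> --version`, or None if not on PATH."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run([command, "--version"], capture_output=True, text=True, check=True)
    return parse_version(result.stdout or result.stderr)


def check_prerequisites() -> List[str]:
    """Return human-readable problems; an empty list means both tools look fine."""
    problems = []
    if tool_version("uv") is None:
        problems.append("uv is not installed or not on PATH")
    node = tool_version("node")
    if node is None:
        problems.append("Node.js is not installed or not on PATH")
    elif node < MIN_NODE:
        problems.append(f"Node.js {'.'.join(map(str, node))} is older than 16.0.0")
    return problems
```

Calling `check_prerequisites()` and printing each returned string gives a quick report before you run `uv sync`.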

📦 Install Python Packages

# Enter the project directory
cd MediaCrawler

# Use uv sync to ensure consistent Python version and dependencies
uv sync

🌐 Browser Driver Installation (optional)

If you use the default CDP mode (which connects to an existing Chrome browser), you do not need to install browser drivers. They are only required in standard Playwright mode.

# Only needed for standard Playwright mode
uv run playwright install

🌍 Chrome Browser Configuration (recommended)

The project uses CDP mode by default to connect to your existing Chrome browser, reusing login state, cookies, extensions, etc. This significantly reduces the risk of platform detection.

Before using:

Install the latest Chrome (version >= 144) – Download
Enable remote debugging: In Chrome’s address bar, enter chrome://inspect/#remote-debugging, check “Allow remote debugging for this browser instance”
The page should show Server running at: 127.0.0.1:9222 – ready to go.

💡 Tip: After starting the crawler, Chrome will display a confirmation dialog — click “Accept”. The program waits up to 60 seconds for you to confirm.

If you don’t want to use CDP mode, set ENABLE_CDP_MODE = False in config/base_config.py to switch to standard Playwright mode.
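
Before starting the crawler in CDP mode, you may want to confirm that something is actually listening on the debugging address shown above. The sketch below only tests whether the TCP port accepts connections; it does not speak the DevTools protocol, and the function name is an assumption made for this example.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 9222, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `is_port_open()` returns `False`, remote debugging is probably not enabled yet in Chrome, or it is listening on a different port.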

🚀 Run the Crawler

# Check config/base_config.py for configuration options (comments are in Chinese)
# Read keywords from config, search posts, and scrape post info + comments
uv run main.py --platform xhs --lt qrcode --type search

# Read a list of post IDs from config and scrape specific posts + comments
uv run main.py --platform xhs --lt qrcode --type detail

# Scan QR code with the corresponding app to log in
# For other platforms, see:
uv run main.py --help
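
To run both crawl types back to back, a small wrapper along these lines may help. It uses only the flags shown above (`--platform`, `--lt`, `--type`); the function names and the `project_dir` parameter are assumptions made for this sketch.

```python
import subprocess
from typing import Iterable, List


def build_command(platform: str, login_type: str = "qrcode", crawl_type: str = "search") -> List[str]:
    """Assemble the same command line as the examples above."""
    return ["uv", "run", "main.py", "--platform", platform, "--lt", login_type, "--type", crawl_type]


def run_crawls(platform: str, crawl_types: Iterable[str], project_dir: str = ".") -> None:
    """Run one crawl per type in order, stopping at the first failure."""
    for crawl_type in crawl_types:
        command = build_command(platform, crawl_type=crawl_type)
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError if the crawler exits with an error.
        subprocess.run(command, cwd=project_dir, check=True)
```

For example, `run_crawls("xhs", ["search", "detail"])`, called from the MediaCrawler project root, would run the keyword search first and the post-detail crawl second.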

🖥️ WebUI Visual Interface

MediaCrawler provides a web‑based visual interface — no command line needed.

Start WebUI

# Start the API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload

# Or launch it as a module
uv run python -m api.main

After startup, visit http://localhost:8080 to open the WebUI.
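
When the server is started from a script, it can be useful to wait until the address above answers before opening it. The following is a sketch under the assumption that any HTTP response, even an error status, means the server is up; the function name and the timing defaults are illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str = "http://localhost:8080", timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Poll url until it returns any HTTP response, or until timeout seconds pass."""
    # Ignore system proxy settings: the target is a local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=2.0):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status, so it is running.
            return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet; wait a little and try again.
            time.sleep(interval)
    return False
```

`HTTPError` is caught before `URLError` because it is a subclass of it; reversing the order would treat error responses as "server not reachable".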

WebUI Features

Visual configuration of crawling parameters (platform, login method, crawl type, etc.)
Real‑time monitoring of crawl status and logs
Data preview and export

🔗 Using Python’s native venv (not recommended)

Create and Activate a Python Virtual Environment

For Douyin and Zhihu, Node.js (version >= 16) must be installed.

# Enter the project root
cd MediaCrawler

# Create virtual environment
# My Python version is 3.11 — requirements.txt is based on that.
# If you use a different Python version, dependencies may be incompatible — solve manually.
python -m venv venv

# macOS & Linux: activate
source venv/bin/activate

# Windows: activate
venv\Scripts\activate

Install Dependencies

pip install -r requirements.txt

Install Playwright Browser Drivers

playwright install

Run the Crawler (native environment)

# By default, comment scraping is disabled. To enable, modify ENABLE_GET_COMMENTS in config/base_config.py.
# Other options are documented with Chinese comments in config/base_config.py.
# Read keywords from config, search posts, and scrape post info + comments
python main.py --platform xhs --lt qrcode --type search

# Read a list of post IDs from config and scrape specific posts + comments
python main.py --platform xhs --lt qrcode --type detail

# Scan QR code with the corresponding app to log in
# For other platforms, see:
python main.py --help

💾 Data Storage

MediaCrawler supports multiple storage formats: CSV, JSON, JSONL, Excel, SQLite, and MySQL.

📖 Detailed instructions: Data Storage Guide
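
Once a crawl has written JSONL output, the records can be inspected with the standard library alone. The sketch below is hedged: the field names used in the usage note (`note_id`, `liked_count`) and the file path are placeholders, so check an actual output file for the real keys and location.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator, List


def read_jsonl(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def top_records(records: Iterable[Dict[str, Any]], field: str, limit: int = 10) -> List[Dict[str, Any]]:
    """Return the `limit` records with the largest numeric value in `field`."""
    def as_number(record: Dict[str, Any]) -> float:
        # Counts may be stored as strings or be missing; treat bad values as 0.
        try:
            return float(record.get(field, 0))
        except (TypeError, ValueError):
            return 0.0

    return sorted(records, key=as_number, reverse=True)[:limit]
```

For example, `top_records(read_jsonl(Path("output.jsonl")), "liked_count", limit=5)` would list the five most-liked records, assuming the file and that field exist.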

🚀 MediaCrawlerPro – Major Release 🚀!

More features, better architecture. Open source is not easy — please subscribe to support! (https://github.com/MediaCrawlerPro)

💬 Community & Groups

WeChat Group: Click to join
Bilibili Account: Follow me – sharing AI & scraping tech knowledge

💰 Sponsors

TikHub.io provides 900+ highly stable data APIs covering 14+ domestic and international platforms (TK, DY, XHS, Y2B, Ins, X, etc.). Supports public data APIs for users, content, products, comments, etc., along with 40M+ pre‑cleaned structured datasets. Use invitation code cfzyejV9 when registering and top up to receive an extra $2 credit.

Atlas Cloud is a full‑modal AI inference platform that gives developers a unified AI API for video generation, image generation, and LLMs, eliminating the need to maintain multiple vendor integrations. It provides access to 300+ curated models. Atlas Cloud recently launched a coding plan discount, offering developers a more cost‑effective API budget.

🤝 Become a Sponsor

Sponsors can showcase their products here, gaining daily exposure.

Contact:

WeChat: relakkes
Email: [email protected]

☕ Buy the Author a Coffee

If this project helps you, feel free to support me. Every bit of support fuels my continued development ❤️

📚 Other Resources

FAQ: MediaCrawler Full Documentation
Crawler Tutorial: CrawlerTutorial Free Course
News Crawler Open Source: NewsCrawlerCollection

⭐ Star History

If this project helps you, please ⭐ Star it — help more people find MediaCrawler!

📚 References

Xiaohongshu signature repo: Cloxl/xhshow
Xiaohongshu client: ReaJason/xhs
SMS forwarding: SmsForwarder
Intranet penetration tool: ngrok docs

Disclaimer

1. Project Purpose and Nature

This project (hereinafter referred to as “the Project”) is created as a technical research and learning tool, aiming to explore and study web data scraping techniques. The Project focuses on scraping technology for self-media platforms, intended for technical exchange among learners and researchers.

2. Legal Compliance Statement

The developer of this Project (hereinafter referred to as “the Developer”) reminds users to strictly comply with the relevant laws and regulations of the People’s Republic of China, including but not limited to the Cybersecurity Law, the Anti‑Espionage Law, and all other applicable national laws and policies, when downloading, installing, and using the Project. Users assume all legal responsibilities that may arise from using the Project.

3. Restrictions on Use

The Project is strictly prohibited from being used for any illegal purpose or for any non‑learning, non‑research commercial activities. The Project must not be used for any form of illegal intrusion into others’ computer systems, nor for any infringement of others’ intellectual property rights or other legitimate rights. Users must ensure that the purpose of using the Project is solely for personal learning and technical research, and not for any illegal activities.

4. Disclaimer

The Developer has made every effort to ensure the legitimacy and safety of the Project but assumes no liability for any direct or indirect losses caused by users’ use of the Project, including but not limited to data loss, device damage, or legal proceedings.

5. Intellectual Property Statement

The intellectual property rights of the Project belong to the Developer. The Project is protected by copyright law, international copyright treaties, and other intellectual property laws and treaties. Users may download and use the Project provided they comply with this disclaimer and relevant laws and regulations.

6. Final Interpretation

The Developer reserves the right of final interpretation of this disclaimer. The Developer reserves the right to change or update this disclaimer at any time without prior notice.
