@CycleDecoded: Stop paying a fortune for crappy crawler software and paying the IQ tax! This open-source tool is insane — it directly exposes all social media platforms' data. The game for traffic matrix and data monetization players is over! Meet MediaCrawler, a GitHub project with over 54,000 stars. In plain language…
Summary
MediaCrawler is an open-source multi-platform social media crawler tool with over 54,000 stars on GitHub. It supports data collection from 7 major platforms including Xiaohongshu, Douyin, Bilibili, etc., and features multi-account, IP proxy, breakpoint resume, and AI integration.
Cached at: 06/30/26, 03:45 PM
Stop wasting money on overpriced scraping tools — that’s just paying the idiot tax. There’s a new open-source tool that’s absolutely insane: it rips the data from every major social platform, putting the entire traffic matrix and data monetization game at serious risk. Meet MediaCrawler, the superstar open-source project on GitHub with over 54,000 stars. In plain English, it’s a cross-platform data pump. Whether it’s Xiaohongshu’s viral posts, Douyin’s short videos, or the hot comments from Bilibili, Weibo, Kuaishou, Zhihu, or Tieba — one click and it runs locally, completely free and open-source with no tricks.
Its latest iteration packs some killer features that anyone in Web3, data analysis, or side-hustle traffic generation will immediately appreciate:
- Universal data harvesting: Covers 7 major content platforms. Batch-finding trending posts and stalking competitor accounts is a total cheat code.
- Multi‑account + IP proxy pool: Environment isolation beats anti‑scraping measures. Silently siphon data without detection.
- Resume capability: Network drops mid‑crawl? Never again. Pause and resume anytime.
- AI Agent integration: Already seamlessly connected with Cursor, Claude, etc. Let the AI do the work for you.
Warning: This tool is powerful. Use it for side-project trend analysis or competitor research. If you know, you know — don’t cross the line.
GitHub repo: https://github.com/NanmiCoder/MediaCrawler…
NanmiCoder/MediaCrawler
Source: https://github.com/NanmiCoder/MediaCrawler
🔥 MediaCrawler - Self-Media Platform Crawler 🕷️
Disclaimer:
Please use this repository for learning purposes only ⚠️⚠️⚠️⚠️. For cases of illegal web scraping, see (https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China)
All content in this repository is for learning and reference only. Commercial use is prohibited. No individual or organization may use the content for illegal purposes or infringe upon others’ legitimate rights. The scraping techniques covered are for study and research only, and should not be used for large-scale scraping or other illegal activities on other platforms. The repository assumes no responsibility for any legal consequences arising from the use of its content. By using this repository, you agree to all terms and conditions of this disclaimer.
📖 Project Introduction
A powerful multi-platform self-media data collection tool that supports scraping public information from Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, Zhihu, and other mainstream platforms.
🔧 Technical Principles
- Core Technology: Built on the Playwright (https://playwright.dev/) browser automation framework, which logs in to each platform and saves the login state.
- No JS reverse engineering: Uses the browser context with preserved login state to obtain signature parameters via JS expressions.
- Advantages: No need to reverse-engineer complex encryption algorithms, significantly lowering the technical barrier.
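The idea in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the logged-in page exposes some signing function that can be called through a JS expression. The name `signRequest`, the URI, and the payload are hypothetical, and `FakePage` only stands in for a Playwright `Page` (whose `evaluate` method runs a JS expression in the browser) so the sketch runs without a browser.

```python
import json


def build_sign_expression(function_name: str, *args) -> str:
    """Build a JS expression that calls a page-level function with JSON-encoded args."""
    encoded = ", ".join(json.dumps(arg, ensure_ascii=False) for arg in args)
    return f"window.{function_name}({encoded})"


def fetch_signature(page, uri: str, payload: dict):
    """Ask the logged-in page to compute signature parameters for a request.

    `page` is expected to behave like a Playwright Page: `page.evaluate(expr)`
    runs the expression in the browser and returns the result.
    `signRequest` is a hypothetical name used only for illustration.
    """
    expression = build_sign_expression("signRequest", uri, payload)
    return page.evaluate(expression)


class FakePage:
    """Stand-in for a browser page so the sketch runs without a browser."""

    def __init__(self):
        self.seen = []

    def evaluate(self, expression: str):
        # Record the expression and return a canned result.
        self.seen.append(expression)
        return {"x-sign": "demo"}
```

Because the page's own JavaScript does the signing, the Python side never needs to know how the signature is computed, which is the point the "Advantages" bullet makes.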
✨ Feature Overview
| Platform | Keyword Search | Scrape by Post ID | Sub‑comments | Scrape Creator Profile | Login State Cache | IP Proxy Pool | Generate Comment Word Cloud |
|---|---|---|---|---|---|---|---|
| Xiaohongshu | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Douyin | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Kuaishou | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Bilibili | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Weibo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Tieba | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Zhihu | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
🚀 MediaCrawlerPro – Major Release!
Open source is not easy — please subscribe to support us.
Focus on learning mature project architecture design. It’s not just about scraping; the code design of the Pro version is worth deep study!
MediaCrawlerPro (https://github.com/MediaCrawlerPro)
🎯 Core Feature Upgrades
- ✅ Self-Media Content Extraction Agent (new feature)
- ✅ Resume Capability (key feature)
- ✅ Multi‑account + IP Proxy Pool (key feature)
- ✅ Removed Playwright dependency – simpler to use
- ✅ Full Linux environment support
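The source does not describe how resume capability is implemented. One common way to get this behaviour, shown here purely as an assumed sketch, is to persist the IDs already processed and skip them on the next run. The `Checkpoint` class, the file layout, and the `handler` callback are all hypothetical.

```python
import json
import os
import tempfile


class Checkpoint:
    """Persist the IDs that have already been processed to a JSON file."""

    def __init__(self, path: str):
        self.path = path
        self.done = set()
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                self.done = set(json.load(fh))

    def mark_done(self, item_id: str) -> None:
        self.done.add(item_id)
        # Write to a temp file first, then rename, so an interruption
        # cannot leave a half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(sorted(self.done), fh)
        os.replace(tmp_path, self.path)


def process_pending(item_ids, checkpoint, handler):
    """Run handler on every ID not yet in the checkpoint; return the IDs handled."""
    handled = []
    for item_id in item_ids:
        if item_id in checkpoint.done:
            continue
        handler(item_id)
        checkpoint.mark_done(item_id)
        handled.append(item_id)
    return handled
```

If `handler` raises partway through (for example on a network drop), the IDs completed so far are already on disk, so a later run with a fresh `Checkpoint` on the same path picks up at the first unfinished item.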
🏗️ Architecture Design Optimizations
- ✅ Code refactored & optimised – more readable and maintainable (decoupled JS signature logic)
- ✅ Enterprise‑grade code quality – suitable for building large‑scale scraping projects
- ✅ Perfect architecture design – high extensibility, great for learning source code
🎁 Additional Features
- ✅ Self-Media Video Downloader (Desktop) – ideal for learning full‑stack development
- ✅ Multi‑platform Home Feed recommendation (HomeFeed)
- ✅ AI Agent Skill support – One‑click install with OpenClaw (https://openclaw.ai/) 🦞 / Claude Code / Cursor. Let the Agent automatically scrape data.
- Comment Analysis AI Agent – under development 🚀🚀
MediaCrawlerPro Project Homepage (https://github.com/MediaCrawlerPro): more details
🚀 Quick Start
💡 If this project helps you, please give it a ⭐ Star!
📋 Prerequisites
🚀 Install uv (recommended)
Before proceeding, make sure you have uv installed:
- Installation: uv official installation guide
- Verify: Run uv --version in your terminal. A version number means success.
- Why uv: The fastest Python package manager, with speedy and accurate dependency resolution.
🟢 Node.js Installation
The project depends on Node.js. Download from the official site:
- Download: https://nodejs.org/en/download/
- Version requirement: >= 16.0.0
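Both prerequisites (uv on the PATH, Node.js at 16.0.0 or newer) can be checked from a small script before installing anything else. The helper names below are illustrative and not part of the project; the sketch only assumes that each tool prints its version in response to `--version`.

```python
import re
import shutil
import subprocess


def parse_version(text: str):
    """Extract the first dotted version from text, e.g. 'v18.17.0' -> (18, 17, 0)."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part) if part else 0 for part in match.groups())


def tool_version(command: str):
    """Return the parsed version printed by `<command> --version`, or None if unavailable."""
    executable = shutil.which(command)
    if executable is None:
        return None
    result = subprocess.run([executable, "--version"], capture_output=True, text=True)
    return parse_version(result.stdout or result.stderr)


def meets_minimum(version, minimum) -> bool:
    """True when a parsed version tuple is at least the required minimum."""
    return version is not None and version >= minimum
```

For example, `tool_version("uv") is not None` reports whether uv is installed, and `meets_minimum(tool_version("node"), (16, 0, 0))` reports whether the installed Node.js satisfies the version requirement.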
📦 Install Python Packages
# Enter the project directory
cd MediaCrawler
# Use uv sync to ensure consistent Python version and dependencies
uv sync
🌐 Browser Driver Installation (optional)
If you use the default CDP mode (connecting to an existing Chrome browser), you do not need to install browser drivers. They are only required in standard Playwright mode.
# Only needed for standard Playwright mode
uv run playwright install
🌍 Chrome Browser Configuration (recommended)
The project uses CDP mode by default to connect to your existing Chrome browser, reusing login state, cookies, extensions, etc. This significantly reduces the risk of platform detection.
Before using:
- Install the latest Chrome (version >= 144) – Download
- Enable remote debugging: In Chrome’s address bar, enter chrome://inspect/#remote-debugging and check “Allow remote debugging for this browser instance”.
- The page should show Server running at: 127.0.0.1:9222, which means it is ready to go.
💡 Tip: After starting the crawler, Chrome will display a confirmation dialog — click “Accept”. The program waits up to 60 seconds for you to confirm.
If you don’t want to use CDP mode, set ENABLE_CDP_MODE = False in config/base_config.py to switch to standard Playwright mode.
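Before starting the crawler in CDP mode, it can help to confirm that Chrome's debugging server is actually listening. The sketch below is an assumed convenience check, not part of the project: it only tests whether something accepts TCP connections on 127.0.0.1:9222 (the address shown on the Chrome page above) and says nothing about whether the connection will be approved in Chrome's confirmation dialog.

```python
import socket


def debug_port_open(host: str = "127.0.0.1", port: int = 9222, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If `debug_port_open()` returns False, remote debugging is probably not enabled yet, or Chrome is listening on a different port.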
🚀 Run the Crawler
# Check config/base_config.py for configuration options (comments are in Chinese)
# Read keywords from config, search posts, and scrape post info + comments
uv run main.py --platform xhs --lt qrcode --type search
# Read a list of post IDs from config and scrape specific posts + comments
uv run main.py --platform xhs --lt qrcode --type detail
# Scan QR code with the corresponding app to log in
# For other platforms, see:
uv run main.py --help
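When the same commands need to be launched from a script, for example to run a search crawl followed by a detail crawl, they can be assembled programmatically. This is a small sketch using only the flags shown above; `build_command` and `run_crawl` are hypothetical helpers, and the `project_dir` default assumes the repository was cloned into a directory named MediaCrawler.

```python
import subprocess


def build_command(platform: str, login_type: str, crawl_type: str) -> list:
    """Assemble the argument list for one crawler run."""
    return [
        "uv", "run", "main.py",
        "--platform", platform,
        "--lt", login_type,
        "--type", crawl_type,
    ]


def run_crawl(platform: str, login_type: str, crawl_type: str,
              project_dir: str = "MediaCrawler") -> int:
    """Run one crawl from the project directory and return its exit code."""
    completed = subprocess.run(
        build_command(platform, login_type, crawl_type),
        cwd=project_dir,
    )
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting issues. Valid values for platforms other than xhs are not listed here; they come from `uv run main.py --help`.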
🖥️ WebUI Visual Interface
MediaCrawler provides a web‑based visual interface — no command line needed.
Start WebUI
# Start the API server (default port 8080)
uv run uvicorn api.main:app --port 8080 --reload
# Or launch it as a module
uv run python -m api.main
After startup, visit http://localhost:8080 to open the WebUI.
WebUI Features
- Visual configuration of crawling parameters (platform, login method, crawl type, etc.)
- Real‑time monitoring of crawl status and logs
- Data preview and export
🔗 Using Python’s native venv (not recommended)
Create and Activate a Python Virtual Environment
For Douyin and Zhihu, Node.js (version >= 16) must be installed.
# Enter the project root
cd MediaCrawler
# Create virtual environment
# My Python version is 3.11 — requirements.txt is based on that.
# If you use a different Python version, dependencies may be incompatible — solve manually.
python -m venv venv
# macOS & Linux: activate
source venv/bin/activate
# Windows: activate
venv\Scripts\activate
Install Dependencies
pip install -r requirements.txt
Install Playwright Browser Drivers
playwright install
Run the Crawler (native environment)
# By default, comment scraping is disabled. To enable, modify ENABLE_GET_COMMENTS in config/base_config.py.
# Other options are documented with Chinese comments in config/base_config.py.
# Read keywords from config, search posts, and scrape post info + comments
python main.py --platform xhs --lt qrcode --type search
# Read a list of post IDs from config and scrape specific posts + comments
python main.py --platform xhs --lt qrcode --type detail
# Scan QR code with the corresponding app to log in
# For other platforms, see:
python main.py --help
💾 Data Storage
MediaCrawler supports multiple storage formats: CSV, JSON, JSONL, Excel, SQLite, and MySQL.
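To give a feel for how two of these formats differ in practice, here is a standard-library sketch that writes the same records to JSONL and to SQLite. It is not the project's storage layer: the record fields (`comment_id`, `content`, `like_count`) and the `comments` table are assumptions made for illustration.

```python
import json
import sqlite3


def write_jsonl(records, path: str) -> None:
    """Write one JSON object per line; ensure_ascii=False keeps Chinese text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def write_sqlite(records, path: str) -> None:
    """Upsert records into a `comments` table keyed by comment_id."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS comments ("
                "comment_id TEXT PRIMARY KEY, content TEXT, like_count INTEGER)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO comments "
                "VALUES (:comment_id, :content, :like_count)",
                records,
            )
    finally:
        conn.close()
```

JSONL is append-friendly and easy to stream, while the SQLite primary key means re-running a crawl overwrites existing rows instead of duplicating them.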
📖 Detailed instructions: Data Storage Guide
🚀 MediaCrawlerPro – Major Release 🚀!
More features, better architecture. Open source is not easy — please subscribe to support! (https://github.com/MediaCrawlerPro)
💬 Community & Groups
- WeChat Group: Click to join
- Bilibili Account: Follow me – sharing AI & scraping tech knowledge
💰 Sponsors
TikHub.io provides 900+ highly stable data APIs covering 14+ domestic and international platforms (TK, DY, XHS, Y2B, Ins, X, etc.). Supports public data APIs for users, content, products, comments, etc., along with 40M+ pre‑cleaned structured datasets. Use invitation code cfzyejV9 when registering and top up to receive an extra $2 credit.
Atlas Cloud is a full‑modal AI inference platform that gives developers a unified AI API for video generation, image generation, and LLMs, eliminating the need to maintain multiple vendor integrations. It provides access to 300+ curated models. Atlas Cloud recently launched a coding plan discount, offering developers a more cost‑effective API budget.
🤝 Become a Sponsor
Sponsors can showcase their products here, gaining daily exposure.
Contact:
- WeChat: relakkes
- Email: [email protected]
☕ Buy the Author a Coffee
If this project helps you, feel free to support me. Every bit of support fuels my continued development ❤️
📚 Other Resources
- FAQ: MediaCrawler Full Documentation
- Crawler Tutorial: CrawlerTutorial Free Course
- News Crawler Open Source: NewsCrawlerCollection
⭐ Star History
If this project helps you, please ⭐ Star it — help more people find MediaCrawler!
📚 References
- Xiaohongshu signature repo: Cloxl/xhshow
- Xiaohongshu client: ReaJason/xhs
- SMS forwarding: SmsForwarder
- Intranet penetration tool: ngrok docs
Disclaimer
1. Project Purpose and Nature
This project (hereinafter referred to as “the Project”) is created as a technical research and learning tool, aiming to explore and study web data scraping techniques. The Project focuses on scraping technology for self-media platforms, intended for technical exchange among learners and researchers.
2. Legal Compliance Statement
The developer of this Project (hereinafter referred to as “the Developer”) reminds users to strictly comply with the relevant laws and regulations of the People’s Republic of China, including but not limited to the Cybersecurity Law, the Anti‑Espionage Law, and all other applicable national laws and policies, when downloading, installing, and using the Project. Users assume all legal responsibilities that may arise from using the Project.
3. Restrictions on Use
The Project is strictly prohibited from being used for any illegal purpose or for any non‑learning, non‑research commercial activities. The Project must not be used for any form of illegal intrusion into others’ computer systems, nor for any infringement of others’ intellectual property rights or other legitimate rights. Users must ensure that the purpose of using the Project is solely for personal learning and technical research, and not for any illegal activities.
4. Disclaimer
The Developer has made every effort to ensure the legitimacy and safety of the Project but assumes no liability for any direct or indirect losses caused by users’ use of the Project, including but not limited to data loss, device damage, or legal proceedings.
5. Intellectual Property Statement
The intellectual property rights of the Project belong to the Developer. The Project is protected by copyright law, international copyright treaties, and other intellectual property laws and treaties. Users may download and use the Project provided they comply with this disclaimer and relevant laws and regulations.
6. Final Interpretation
The Developer reserves the right of final interpretation of this disclaimer. The Developer reserves the right to change or update this disclaimer at any time without prior notice.
Similar Articles
@NFTCPS: Finally found out where those repost accounts on X get their content! It's this tool MediaCrawler, a single tool that covers Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu. It can scrape public content, comments, likes, and reposts. The best part is it doesn't need JS reverse engineering—it uses browser login state to get signatures directly, …
MediaCrawler is a multi-platform social media data scraping tool that supports public content crawling from Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu. It bypasses JS reverse engineering by leveraging browser login state, lowering the technical barrier.
@WY_mask: MediaCrawler: Open-source web scraping tool for Xiaohongshu, Douyin, Weibo, Bilibili, Kuaishou. Supports scraping videos, images, comments, likes, reposts, etc. https://github.com/NanmiCoder/MediaCrawler…
MediaCrawler is an open-source multi-platform self-media data collection tool that supports scraping public information from Xiaohongshu, Douyin, Weibo, Bilibili, Kuaishou and other platforms. No JS reverse engineering required, based on Playwright browser automation.
@axichuhai: Folks, this open-source project is like having a god's-eye view, boosting web scraping efficiency tens of times over. It has topped GitHub trending with 50k+ stars. No more writing code, maintaining selectors, or dealing with anti-scraping measures. Just drop in a URL, zero-code, naturally bypass blocks, no need to maintain selectors...
This open-source project can scrape web data with zero code, bypass anti-scraping mechanisms, boost efficiency tens of times, and has earned 50k+ stars.
NanmiCoder/MediaCrawler
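MediaCrawler is an open-source multi-platform self-media data collection tool that supports scraping public information from Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, Zhihu, and other mainstream platforms. It is built on Playwright browser automation and requires no JS reverse engineering.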
@CycleDecoded: Bro, are you still foolishly manually copying and pasting articles? How can you grab traffic across the internet with such low efficiency? Today I uncovered an open-source tool that can skyrocket your efficiency tenfold—a must-have for running a self-media matrix! The GitHub project quietly going viral: Wechatsync (WeChat Official Account Sync Assistant), designed to cure the pain of multi-platform distribution...
Wechatsync is a free and open-source browser extension and CLI tool that supports one-click syncing of WeChat Official Account articles to 29+ self-media platforms, greatly improving content distribution efficiency.