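@quanruzhuoxiu: I'm often asked: what is the difference between Midscene and Browser-Use? Both are open source, both use vision, and each solves the problem it was built to solve. What follows is an honest comparison, not a dig at Browser-Use. Browser-Use is a web agent, positioned as "open a browser and get this…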
Summary
A comparison of Midscene and Browser-Use, two open-source tools with different focuses: Browser-Use is a web agent for one-time tasks, while Midscene is a vision SDK designed for reliable multi-platform repeated execution.
Cached at: 2026/06/02 05:56
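I'm often asked: what is the difference between Midscene and Browser-Use?
Both are open source, both use vision, and each solves the problem it was built to solve. What follows is an honest comparison, not a dig at Browser-Use.
Browser-Use is a web agent. Its pitch is "open a browser and get this done": a single run of autonomous exploration. Midscene is a vision SDK. Its pitch is "scripts that run repeatedly and must work reliably on Web, iOS, Android, HarmonyOS, and desktop apps": repeated execution plus multi-platform support.
Different problems, different tools.
Where Browser-Use is strong: prototyping a web agent in 10 lines, demos, and one-off research-style tasks. Where it hits a wall: scripts that run 1,000 times in CI become brittle, and it supports neither mobile nor native desktop apps.
Where Midscene is strong: long-lived E2E tests that survive UI changes, a single script for Web and native, and cache replay that brings the cost of reruns close to zero. Where we are weaker: free-form exploratory web tasks are less agentic than with BU, and BU's planning loop is more aggressive.
Different tools solve different problems.
- You want an agent that runs once → Browser-Use is the better fit
- You want the same thing to run reliably 1,000 times → try Midscene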
→ http://github.com/web-infra-dev/midscene…
web-infra-dev/midscene
Source: https://github.com/web-infra-dev/midscene
Midscene.js
AI-powered, vision-driven UI automation for every platform.
📣 Midscene Skills is here!
Use Midscene Skills to control any platform with OpenClaw
Showcases
- Web Automation - Automatically register the GitHub form in a web browser and pass all field validations
- iOS Automation - Meituan coffee order
- iOS Automation - Auto-like the first @midscene_ai tweet
- Android Automation - DCar: Xiaomi SU7 specs
- Android Automation - Booking a hotel for Christmas
- MCP Integration - Midscene MCP UI prepatch release
- robotic arm + vision + voice for in-vehicle testing
💡 Features
Write Automation with Natural Language
- Describe your goals and steps, and Midscene will plan and operate the user interface for you.
- Use the JavaScript SDK or YAML to write your automation script.
Web & Mobile App & Any Interface
- Web Automation: Integrate with Puppeteer or Playwright, or use Bridge Mode to control your desktop browser.
- Android Automation: Use the JavaScript SDK with adb to control your local Android device.
- iOS Automation: Use the JavaScript SDK with WebDriverAgent to control your local iOS devices and simulators.
- Any Interface Automation: Use the JavaScript SDK to control your own interface.
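The platform list above suggests that one script can be written against natural-language prompts and then handed to whichever platform agent is available. The following is a minimal sketch of that idea under stated assumptions: `createStubAgent` and `addToCart` are hypothetical names, the stub only records calls and never touches the Midscene SDK, `aiWaitFor` and `aiAssert` are method names taken from this README's utility API list, and `aiTap` is an assumed name for an interaction method. The single-prompt signatures are also an assumption.

```javascript
// Hypothetical stand-in for a platform agent. A real project would construct
// an agent from the Midscene SDK instead; nothing here calls Midscene itself.
function createStubAgent(platform) {
  const calls = [];
  // Each method is async, like a model-backed call, and records what it was asked.
  const record = (method) => async (prompt) => {
    calls.push(`${platform}:${method}:${prompt}`);
  };
  return {
    calls,
    aiTap: record('aiTap'),         // assumed interaction method
    aiWaitFor: record('aiWaitFor'), // name taken from the utility API list
    aiAssert: record('aiAssert'),   // name taken from the utility API list
  };
}

// The script depends only on natural-language prompts, not on selectors,
// so the same function can be handed a web, Android, or iOS agent.
async function addToCart(agent) {
  await agent.aiWaitFor('the product list is visible');
  await agent.aiTap('the "Add to cart" button of the first product');
  await agent.aiAssert('the cart badge shows 1 item');
  return agent.calls;
}

// Run the identical script against three stubbed platforms.
const runs = Promise.all(
  ['web', 'android', 'ios'].map((p) => addToCart(createStubAgent(p))),
);
```

Because the script body never mentions a selector or a platform-specific handle, swapping the stub for a real agent would, under these assumptions, be the only change needed per platform.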
For Developers
- Three kinds of APIs:
  - Interaction API: interact with the user interface.
  - Data Extraction API: extract data from the user interface and DOM.
  - Utility API: utility functions like aiAssert(), aiLocate(), aiWaitFor().
- MCP: Midscene provides MCP services that expose atomic Midscene Agent actions as MCP tools so upper-layer agents can inspect and operate UIs with natural language. Docs
- Caching for Efficiency: Replay your script with cache and get the result faster.
- Debugging Experience: Midscene.js offers a visual replay report file, a built-in playground, and a Chrome Extension to simplify the debugging process. These are the tools most developers truly need.
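The caching bullet above can be illustrated with a small sketch of the general idea: the expensive step (asking a model to turn a prompt into UI actions) runs once per distinct prompt, and later replays reuse the stored result. This is an illustration only, not Midscene's cache implementation. `PlanCache`, `planWithModel`, and `fakePlanner` are hypothetical names, and the planner is synchronous here purely to keep the sketch short, whereas a real model call would be asynchronous.

```javascript
// Minimal sketch of plan caching keyed by prompt text.
class PlanCache {
  constructor(planWithModel) {
    this.planWithModel = planWithModel; // hypothetical model-backed planner
    this.entries = new Map();           // prompt -> planned actions
    this.modelCalls = 0;                // counts how often the planner ran
  }

  plan(prompt) {
    if (this.entries.has(prompt)) {
      return this.entries.get(prompt); // cache hit: no model call
    }
    this.modelCalls += 1;
    const actions = this.planWithModel(prompt);
    this.entries.set(prompt, actions);
    return actions;
  }
}

// Stub planner standing in for a vision-language model call.
const fakePlanner = (prompt) => [{ type: 'tap', target: prompt }];

const cache = new PlanCache(fakePlanner);
const firstRun = cache.plan('the login button');  // planner runs
const secondRun = cache.plan('the login button'); // served from the cache
```

A production cache would also need to be persisted between runs and invalidated when the UI changes enough that a stored plan no longer applies. How Midscene handles either is not described in this section.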
👉 Zero-code Quick Experience
- Chrome Extension: Start in-browser experience immediately through the Chrome Extension, without writing any code.
- Android Playground: There is also a built-in Android playground to control your local Android device.
- iOS Playground: There is also a built-in iOS playground to control your local iOS device.
✨ Driven by Visual Language Model
Midscene.js is all-in on the pure-vision route for UI actions: element localization and interactions are based on screenshots only. It supports visual-language models like Qwen3-VL, Doubao-1.6-vision, gemini-3-pro, and UI-TARS. For data extraction and page understanding, you can still opt in to include DOM when needed.
- Pure-vision localization for UI actions; the DOM extraction mode is removed.
- Works across web, mobile, desktop, and even <canvas> surfaces.
- Far fewer tokens by skipping DOM for actions, which cuts cost and speeds up runs.
- DOM can still be included for data extraction and page understanding when needed.
- Strong open-source options for self-hosting.
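Localizing from screenshots alone implies a coordinate step: the model points at a region of the image, and the automation layer turns that region into a click or tap. Below is a minimal sketch of that conversion, assuming the model returns a bounding box in screenshot pixels and the screenshot was captured at a known device pixel ratio. The box shape and the `tapPointFromBox` name are hypothetical and do not reflect Midscene's internal format.

```javascript
// Convert a bounding box reported in screenshot pixels into a tap point in
// logical (CSS or device-independent) coordinates.
function tapPointFromBox(box, pixelRatio = 1) {
  const { left, top, width, height } = box;
  if (width <= 0 || height <= 0 || pixelRatio <= 0) {
    throw new RangeError('box size and pixelRatio must be positive');
  }
  // Aim at the center of the box, then scale from physical to logical pixels.
  return {
    x: (left + width / 2) / pixelRatio,
    y: (top + height / 2) / pixelRatio,
  };
}

// A box found in a screenshot taken at 2x device pixel ratio.
const box = { left: 200, top: 100, width: 120, height: 40 };
const point = tapPointFromBox(box, 2); // { x: 130, y: 60 }
```

Targeting the center keeps the tap inside the element even when the reported box is slightly off at the edges, which is one reason this convention is common in screenshot-driven automation.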
Read more about Model Strategy
📄 Resources
- Official Website: https://midscenejs.com
- Documentation: https://midscenejs.com
- Sample Projects: https://github.com/web-infra-dev/midscene-example
- API Reference: https://midscenejs.com/api
- GitHub: https://github.com/web-infra-dev/midscene
🤝 Community
🌟 Awesome Midscene
Community projects that extend Midscene.js capabilities:
- midscene-ios - iOS Mirror automation support for Midscene
- midscene-pc - PC operation device for Windows, macOS, and Linux
- midscene-pc-docker - Docker image with Midscene-PC server pre-installed
- Midscene-Python - Python SDK for Midscene automation
- midscene-java by @Master-Frank - Java SDK for Midscene automation
- midscene-java by @alstafeev - Java SDK for Midscene automation
📝 Credits
We would like to thank the following projects:
- Rsbuild and Rslib for the build tool.
- UI-TARS for the open-source agent model UI-TARS.
- Qwen-VL for the open-source VL model Qwen-VL.
- scrcpy and yume-chan for allowing us to control Android devices from the browser.
- appium-adb for the JavaScript bridge of adb.
- appium-webdriveragent for operating XCTest from JavaScript.
- YADB for the yadb tool which improves the performance of text input.
- libnut-core for the cross-platform native keyboard and mouse control.
- Puppeteer for browser automation and control.
- Playwright for browser automation, control, and testing.
📖 Citation
If you use Midscene.js in your research or project, please cite:
@software{Midscene.js,
  author = {Xiao Zhou and Tao Yu and YiBing Lin},
  title = {Midscene.js: Your AI Operator for Web, Android, iOS, Automation & Testing.},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/web-infra-dev/midscene}
}
📝 License
Midscene.js is MIT licensed.