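@quanruzhuoxiu: A question I get a lot: what is the difference between Midscene and Browser-Use? Both are open source, both use vision, and each solves the problem it was built to solve. What follows is an honest comparison, not a knock on Browser-Use. Browser-Use is a web agent, positioned as "open a browser and get this…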

X AI KOLs Timeline 2026/06/01 14:01 Tools

open-source web-automation vision-sdk e2e-testing ui-automation midscene browser-use

Summary

A comparison of Midscene and Browser-Use, two open-source tools with different focuses: Browser-Use is a web agent for one-time tasks, while Midscene is a vision SDK designed for reliable multi-platform repeated execution.



Cached at: 2026/06/02 05:56

A question I get a lot: what is the difference between Midscene and Browser-Use?

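Both are open source, both use vision, and each solves the problem it was built to solve. What follows is an honest comparison, not a knock on Browser-Use.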

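Browser-Use is a web agent. Its positioning is "open a browser and get this thing done": one-off autonomous exploration. Midscene is a vision SDK. Its positioning is "scripts that run repeatedly and must work reliably across Web, iOS, Android, HarmonyOS, and desktop apps": repeated execution plus multi-platform support.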

Different problems, different tools.

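Where Browser-Use is strong: prototyping a web agent in 10 lines, demos, and one-off research tasks. Where it hits a wall: scripts that run 1,000 times in CI become brittle, and it supports neither mobile nor native desktop apps.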

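Where Midscene is strong: long-lived E2E tests that survive UI changes, one set of scripts for Web and native, and cached replay that brings the cost of reruns close to zero. Where we are weaker: free-form exploratory web tasks are less agentic than with BU, and BU's planning loop is more aggressive.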

Different tools solve different problems.

You want "an agent that runs once" → Browser-Use is the better fit
You want "the same thing to run reliably 1,000 times" → try Midscene

→ http://github.com/web-infra-dev/midscene…

web-infra-dev/midscene

Source: https://github.com/web-infra-dev/midscene

Midscene.js


Official Website: https://midscenejs.com/

AI-powered, vision-driven UI automation for every platform.

📣 Midscene Skills is here!

Use Midscene Skills to control any platform with OpenClaw


💡 Features

Write Automation with Natural Language

Describe your goals and steps, and Midscene will plan and operate the user interface for you.
Use the JavaScript SDK or YAML to write your automation script.

Web & Mobile App & Any Interface

Web Automation: Integrate with Puppeteer or Playwright, or use Bridge Mode to control your desktop browser.
Android Automation: Use the JavaScript SDK with adb to control your local Android device.
iOS Automation: Use the JavaScript SDK with WebDriverAgent to control your local iOS devices and simulators.
Any Interface Automation: Use the JavaScript SDK to control your own interface.

For Developers

Three kinds of APIs:
- Interaction API: interact with the user interface.
- Data Extraction API: extract data from the user interface and the DOM.
- Utility API: utility functions like aiAssert(), aiLocate(), aiWaitFor().
MCP: Midscene provides MCP services that expose atomic Midscene Agent actions as MCP tools, so upper-layer agents can inspect and operate UIs with natural language.
Caching for Efficiency: Replay your script with cache and get the result faster.
Debugging Experience: Midscene.js offers a visual replay report file, a built-in playground, and a Chrome Extension to simplify debugging. These are the tools most developers truly need.
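The caching idea above can be illustrated with a minimal sketch. This is not Midscene's implementation: `PlanCache`, `planWithModel`, `runScript`, and the `screenId` key are hypothetical names chosen for illustration, and the model call is a synchronous stand-in (a real one would be asynchronous and would receive a screenshot). The point is only that a repeated run of the same steps can skip the model when a plan for the same instruction and screen has already been stored.

```typescript
// Hypothetical sketch of plan caching; not Midscene's actual code or API.
type Plan = { action: "tap"; x: number; y: number };

// Stand-in for a vision-language model call. It counts invocations so the
// effect of replaying from cache is observable.
let modelCalls = 0;
function planWithModel(instruction: string, screenId: string): Plan {
  modelCalls += 1;
  // Fake, deterministic coordinates; a real planner would analyze a screenshot.
  return { action: "tap", x: instruction.length * 10, y: screenId.length * 10 };
}

class PlanCache {
  private store = new Map<string, Plan>();

  // Look up a stored plan for (instruction, screen); call the model on a miss.
  resolve(instruction: string, screenId: string): { plan: Plan; cached: boolean } {
    const key = JSON.stringify([instruction, screenId]);
    const hit = this.store.get(key);
    if (hit !== undefined) {
      return { plan: hit, cached: true };
    }
    const plan = planWithModel(instruction, screenId);
    this.store.set(key, plan);
    return { plan, cached: false };
  }
}

// Run a fixed two-step script and report which steps were served from cache.
function runScript(cache: PlanCache): boolean[] {
  const steps = ["click the login button", "click the submit button"];
  return steps.map((step) => cache.resolve(step, "login-page-v1").cached);
}
```

Calling `runScript` twice with the same `PlanCache` invokes the stand-in model only during the first run. A production cache would also need a way to detect that the UI has changed and fall back to the model; that part is deliberately left out of this sketch.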

👉 Zero-code Quick Experience

Chrome Extension: Start in-browser experience immediately through the Chrome Extension, without writing any code.
Android Playground: There is also a built-in Android playground to control your local Android device.
iOS Playground: There is also a built-in iOS playground to control your local iOS device.

✨ Driven by Visual Language Model

Midscene.js is all-in on the pure-vision route for UI actions: element localization and interactions are based on screenshots only. It supports visual-language models like Qwen3-VL, Doubao-1.6-vision, gemini-3-pro, and UI-TARS. For data extraction and page understanding, you can still opt in to include DOM when needed.

Pure-vision localization for UI actions; the DOM extraction mode is removed.
Works across web, mobile, desktop, and even <canvas> surfaces.
Far fewer tokens by skipping DOM for actions, which cuts cost and speeds up runs.
DOM can still be included for data extraction and page understanding when needed.
Strong open-source options for self-hosting.
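To make the pure-vision idea concrete, here is a small sketch under stated assumptions. The `Box`, `Device`, and `Locator` types and the `tapByDescription` helper are hypothetical and are not part of Midscene's API. The sketch assumes a locator that maps a natural-language description plus a screenshot to a bounding box; the action is then a tap at the box center, which is why the same flow can apply to any surface that can be screenshotted and tapped.

```typescript
// Hypothetical types for illustration; these are not Midscene's real interfaces.
interface Box { left: number; top: number; width: number; height: number }

// Any platform that can receive a tap at a screen coordinate fits this shape.
interface Device {
  name: string;
  tap(x: number, y: number): void;
}

// Stand-in for a vision-language model that returns a bounding box for a
// described element. A real locator would inspect the screenshot pixels.
type Locator = (description: string, screenshot: Uint8Array) => Box;

// Compute the integer center point of a bounding box.
function centerOf(box: Box): { x: number; y: number } {
  return {
    x: Math.round(box.left + box.width / 2),
    y: Math.round(box.top + box.height / 2),
  };
}

// Locate an element from the screenshot only, then tap its center.
function tapByDescription(
  device: Device,
  locate: Locator,
  description: string,
  screenshot: Uint8Array,
): { x: number; y: number } {
  const point = centerOf(locate(description, screenshot));
  device.tap(point.x, point.y);
  return point;
}

// Usage with fake devices that only record the taps they receive.
const taps: string[] = [];
const makeDevice = (name: string): Device => ({
  name,
  tap: (x, y) => { taps.push(`${name}:${x},${y}`); },
});
const fakeLocate: Locator = () => ({ left: 100, top: 40, width: 80, height: 20 });

for (const device of [makeDevice("web"), makeDevice("android")]) {
  tapByDescription(device, fakeLocate, "the Sign in button", new Uint8Array());
}
```

Because nothing in `tapByDescription` touches a DOM or a platform-specific view tree, the only per-platform code in this sketch is the `Device` implementation.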

📄 Resources

Official Website: https://midscenejs.com
Documentation: https://midscenejs.com
Sample Projects: https://github.com/web-infra-dev/midscene-example
API Reference: https://midscenejs.com/api
GitHub: https://github.com/web-infra-dev/midscene

🤝 Community

🌟 Awesome Midscene

Community projects that extend Midscene.js capabilities:

midscene-ios - iOS Mirror automation support for Midscene
midscene-pc - PC operation device for Windows, macOS, and Linux
midscene-pc-docker - Docker image with Midscene-PC server pre-installed
Midscene-Python - Python SDK for Midscene automation
midscene-java by @Master-Frank - Java SDK for Midscene automation
midscene-java by @alstafeev - Java SDK for Midscene automation

📝 Credits

We would like to thank the following projects:

Rsbuild and Rslib for the build tool.
UI-TARS for the open-source agent model UI-TARS.
Qwen-VL for the open-source VL model Qwen-VL.
scrcpy and yume-chan, which allow us to control Android devices from the browser.
appium-adb for the JavaScript bridge to adb.
appium-webdriveragent for operating XCTest from JavaScript.
YADB for the yadb tool which improves the performance of text input.
libnut-core for the cross-platform native keyboard and mouse control.
Puppeteer for browser automation and control.
Playwright for browser automation, control, and testing.

📖 Citation

If you use Midscene.js in your research or project, please cite:

@software{Midscene.js,
  author = {Xiao Zhou and Tao Yu and YiBing Lin},
  title = {Midscene.js: Your AI Operator for Web, Android, iOS, Automation \& Testing.},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/web-infra-dev/midscene}
}


📝 License

Midscene.js is MIT licensed.

If this project helps you or inspires you, please give us a star.

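Related Articles

@quanruzhuoxiu: After two years of building Midscene.js, we made a late but crucial call: UI automation will sooner or later shift from "understanding the DOM" to "looking at the screen", so in the 1.0 release last December we cut the DOM compatibility path outright. Early on, like everyone else, we took a hybrid DOM + vision approach…

X AI KOLs Timeline

The Midscene.js team decided to move from a hybrid DOM + vision approach to purely vision-based UI automation, on the view that future UI automation must be based on screenshots rather than the DOM. The change reduced token consumption and simplified cross-platform adaptation.

@quanruzhuoxiu: With Midscene's Computer Agent, desktop automation can run headless in Linux CI. People assume desktop UI automation requires a real machine or a VM, so Mac/Windows desktop E2E usually runs only locally and never makes it into CI. The result…

X AI KOLs Timeline

Midscene's Computer Agent lets desktop UI automation run headless in Linux CI via xvfb-run, with no real machine or VM required, and supports Electron, Qt, and GTK apps.

@geekbb: A terminal TUI tool written in Rust by the Browser-use team: you tell it what to do in natural language and it controls the browser to get it done. It combines an in-house LLM engine with Chrome's CDP protocol, and supports your logged-in Chrome, a headless browser, or Browser …

X AI KOLs Timeline

The Browser-use team released a terminal TUI tool written in Rust that lets users control the browser with natural language, with support for a logged-in Chrome, a headless browser, or running in the Browser Use cloud.

@quanruzhuoxiu: The design in Midscene.js I am personally proudest of is not the AI part but the HTML replay report. After every script run, it automatically produces a single-file HTML report containing: - a screenshot of every step - the full prompt sent to the model - the JSON the model returned (…

X AI KOLs Timeline

The HTML replay report in Midscene.js helps developers quickly pinpoint why an AI automation run failed, using three artifacts: screenshots, prompts, and model output.

@MingruiZhang: One problem with @browser_use's new Terminal Agent: my context window usage hit 122% https://github.com/browser-use/term…

X AI KOLs Timeline

Browser Use Terminal is a Rust TUI for browser agents that lets users automate browser tasks from the terminal. It ships with a new LLM harness that is 2x cheaper and 2x faster than Browser Harness.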
