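@quanruzhuoxiu: A question I get a lot: what is the difference between Midscene and Browser-Use? Both are open source, both use vision, and each solves the problem it was built for. What follows is an honest comparison, not a knock on Browser-Use. Browser-Use is a web agent, positioned as "open a browser and get this…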

Summary

A comparison of Midscene and Browser-Use, two open-source tools with different focuses: Browser-Use is a web agent for one-time tasks, while Midscene is a vision SDK designed for reliable multi-platform repeated execution.

Cached at: 2026/06/02 05:56

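A question I get a lot: what is the difference between Midscene and Browser-Use?

Both are open source, both use vision, and each solves the problem it was built for. What follows is an honest comparison, not a knock on Browser-Use.

Browser-Use is a web agent. Its positioning is "open a browser and get this done": one-off autonomous exploration. Midscene is a vision SDK. Its positioning is "scripts that run repeatedly and must work reliably across Web, iOS, Android, HarmonyOS, and desktop apps": repeated execution plus multi-platform support.

Different problems, different tools.

Where Browser-Use is strong: prototyping a web agent in 10 lines, demos, and one-off research-style tasks. Where it hits a wall: scripts that run 1,000 times in CI turn brittle, and it supports neither mobile nor native desktop apps.

Where Midscene is strong: long-lived E2E tests that survive UI changes, one set of scripts for Web and native, and cache replay that brings the cost of reruns close to zero. Where we are weaker: for free-form exploratory web tasks we are less agentic than BU, and BU's planning loop is more aggressive.

Different tools solve different problems.

  • You want "an agent that runs once" → Browser-Use is the better fit
  • You want "the same thing run reliably 1,000 times" → try Midscene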

→ http://github.com/web-infra-dev/midscene…


web-infra-dev/midscene

Source: https://github.com/web-infra-dev/midscene

Midscene.js

Official Website: https://midscenejs.com/

AI-powered, vision-driven UI automation for every platform.

📣 Midscene Skills is here!

Use Midscene Skills to control any platform with OpenClaw

Showcases

💡 Features

Write Automation with Natural Language

  • Describe your goals and steps, and Midscene will plan and operate the user interface for you.
  • Use the JavaScript SDK or YAML to write your automation script.

Web & Mobile App & Any Interface

  • Web Automation: Integrate with Puppeteer or Playwright, or use Bridge Mode to control your desktop browser.
  • Android Automation: Use the JavaScript SDK with adb to control your local Android device.
  • iOS Automation: Use the JavaScript SDK with WebDriverAgent to control your local iOS devices and simulators.
  • Any Interface Automation: Use the JavaScript SDK to control your own interface.
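
The idea behind "one script, many platforms" can be illustrated with a small sketch. Everything here is hypothetical and is not Midscene's actual interface: the device contract (`screenshot`, `tap`, `typeText`), the `FakeDevice` class, and the stub `locate` function are assumptions made only to show how a script written against a screenshot-and-input contract can run unchanged on different backends.

```javascript
// Hypothetical device contract: screenshot(), tap(x, y), typeText(text).
// A real backend would wrap a browser page, adb, WebDriverAgent, and so on.
class FakeDevice {
  constructor(name) {
    this.name = name;
    this.log = []; // records every call so the run can be inspected
  }
  screenshot() {
    this.log.push("screenshot");
    return `${this.name}-frame`; // stands in for image bytes
  }
  tap(x, y) {
    this.log.push(`tap(${x},${y})`);
  }
  typeText(text) {
    this.log.push(`type(${text})`);
  }
}

// One script, written against the contract only.
function searchFor(device, locate, query) {
  const frame = device.screenshot();
  // In a vision-driven tool, a model would find the element in the frame.
  const { x, y } = locate(frame, "search box");
  device.tap(x, y);
  device.typeText(query);
}

// Stub locator (assumption): always returns the same point.
const locate = () => ({ x: 120, y: 48 });

const web = new FakeDevice("web");
const android = new FakeDevice("android");
searchFor(web, locate, "headphones");
searchFor(android, locate, "headphones");
```

Because the script only touches the contract, both devices record the same sequence of calls; supporting a new surface means implementing the three methods, not rewriting the script.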

For Developers

  • Three kinds of APIs:
  • MCP: Midscene provides MCP services that expose atomic Midscene Agent actions as MCP tools, so upper-layer agents can inspect and operate UIs with natural language.
  • Caching for Efficiency: Replay your script with cache and get the result faster.
  • Debugging Experience: Midscene.js offers a visual replay report file, a built-in playground, and a Chrome Extension to simplify debugging. These are the tools most developers truly need.
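
To make the caching point concrete, here is a minimal sketch of why replaying from a cache makes reruns cheap. It is not Midscene's cache format or API: `PlanCache`, `fakeModel`, and the action shape are all hypothetical, and the model call is kept synchronous for brevity where a real one would be asynchronous.

```javascript
// Hypothetical sketch of plan caching; not Midscene's implementation.
class PlanCache {
  constructor(planWithModel) {
    this.planWithModel = planWithModel; // expensive call (stands in for a model request)
    this.entries = new Map();           // prompt -> planned actions
    this.modelCalls = 0;                // counts how often the model was actually used
  }

  // Return cached actions when the same instruction was planned before;
  // otherwise ask the model once and remember the answer.
  plan(prompt) {
    if (this.entries.has(prompt)) {
      return this.entries.get(prompt);
    }
    this.modelCalls += 1;
    const actions = this.planWithModel(prompt);
    this.entries.set(prompt, actions);
    return actions;
  }

  // Drop an entry when replaying it fails (for example, the UI changed),
  // so the next run plans again from a fresh screenshot.
  invalidate(prompt) {
    this.entries.delete(prompt);
  }
}

// Stand-in for a model call (assumption).
function fakeModel(prompt) {
  return [{ type: "tap", target: prompt }];
}

const cache = new PlanCache(fakeModel);
for (let run = 0; run < 1000; run += 1) {
  cache.plan("click the login button");
}
console.log(cache.modelCalls); // 1
```

The first run pays for the model call; the other 999 runs reuse the stored plan. The `invalidate` path is what would let a cached script recover when the interface changes.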

👉 Zero-code Quick Experience

✨ Driven by Visual Language Model

Midscene.js is all-in on the pure-vision route for UI actions: element localization and interactions are based on screenshots only. It supports visual-language models like Qwen3-VL, Doubao-1.6-vision, gemini-3-pro, and UI-TARS. For data extraction and page understanding, you can still opt in to include DOM when needed.

  • Pure-vision localization for UI actions; the DOM extraction mode is removed.
  • Works across web, mobile, desktop, and even <canvas> surfaces.
  • Far fewer tokens by skipping DOM for actions, which cuts cost and speeds up runs.
  • DOM can still be included for data extraction and page understanding when needed.
  • Strong open-source options for self-hosting.
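
As a rough illustration of screenshot-only localization, the sketch below turns a bounding box reported in screenshot pixels into a click point. The function name `bboxToClickPoint`, the `[left, top, right, bottom]` box format, and the `scale` parameter (screenshot pixels per logical pixel) are assumptions for this example, not Midscene's API or any specific model's output format.

```javascript
// Hypothetical helper: convert a bounding box in screenshot pixels into a
// click point in logical pixels by taking the box center and undoing the
// screenshot scale factor.
function bboxToClickPoint(bbox, scale = 1) {
  const [left, top, right, bottom] = bbox;
  if (right <= left || bottom <= top) {
    throw new RangeError("bbox must have positive width and height");
  }
  return {
    x: Math.round((left + right) / 2 / scale),
    y: Math.round((top + bottom) / 2 / scale),
  };
}

// A box located on a 2x screenshot maps back to logical coordinates.
const point = bboxToClickPoint([200, 400, 600, 520], 2);
console.log(point); // { x: 200, y: 230 }
```

Nothing in this path reads the DOM, which is why the same approach can apply to a `<canvas>` or a native app: only pixels in, coordinates out.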

Read more about Model Strategy

📄 Resources

🤝 Community

🌟 Awesome Midscene

Community projects that extend Midscene.js capabilities:

📝 Credits

We would like to thank the following projects:

  • Rsbuild and Rslib for the build tool.
  • UI-TARS for the open-source agent model UI-TARS.
  • Qwen-VL for the open-source VL model Qwen-VL.
  • scrcpy and yume-chan for letting us control Android devices from the browser.
  • appium-adb for the JavaScript bridge of adb.
  • appium-webdriveragent for operating XCTest from JavaScript.
  • YADB for the yadb tool which improves the performance of text input.
  • libnut-core for the cross-platform native keyboard and mouse control.
  • Puppeteer for browser automation and control.
  • Playwright for browser automation, control, and testing.

📖 Citation

If you use Midscene.js in your research or project, please cite:

@software{Midscene.js,
  author = {Xiao Zhou and Tao Yu and YiBing Lin},
  title = {Midscene.js: Your AI Operator for Web, Android, iOS, Automation \& Testing.},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/web-infra-dev/midscene}
}

📝 License

Midscene.js is MIT licensed.


If this project helps you or inspires you, please give us a star
