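@quanruzhuoxiu: With Midscene's Computer Agent, desktop automation can run headless in Linux CI. People assume desktop UI automation needs a physical machine or a VM attached, so desktop E2E tests for Mac/Windows usually run only locally and never make it into CI. As a result…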

X AI KOLs Timeline 2026/05/24 14:00 Tools

midscene ui-automation headless-testing linux-ci desktop-automation xvfb e2e-testing

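Summary

Midscene's Computer Agent lets desktop UI automation run headless in Linux CI. It wraps commands with xvfb-run automatically, needs no physical machine or VM, and supports Electron, Qt, and GTK applications.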

Cached: 2026/05/25 02:41

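With Midscene's Computer Agent, desktop automation can run headless in Linux CI.

People assume desktop UI automation needs a physical machine or a VM attached, so desktop E2E tests for Mac/Windows usually run only locally and never make it into CI. The result is that regression testing for desktop apps is forever done by hand.

The approach in the docs is as simple as one apt-get install line plus one API flag: Midscene automatically wraps your command with xvfb-run, so you do not have to start Xvfb yourself.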
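
To make the "wrap your command with xvfb-run" idea concrete, here is a minimal TypeScript sketch of what such wrapping could look like. It is an illustration under stated assumptions, not Midscene's actual implementation: the function name wrapWithXvfb, the HostEnv shape, and the default screen geometry are all hypothetical. The only real pieces are xvfb-run itself and its -a and --server-args options; on Debian/Ubuntu runners, xvfb-run ships in the xvfb package.

```typescript
// Illustration only: NOT Midscene's real code. All names here are hypothetical.

interface HostEnv {
  platform: string; // e.g. the value of process.platform
  display?: string; // the value of $DISPLAY, if one is set
}

// Prefix a command with xvfb-run when running on Linux without an X display.
// On any other host the command is returned unchanged.
function wrapWithXvfb(
  command: string[],
  env: HostEnv,
  screen: string = "1280x1024x24", // assumed default: width x height x depth
): string[] {
  const needsVirtualDisplay = env.platform === "linux" && !env.display;
  if (!needsVirtualDisplay) {
    return command;
  }
  return [
    "xvfb-run",
    "-a", // --auto-servernum: pick a free display number
    `--server-args=-screen 0 ${screen}`, // passed as a single argv element
    ...command,
  ];
}

// Example: a headless CI host (Linux, no DISPLAY) gets the wrapped command.
const wrapped = wrapWithXvfb(["node", "e2e/desktop.test.js"], { platform: "linux" });
console.log(wrapped);
```

The result is an argv array, so it can be handed to a process launcher such as Node's child_process.spawn(argv[0], argv.slice(1)) without shell quoting concerns.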

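It runs directly on GitHub Actions and GitLab Runner, and it can recognize Electron, Qt, and native GTK apps, because we take a pure-vision approach and do not depend on any accessibility API.

The full workflow that Midscene Studio itself runs in CI is public in the repo; copy it into your project, change a few lines, and it works: http://github.com/web-infra-dev/midscene/blob/main/.github/workflows/studio-headless-linux.yml…

The attached image shows this workflow's most recent green run, #25790418497.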

web-infra-dev/midscene

Source: https://github.com/web-infra-dev/midscene

Midscene.js


Official Website: https://midscenejs.com/

AI-powered, vision-driven UI automation for every platform.

📣 Midscene Skills is here!

Use Midscene Skills to control any platform with OpenClaw

Showcases

💡 Features

Write Automation with Natural Language

Describe your goals and steps, and Midscene will plan and operate the user interface for you.
Use the JavaScript SDK or YAML to write your automation scripts.

Web & Mobile App & Any Interface

Web Automation: Integrate with Puppeteer or Playwright, or use Bridge Mode to control your desktop browser.
Android Automation: Use the JavaScript SDK with adb to control your local Android device.
iOS Automation: Use the JavaScript SDK with WebDriverAgent to control your local iOS devices and simulators.
Any Interface Automation: Use the JavaScript SDK to control your own interface.

For Developers

Three kinds of APIs:
- Interaction API: interact with the user interface.
- Data Extraction API: extract data from the user interface and the DOM.
- Utility API: utility functions like aiAssert(), aiLocate(), aiWaitFor().
MCP: Midscene provides MCP services that expose atomic Midscene Agent actions as MCP tools so upper-layer agents can inspect and operate UIs with natural language.
Caching for Efficiency: Replay your script with cache and get the result faster.
Debugging Experience: Midscene.js offers a visual replay report file, a built-in playground, and a Chrome Extension to simplify debugging. These are the tools most developers truly need.

👉 Zero-code Quick Experience

Chrome Extension: Start in-browser experience immediately through the Chrome Extension, without writing any code.
Android Playground: There is also a built-in Android playground to control your local Android device.
iOS Playground: There is also a built-in iOS playground to control your local iOS device.

✨ Driven by Visual Language Model

Midscene.js is all-in on the pure-vision route for UI actions: element localization and interactions are based on screenshots only. It supports visual-language models like Qwen3-VL, Doubao-1.6-vision, gemini-3-pro, and UI-TARS. For data extraction and page understanding, you can still opt in to include DOM when needed.

Pure-vision localization for UI actions; the DOM extraction mode is removed.
Works across web, mobile, desktop, and even <canvas> surfaces.
Far fewer tokens by skipping DOM for actions, which cuts cost and speeds up runs.
DOM can still be included for data extraction and page understanding when needed.
Strong open-source options for self-hosting.
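
As a rough illustration of what "screenshots only" implies for an action such as a click, the sketch below turns a bounding box reported by a vision model into a click point. It is an assumption-laden example, not Midscene's internals: the Box and Point shapes, the clickPointFor name, and the scale parameter (screenshot pixels per logical screen pixel) are hypothetical.

```typescript
// Hypothetical shapes: a vision model is assumed to return a bounding box
// measured in screenshot pixels. These are not Midscene's internal types.
interface Box {
  left: number;
  top: number;
  width: number;
  height: number;
}

interface Point {
  x: number;
  y: number;
}

// Convert a box in screenshot pixels into a click point in logical screen
// coordinates. `scale` is screenshot pixels per logical pixel (2 on a
// typical HiDPI display, 1 otherwise).
function clickPointFor(box: Box, scale: number = 1): Point {
  if (scale <= 0) {
    throw new RangeError("scale must be positive");
  }
  return {
    x: Math.round((box.left + box.width / 2) / scale),
    y: Math.round((box.top + box.height / 2) / scale),
  };
}

// A box located at (200, 100) with size 80x40 on a 2x screenshot.
const p = clickPointFor({ left: 200, top: 100, width: 80, height: 40 }, 2);
console.log(p); // { x: 120, y: 60 }
```

Because nothing here touches a DOM or an accessibility tree, the same computation applies to a browser page, a mobile screen, or a desktop window.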

📄 Resources

Official Website: https://midscenejs.com
Documentation: https://midscenejs.com
Sample Projects: https://github.com/web-infra-dev/midscene-example
API Reference: https://midscenejs.com/api
GitHub: https://github.com/web-infra-dev/midscene

🤝 Community

🌟 Awesome Midscene

Community projects that extend Midscene.js capabilities:

midscene-ios - iOS Mirror automation support for Midscene
midscene-pc - PC operation device for Windows, macOS, and Linux
midscene-pc-docker - Docker image with Midscene-PC server pre-installed
Midscene-Python - Python SDK for Midscene automation
midscene-java by @Master-Frank - Java SDK for Midscene automation
midscene-java by @alstafeev - Java SDK for Midscene automation

📝 Credits

We would like to thank the following projects:

Rsbuild and Rslib for the build tool.
UI-TARS for the open-source agent model UI-TARS.
Qwen-VL for the open-source VL model Qwen-VL.
scrcpy and yume-chan, which allow us to control Android devices from the browser.
appium-adb for the JavaScript bridge of adb.
appium-webdriveragent for operating XCTest from JavaScript.
YADB for the yadb tool which improves the performance of text input.
libnut-core for the cross-platform native keyboard and mouse control.
Puppeteer for browser automation and control.
Playwright for browser automation, control, and testing.

📖 Citation

If you use Midscene.js in your research or project, please cite:

@software{Midscene.js,
  author = {Xiao Zhou, Tao Yu, YiBing Lin},
  title = {Midscene.js: Your AI Operator for Web, Android, iOS, Automation & Testing.},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/web-infra-dev/midscene}
}

✨ Star History

📝 License

Midscene.js is MIT licensed.

If this project helps you or inspires you, please give us a star

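Related articles

@quanruzhuoxiu: People often ask: what is the difference between Midscene and Browser-Use? Both are open source, both use vision, and each solves the problem it is meant to solve. What follows is an honest comparison, not a knock on Browser-Use. Browser-Use is a web agent, positioned as "open the browser and…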

X AI KOLs Timeline

A comparison of Midscene and Browser-Use, two open-source tools with different focuses: Browser-Use is a web agent for one-time tasks, while Midscene is a vision SDK designed for reliable multi-platform repeated execution.

@canghe: https://x.com/canghe/status/2061431572306518501

X AI KOLs Timeline

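WeSight is now officially open source. It is a desktop AI agent console that manages and schedules multiple agent engines such as Claude Code and Codex in one place, with a visual workbench, team collaboration, and Feishu integration.

@quanruzhuoxiu: In the two years we have spent building Midscene.js, we made a late but crucial call: UI automation will sooner or later have to move from "understanding the DOM" to "looking at the screen", so in the 1.0 release last December we cut the DOM compatibility path outright. Early on, like everyone else, we used a hybrid DOM + vision approach…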

X AI KOLs Timeline

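The Midscene.js team decided to move fully from a hybrid DOM + vision approach to pure-vision UI automation, on the view that UI automation will inevitably be based on screenshots rather than the DOM. The change reduced token consumption and simplified cross-platform adaptation.

@VincentLogic: Found an open-source desktop AI gem from ByteDance! UI-TARS Desktop: the 31k stars are no exaggeration. It really can read your screen and then operate your computer for you. Tell it "turn on auto save in VS Code and set the delay to 500 milliseconds" and it goes ahead on its own: -…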

X AI KOLs Timeline

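UI-TARS Desktop, ByteDance's open-source desktop AI automation tool, supports local execution and visual screen understanding, and can operate a computer autonomously from natural-language instructions to complete everyday tasks.

@GoSailGlobal: ByteDance has quietly open-sourced its GUI agent line, and it is more solid than expected. UI-TARS-desktop (29.4k on GitHub) packs two things into one repository: · Agent TARS: a general-purpose multimodal agent framework, launched with one CLI command, which can, in the terminal…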

X AI KOLs Timeline

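ByteDance has open-sourced the UI-TARS-desktop project, which contains the general-purpose multimodal agent framework Agent TARS and the local GUI agent UI-TARS Desktop. It supports executing real tasks in the terminal or browser, is built on the UI-TARS vision model and Seed-1.5-VL, and is released under the Apache 2.0 license.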
