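@quanruzhuoxiu: With Midscene's Computer Agent, desktop automation can run headless in Linux CI. The common assumption is that desktop UI automation needs a real machine or a VM attached, so Mac/Windows desktop E2E tests usually run only locally and never make it into CI. As a result…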
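Summary
Midscene's Computer Agent lets desktop UI automation run headless in Linux CI. It wraps commands with xvfb-run automatically, needs no physical machine or VM, and supports Electron, Qt, and GTK applications.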
Cached at: 2026/05/25 02:41
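With Midscene's Computer Agent, desktop automation can run headless in Linux CI.
The common assumption is that desktop UI automation needs a real machine or a VM attached, so Mac/Windows desktop E2E tests usually run only locally and never make it into CI. As a result, regression testing for desktop apps ends up being manual forever.
The setup in the docs is as simple as one apt-get install line plus one API flag: Midscene automatically wraps your command with xvfb-run, so you don't have to start Xvfb yourself.
It runs directly on GitHub Actions and GitLab Runner, and Electron, Qt, and native GTK apps are all recognized, because we take a pure-vision approach and do not depend on any accessibility API.
The full workflow that Midscene Studio itself runs in CI is public in the repo; copy it into your project, change a few lines, and it works: http://github.com/web-infra-dev/midscene/blob/main/.github/workflows/studio-headless-linux.yml…
The attached image shows this workflow's most recent green run, #25790418497.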
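To make the xvfb-run wrapping mentioned in the post concrete, here is a minimal sketch of what such a wrapper could look like. It is not Midscene's implementation: the function name `wrapWithXvfb`, its parameters, and the 1280x720x24 screen size are assumptions made for illustration. Only the `xvfb-run` options `--auto-servernum` and `--server-args` are real flags of that tool.

```typescript
// Hypothetical helper, NOT Midscene's code: it only illustrates the idea of
// prefixing a command with xvfb-run when a Linux runner has no X display.
function wrapWithXvfb(
  command: string[],
  platform: string, // e.g. the value of process.platform
  display: string | undefined, // e.g. the value of process.env.DISPLAY
): string[] {
  // A virtual framebuffer is only needed on Linux when no display is set.
  const needsVirtualDisplay = platform === "linux" && !display;
  if (!needsVirtualDisplay) {
    return command;
  }
  return [
    "xvfb-run",
    "--auto-servernum", // let xvfb-run pick a free X server number
    "--server-args=-screen 0 1280x720x24", // assumed virtual screen size and depth
    ...command,
  ];
}
```

Under these assumptions, `wrapWithXvfb(["npm", "test"], "linux", undefined)` yields an argument list that starts with `xvfb-run` and ends with `npm test`, while the same call with `"darwin"` returns the command unchanged.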
web-infra-dev/midscene
Source: https://github.com/web-infra-dev/midscene
Midscene.js
AI-powered, vision-driven UI automation for every platform.
📣 Midscene Skills is here!
Use Midscene Skills to control any platform with OpenClaw
Showcases
- Web Automation - Automatically register the GitHub form in a web browser and pass all field validations
- iOS Automation - Meituan coffee order
- iOS Automation - Auto-like the first @midscene_ai tweet
- Android Automation - DCar: Xiaomi SU7 specs
- Android Automation - Booking a hotel for Christmas
- MCP Integration - Midscene MCP UI prepatch release
- robotic arm + vision + voice for in-vehicle testing
💡 Features
Write Automation with Natural Language
- Describe your goals and steps, and Midscene will plan and operate the user interface for you.
- Use the JavaScript SDK or YAML to write your automation script.
Web & Mobile App & Any Interface
- Web Automation: Integrate with Puppeteer or Playwright, or use Bridge Mode to control your desktop browser.
- Android Automation: Use the JavaScript SDK with adb to control your local Android device.
- iOS Automation: Use the JavaScript SDK with WebDriverAgent to control your local iOS devices and simulators.
- Any Interface Automation: Use the JavaScript SDK to control your own interface.
For Developers
- Three kinds of APIs:
  - Interaction API: interact with the user interface.
  - Data Extraction API: extract data from the user interface and DOM.
  - Utility API: utility functions like aiAssert(), aiLocate(), aiWaitFor().
- MCP: Midscene provides MCP services that expose atomic Midscene Agent actions as MCP tools so upper-layer agents can inspect and operate UIs with natural language.
- Caching for Efficiency: Replay your script with cache and get the result faster.
- Debugging Experience: Midscene.js offers a visual replay report file, a built-in playground, and a Chrome Extension to simplify debugging. These are the tools most developers truly need.
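The "replay with cache" idea in the list above can be illustrated with a small memoizing wrapper. This is a sketch under stated assumptions, not Midscene's cache format or API: the `Point` and `Locator` types and the `withCache` function are hypothetical, and the locator is kept synchronous for brevity even though a real model call would be asynchronous.

```typescript
// Hypothetical sketch: these names and the cache shape are assumptions,
// not Midscene's actual caching implementation.
type Point = { x: number; y: number };
type Locator = (instruction: string) => Point;

// Wrap a slow, model-backed locator so repeated instructions reuse the
// previously resolved point instead of calling the model again.
function withCache(locate: Locator, cache: Map<string, Point>): Locator {
  return (instruction: string): Point => {
    const hit = cache.get(instruction);
    if (hit !== undefined) {
      return hit; // replay path: no model call
    }
    const point = locate(instruction); // slow path: ask the model
    cache.set(instruction, point);
    return point;
  };
}
```

A production cache would also need to detect when the UI has changed and the stored result is stale; this sketch deliberately leaves that out.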
👉 Zero-code Quick Experience
- Chrome Extension: Start in-browser experience immediately through the Chrome Extension, without writing any code.
- Android Playground: There is also a built-in Android playground to control your local Android device.
- iOS Playground: There is also a built-in iOS playground to control your local iOS device.
✨ Driven by Visual Language Model
Midscene.js is all-in on the pure-vision route for UI actions: element localization and interactions are based on screenshots only. It supports visual-language models like Qwen3-VL, Doubao-1.6-vision, gemini-3-pro, and UI-TARS. For data extraction and page understanding, you can still opt in to include DOM when needed.
- Pure-vision localization for UI actions; the DOM extraction mode is removed.
- Works across web, mobile, desktop, and even <canvas> surfaces.
- Far fewer tokens by skipping DOM for actions, which cuts cost and speeds up runs.
- DOM can still be included for data extraction and page understanding when needed.
- Strong open-source options for self-hosting.
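As a rough illustration of what screenshot-only localization implies, the sketch below turns a bounding box found on a screenshot into a click point. Everything here is an assumption for illustration: the `Box` shape, the `clickPoint` function, and the `screenshotScale` parameter (screenshot pixels per logical pixel) are hypothetical and not taken from Midscene's API or any model's output format.

```typescript
// Hypothetical types and helper; not part of Midscene's API.
type Box = { left: number; top: number; width: number; height: number };

// Convert a bounding box in screenshot pixels into a click point in
// logical coordinates, aiming at the center of the box.
function clickPoint(box: Box, screenshotScale: number): { x: number; y: number } {
  const centerX = box.left + box.width / 2;
  const centerY = box.top + box.height / 2;
  // Screenshots on high-DPI displays are larger than the logical screen,
  // so divide by the scale factor before dispatching the click.
  return {
    x: Math.round(centerX / screenshotScale),
    y: Math.round(centerY / screenshotScale),
  };
}
```

For example, a box at left 100, top 200 with width 40 and height 20 on a 2x screenshot maps to the logical point (60, 105). No DOM or accessibility tree is involved at any step, which is why this approach can apply to any surface that can be captured as an image.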
Read more about Model Strategy
📄 Resources
- Official Website: https://midscenejs.com
- Documentation: https://midscenejs.com
- Sample Projects: https://github.com/web-infra-dev/midscene-example
- API Reference: https://midscenejs.com/api
- GitHub: https://github.com/web-infra-dev/midscene
🤝 Community
🌟 Awesome Midscene
Community projects that extend Midscene.js capabilities:
- midscene-ios - iOS Mirror automation support for Midscene
- midscene-pc - PC operation device for Windows, macOS, and Linux
- midscene-pc-docker - Docker image with Midscene-PC server pre-installed
- Midscene-Python - Python SDK for Midscene automation
- midscene-java by @Master-Frank - Java SDK for Midscene automation
- midscene-java by @alstafeev - Java SDK for Midscene automation
📝 Credits
We would like to thank the following projects:
- Rsbuild and Rslib for the build tool.
- UI-TARS for the open-source agent model UI-TARS.
- Qwen-VL for the open-source VL model Qwen-VL.
- scrcpy and yume-chan for letting us control Android devices from the browser.
- appium-adb for the JavaScript bridge of adb.
- appium-webdriveragent for operating XCTest from JavaScript.
- YADB for the yadb tool which improves the performance of text input.
- libnut-core for the cross-platform native keyboard and mouse control.
- Puppeteer for browser automation and control.
- Playwright for browser automation, control, and testing.
📖 Citation
If you use Midscene.js in your research or project, please cite:
@software{Midscene.js,
author = {Xiao Zhou and Tao Yu and YiBing Lin},
title = {Midscene.js: Your AI Operator for Web, Android, iOS, Automation & Testing.},
year = {2025},
publisher = {GitHub},
url = {https://github.com/web-infra-dev/midscene}
}
📝 License
Midscene.js is MIT licensed.
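Similar articles
@quanruzhuoxiu: A question I get a lot: what is the difference between Midscene and Browser-Use? Both are open source, both use vision, and each solves the problem it is meant to solve. Below is an honest comparison, not a knock on Browser-Use. Browser-Use is a web agent, positioned as "open a browser and…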
A comparison of Midscene and Browser-Use, two open-source tools with different focuses: Browser-Use is a web agent for one-time tasks, while Midscene is a vision SDK designed for reliable multi-platform repeated execution.
@canghe: https://x.com/canghe/status/2061431572306518501
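WeSight is now officially open source. It is a desktop AI agent console that manages and schedules multiple agent engines such as Claude Code and Codex in one place, offering a visual workbench, team collaboration, Feishu integration, and more.
@quanruzhuoxiu: In the two years of building Midscene.js, we made a late but crucial call: UI automation will sooner or later have to switch from "understanding the DOM" to "looking at the screen", so in the 1.0 release last December we cut the DOM compatibility path outright. Early on, like everyone else, we used a hybrid DOM + vision approach…
The Midscene.js team decided to move entirely from a hybrid DOM + vision approach to pure-vision UI automation, on the view that future UI automation must be based on screenshots rather than the DOM. The change reduced token consumption and simplified cross-platform adaptation.
@VincentLogic: Found an open-source desktop AI gem from ByteDance! UI-TARS Desktop: the 31k stars are no exaggeration. This thing really can understand your screen and then operate your computer for you. You tell it "turn on auto save in VS Code and set the delay to 500 milliseconds", and it goes ahead on its own: -…
UI-TARS Desktop, ByteDance's open-source desktop AI automation tool, supports local execution and visual screen understanding, and can operate a computer autonomously from natural-language instructions to complete everyday tasks.
@GoSailGlobal: ByteDance has quietly open-sourced its GUI Agent line, and it is more solid than expected. UI-TARS-desktop (29.4k on GitHub) packs two things into one repo: · Agent TARS: a general multimodal agent framework, launched with a single CLI command, which can, in the terminal…
ByteDance has open-sourced the UI-TARS-desktop project, which contains the general multimodal agent framework Agent TARS and the local GUI agent UI-TARS Desktop. It supports running real tasks in the terminal or browser, is built on the UI-TARS vision model and Seed-1.5-VL, and is licensed under Apache 2.0.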