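@quanruzhuoxiu: With Midscene's Computer Agent, desktop automation can run headless in Linux CI. People assume desktop UI automation needs a physical machine or a VM attached, so Mac/Windows desktop E2E usually runs only locally and never makes it into CI. As a result…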

X AI KOLs Timeline tool

Summary

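Midscene's Computer Agent lets desktop UI automation run headless in Linux CI by automatically wrapping commands with xvfb-run, with no physical machine or VM required. It supports Electron, Qt, and GTK applications.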

Cached: 2026/05/25 02:41

With Midscene's Computer Agent, desktop automation can run headless in Linux CI.

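People assume desktop UI automation needs a physical machine or a VM attached, so Mac/Windows desktop E2E usually runs only locally and never makes it into CI. The result is that regression testing for desktop apps always falls back to manual testing.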

The documented setup is as simple as one apt-get install line plus one API flag: Midscene automatically wraps your command with xvfb-run, so you don't have to start Xvfb by hand.
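To make the wrapping step concrete, here is a minimal TypeScript sketch of what "wrap the command with xvfb-run" can look like. It is not Midscene's implementation, and it does not reproduce the API flag mentioned above, which the post does not name. The helper names `wrapWithXvfb`, `needsXvfb`, and `runHeadless` are hypothetical, as is the default screen size. It assumes `xvfb-run` is on the PATH (on Debian/Ubuntu it ships in the `xvfb` package).

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper, not part of Midscene: builds the argv that runs
// `command` under a temporary virtual X display via xvfb-run.
export function wrapWithXvfb(
  command: string[],
  screen: string = "1280x1024x24", // assumed default; adjust to your app
): string[] {
  if (command.length === 0) {
    throw new Error("command must not be empty");
  }
  return [
    "xvfb-run",
    "--auto-servernum", // pick a free display number automatically
    `--server-args=-screen 0 ${screen}`, // size and depth of the virtual screen
    ...command,
  ];
}

// Wrapping is only needed on Linux when no X display is already available.
export function needsXvfb(
  platform: string = process.platform,
  env: Record<string, string | undefined> = process.env,
): boolean {
  return platform === "linux" && !env.DISPLAY;
}

// Runs the command, wrapped when needed, and resolves with its exit code.
export function runHeadless(command: string[]): Promise<number> {
  const argv = needsXvfb() ? wrapWithXvfb(command) : command;
  const [bin, ...args] = argv;
  return new Promise((resolve, reject) => {
    const child = spawn(bin, args, { stdio: "inherit" });
    child.on("error", reject);
    child.on("close", (code) => resolve(code ?? 1));
  });
}
```

In a CI job, a call such as `runHeadless(["node", "e2e.js"])` (with `e2e.js` standing in for your own test entry point) would start the tests under a virtual display on a Linux runner and run them unwrapped on a developer machine that already has a display.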

It runs directly on GitHub Actions and GitLab Runner, and it can recognize Electron, Qt, and native GTK applications, because we go pure vision and do not depend on any accessibility API.

The full workflow that Midscene Studio itself runs in CI is public in the repository; copy it into your project, change a few lines, and it works: http://github.com/web-infra-dev/midscene/blob/main/.github/workflows/studio-headless-linux.yml…

The attached image shows the most recent green run of this workflow, run #25790418497.


web-infra-dev/midscene

Source: https://github.com/web-infra-dev/midscene

Midscene.js

Official Website: https://midscenejs.com/

AI-powered, vision-driven UI automation for every platform.

📣 Midscene Skills is here!

Use Midscene Skills to control any platform with OpenClaw

Showcases

💡 Features

Write Automation with Natural Language

  • Describe your goals and steps, and Midscene will plan and operate the user interface for you.
  • Use the JavaScript SDK or YAML to write your automation script.

Web & Mobile App & Any Interface

  • Web Automation: Integrate with Puppeteer or Playwright, or use Bridge Mode to control your desktop browser.
  • Android Automation: Use the JavaScript SDK with adb to control your local Android device.
  • iOS Automation: Use the JavaScript SDK with WebDriverAgent to control your local iOS devices and simulators.
  • Any Interface Automation: Use the JavaScript SDK to control your own interface.

For Developers

  • Three kinds of APIs:
  • MCP: Midscene provides MCP services that expose atomic Midscene Agent actions as MCP tools so upper-layer agents can inspect and operate UIs with natural language.
  • Caching for Efficiency: Replay your script with cache and get the result faster.
  • Debugging Experience: Midscene.js offers a visual replay report file, a built-in playground, and a Chrome extension to simplify debugging. These are the tools most developers truly need.
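As an illustration of the caching idea in the list above, the following TypeScript sketch shows the general pattern of replaying planned steps from a cache instead of asking a model again. It is not Midscene's cache implementation: the `Step` shape, the `Planner` signature, and the `ReplayCache` class are all hypothetical, and the sketch deliberately omits invalidation (for example, checking that a cached step still matches the current screen).

```typescript
// Hypothetical shape of one planned UI step.
type Step = { action: string; x: number; y: number };

// Hypothetical planner signature standing in for an expensive model call.
type Planner = (instruction: string) => Promise<Step[]>;

export class ReplayCache {
  private readonly store = new Map<string, Step[]>();
  private readonly plan: Planner;
  hits = 0;
  misses = 0;

  constructor(plan: Planner) {
    this.plan = plan;
  }

  // Returns cached steps for an instruction, planning only on a miss.
  async resolve(instruction: string): Promise<Step[]> {
    const key = instruction.trim(); // normalize surrounding whitespace
    const cached = this.store.get(key);
    if (cached !== undefined) {
      this.hits++;
      return cached;
    }
    this.misses++;
    const steps = await this.plan(key);
    this.store.set(key, steps);
    return steps;
  }
}
```

On a second run with the same instruction, `resolve` returns the stored steps without invoking the planner, which is where the speedup comes from in this simplified model.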

👉 Zero-code Quick Experience

✨ Driven by Visual Language Model

Midscene.js is all-in on the pure-vision route for UI actions: element localization and interactions are based on screenshots only. It supports visual-language models like Qwen3-VL, Doubao-1.6-vision, gemini-3-pro, and UI-TARS. For data extraction and page understanding, you can still opt in to include DOM when needed.

  • Pure-vision localization for UI actions; the DOM extraction mode is removed.
  • Works across web, mobile, desktop, and even <canvas> surfaces.
  • Far fewer tokens by skipping DOM for actions, which cuts cost and speeds up runs.
  • DOM can still be included for data extraction and page understanding when needed.
  • Strong open-source options for self-hosting.

Read more about Model Strategy

📄 Resources

🤝 Community

🌟 Awesome Midscene

Community projects that extend Midscene.js capabilities:

📝 Credits

We would like to thank the following projects:

  • Rsbuild and Rslib for the build tool.
  • UI-TARS for the open-source agent model UI-TARS.
  • Qwen-VL for the open-source VL model Qwen-VL.
  • scrcpy and yume-chan, which allow us to control Android devices from the browser.
  • appium-adb for the JavaScript bridge of adb.
  • appium-webdriveragent for operating XCTest from JavaScript.
  • YADB for the yadb tool which improves the performance of text input.
  • libnut-core for the cross-platform native keyboard and mouse control.
  • Puppeteer for browser automation and control.
  • Playwright for browser automation, control, and testing.

📖 Citation

If you use Midscene.js in your research or project, please cite:

@software{Midscene.js,
  author = {Xiao Zhou, Tao Yu, YiBing Lin},
  title = {Midscene.js: Your AI Operator for Web, Android, iOS, Automation & Testing.},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/web-infra-dev/midscene}
}

📝 License

Midscene.js is MIT licensed.


If this project helps you or inspires you, please give us a star

Similar articles

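@quanruzhuoxiu: I often get asked: what is the difference between Midscene and Browser-Use? Both are open source, both use vision, and each solves the problem it is meant to solve. Below is an honest comparison, not a knock on Browser-Use. Browser-Use is a web agent, positioned as "open a browser and…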

X AI KOLs Timeline

A comparison of Midscene and Browser-Use, two open-source tools with different focuses: Browser-Use is a web agent for one-time tasks, while Midscene is a vision SDK designed for reliable multi-platform repeated execution.

@canghe: https://x.com/canghe/status/2061431572306518501

X AI KOLs Timeline

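WeSight is now officially open source. It is a desktop AI Agent console that centrally manages and schedules multiple agent engines such as Claude Code and Codex, and offers a visual workbench, team collaboration, and Feishu integration.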

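@quanruzhuoxiu: In two years of building Midscene.js, we made a late but crucial call: UI automation will sooner or later have to switch from "understanding the DOM" to "looking at the screen", so in the 1.0 release last December we cut the DOM compatibility path outright. Early on, like everyone else, we used a hybrid DOM + vision approach…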

X AI KOLs Timeline

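The Midscene.js team decided to move completely from a hybrid DOM + vision approach to pure-vision UI automation, on the view that future UI automation must be based on screenshots rather than the DOM. The change lowered token consumption and simplified cross-platform adaptation.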

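@GoSailGlobal: ByteDance has quietly open-sourced its GUI Agent approach, and it is more solid than expected. UI-TARS-desktop (GitHub 29.4k) packs two things into one repository: · Agent TARS: a general-purpose multimodal Agent framework, launched with a single CLI command, that can in the terminal…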

X AI KOLs Timeline

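ByteDance has open-sourced the UI-TARS-desktop project, which includes the general-purpose multimodal Agent framework Agent TARS and the local GUI Agent UI-TARS Desktop. It supports executing real tasks in the terminal and browser, is built on the UI-TARS vision model and Seed-1.5-VL, and is released under the Apache 2.0 license.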