@quanruzhuoxiu: Over two years of developing Midscene.js, we made a belated but critical decision: UI automation will sooner or later shift from 'understanding the DOM' to 'looking at the screen'. So in the December 1.0 release, we removed the DOM compatibility path outright. In the early days, like everyone else, we followed a DOM + visual hybrid approach...
Summary
The Midscene.js team decided to move entirely from a DOM + visual hybrid approach to pure visual UI automation, believing that future UI automation must be based on screenshots rather than the DOM. The change reduced token consumption and simplified cross-platform adaptation.
Similar Articles
@quanruzhuoxiu: Often asked: What's the difference between Midscene and Browser-Use? Both are open-source, both use vision, both solve their respective problems. Here's an honest comparison, not to bash Browser-Use. Browser-Use is a web agent, positioned as "open the browser, get this done…"
A comparison of Midscene and Browser-Use, two open-source tools with different focuses: Browser-Use is a web agent for one-off tasks, while Midscene is a vision SDK designed for reliable, repeatable execution across multiple platforms.
@quanruzhuoxiu: My favorite design in Midscene.js is actually not the AI part, but the HTML replay report. Every time a script runs, it automatically generates a single-file HTML report containing: - Screenshots of each step - Full prompt input to the model - JSON output from the model (...
Midscene.js's HTML replay report helps developers quickly locate the cause of AI automation failures by combining three things for each step: the screenshot, the prompt, and the model output.
@quanruzhuoxiu: With Midscene's Computer Agent, desktop automation runs headless in Linux CI. Everyone assumes desktop UI automation must use a real machine or VM, so Mac/Windows desktop E2E can only run locally and cannot enter CI. Result...
Midscene's Computer Agent lets desktop UI automation run headless in Linux CI using xvfb-run, without a real machine or VM, and supports Electron, Qt, and GTK applications.
@billtheinvestor: Imagine every pixel on your screen streaming live, straight from the model—no HTML, no layout engine, no code, just the exact image you want to see. @eddiejiao_obj, @drewocarr and I built a prototype to explore how this works in practice and to turn it into reality…
Flipbook is a prototype that streams every screen pixel directly from an AI model in real time, eliminating HTML, layout engines, and traditional code.
@leeoxiang: As for the local-first, CLI-first product form, Hyperframes' current interaction already embodies it. Claude Code handles LUI interaction, while Hyperframes launches a UI for preview and necessary GUI interaction.
Hyperframes already implements a local-first, CLI-first product form: Claude Code handles LUI interaction, and Hyperframes launches a UI for preview and any necessary GUI interaction.