@quanruzhuoxiu: Over the two years of developing Midscene.js, we made a belated but critical decision: UI automation will sooner or later shift from 'understanding the DOM' to 'looking at the screen'. So in the December 1.0 release, we directly cut the DOM compatibility path. In the early days, like everyone else, we followed a DOM + visual hybrid approach...

X AI KOLs Timeline 05/14/26, 02:00 PM Tools

ui-automation visual-testing midscene-js dom-alternative token-efficiency cross-platform

Summary

The Midscene.js team decided to completely shift from a DOM + visual hybrid approach to pure visual UI automation, believing that future UI automation must be based on screenshots rather than the DOM. This change reduced token consumption and simplified cross-platform adaptation.

Over two years of developing Midscene.js, we made a belated but critical decision: UI automation will sooner or later shift from 'understanding the DOM' to 'looking at the screen'. So in the 1.0 release in December, we removed the DOM compatibility path outright.

Early on, like everyone else, we took a hybrid DOM + visual approach, using the DOM wherever possible to save tokens and keep element positioning stable. But the deeper we went, the clearer it became that the same product now has to run on Web, iOS, Android, HarmonyOS, and Mac, Windows, and Linux desktops, as well as on rendering layers such as Canvas, Electron, and Qt that have no DOM at all. If we had to maintain a separate set of DOM adaptations for element positioning on each platform, the work would never converge. So in 1.0 we switched UI operations entirely to pure vision: look only at screenshots and never read the DOM.

An unexpected benefit: with no DOM in the prompt, token consumption actually came out lower than with the previous hybrid approach. The repository link is in the comments.
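To make the convergence argument concrete, here is a minimal TypeScript sketch of why a screenshot-only driver needs so little from each platform. It is not Midscene.js's actual API: `ScreenDevice`, `VisualLocator`, `tapByDescription`, `FakeDevice`, and `fakeLocate` are all hypothetical names invented for illustration, and the locator is a stub standing in for a vision model call.

```typescript
// Hypothetical sketch: none of these names come from Midscene.js.

// The only two capabilities a pure-vision driver asks of a platform.
interface ScreenDevice {
  screenshot(): Promise<Uint8Array>; // image bytes of the current screen
  tap(x: number, y: number): Promise<void>; // click or touch at pixel coordinates
}

interface Point {
  x: number;
  y: number;
}

// Stand-in for a vision model: image plus description in, coordinates out.
type VisualLocator = (image: Uint8Array, description: string) => Promise<Point>;

// Platform-independent action: look at the screen, locate the target, tap it.
// No DOM is read at any step.
async function tapByDescription(
  device: ScreenDevice,
  locate: VisualLocator,
  description: string,
): Promise<Point> {
  const image = await device.screenshot();
  const point = await locate(image, description);
  await device.tap(point.x, point.y);
  return point;
}

// Fake device for illustration: records taps instead of driving a real UI.
class FakeDevice implements ScreenDevice {
  readonly taps: Point[] = [];

  async screenshot(): Promise<Uint8Array> {
    return new Uint8Array([1, 2, 3]); // placeholder bytes, not a real image
  }

  async tap(x: number, y: number): Promise<void> {
    this.taps.push({ x, y });
  }
}

// Fake locator that always returns a fixed point; a real one would call a model.
const fakeLocate: VisualLocator = async (_image, _description) => ({ x: 120, y: 48 });
```

Under these assumptions, supporting a new platform means implementing the two `ScreenDevice` methods, while `tapByDescription` and the locator stay unchanged. A DOM-based design would instead need a separate element-positioning adapter per platform.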


Similar Articles

@quanruzhuoxiu: Often asked: What's the difference between Midscene and Browser-Use? Both are open-source, both use vision, both solve their respective problems. Here's an honest comparison, not to bash Browser-Use. Browser-Use is a web agent, positioned as "open the browser, get this done…"

X AI KOLs Timeline

A comparison of Midscene and Browser-Use, two open-source tools with different focuses: Browser-Use is a web agent for one-time tasks, while Midscene is a vision SDK designed for reliable multi-platform repeated execution.

@quanruzhuoxiu: My favorite design in Midscene.js is actually not the AI part, but the HTML replay report. Every time a script runs, it automatically generates a single-file HTML report containing: - Screenshots of each step - Full prompt input to the model - JSON output from the model (...

X AI KOLs Timeline

Midscene.js's HTML replay report design helps developers quickly locate the cause of AI automation failures through the triple combination of screenshots, prompt, and model output.

@quanruzhuoxiu: When using Midscene's Computer Agent, desktop automation runs headless in Linux CI. Everyone assumes desktop UI automation must use a real machine or VM, so Mac/Windows desktop E2E can only run locally and cannot enter CI. Result...

X AI KOLs Timeline

Midscene's Computer Agent enables desktop UI automation to run headless in Linux CI, automated via xvfb-run, without needing a real machine or VM, and supports Electron, Qt, and GTK applications.

@billtheinvestor: Imagine every pixel on your screen streaming live, straight from the model—no HTML, no layout engine, no code, just the exact image you want to see. @eddiejiao_obj, @drewocarr and I built a prototype to explore how this works in practice and to turn it into reality…

X AI KOLs Timeline

Flipbook is a prototype that streams every screen pixel directly from an AI model in real time, eliminating HTML, layout engines, and traditional code.

@leeoxiang: Regarding the product form of local-first and CLI-first, Hyperframes' current interaction already embodies this. Claude Code handles LUI interaction, while Hyperframes launches a UI for preview and necessary GUI interaction.

X AI KOLs Following

Hyperframes has already implemented the local-first and CLI-first product form, combining with Claude Code for LUI interaction, and launching a UI for preview and necessary GUI interaction.
