@quanruzhuoxiu: Over the two years of developing Midscene.js, we made a belated but critical decision: UI automation will sooner or later shift from 'understanding the DOM' to 'looking at the screen'. So in the December 1.0 release, we directly cut the DOM compatibility path. In the early days, like everyone else, we followed a DOM + visual hybrid approach...


Summary

The Midscene.js team dropped its DOM + visual hybrid approach in the December 1.0 release and moved to pure visual UI automation, on the view that UI automation will sooner or later be based on screenshots rather than the DOM. The change lowered token consumption and removed the need for per-platform DOM adaptation.

Over two years of developing Midscene.js, we reached a belated but critical conclusion: UI automation will sooner or later shift from 'understanding the DOM' to 'looking at the screen'. So in the December 1.0 release, we cut the DOM compatibility path outright.

Early on, like everyone else, we took a DOM + visual hybrid approach, using the DOM wherever possible to save tokens and keep element positioning stable. But the deeper we went, the clearer it became: the same product now has to run on Web, iOS, Android, HarmonyOS, Mac, Windows, and Linux desktop, plus rendering layers like Canvas, Electron, and Qt that have no DOM at all. If we had to maintain a separate DOM adaptation for element positioning on each platform, the work would never converge.

So in 1.0, we switched UI operations entirely to pure vision: look only at screenshots, never read the DOM. An unexpected benefit: with no DOM in the prompt, token consumption turned out lower than under the previous hybrid approach. The repository link is in the comments.
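
To make the convergence argument concrete, here is a minimal sketch of why a screenshot-only path needs so little from each platform. It is not Midscene.js code: `ScreenDevice`, `Locator`, `tapByDescription`, `FakeDevice`, and `fakeLocate` are all hypothetical names invented for illustration, and the vision model is replaced by a stub that returns fixed coordinates.

```typescript
// Hypothetical sketch; none of these names come from Midscene.js.
interface Point {
  x: number;
  y: number;
}

// The only two capabilities a pure-visual pipeline asks of a platform.
interface ScreenDevice {
  screenshot(): Promise<Uint8Array>;
  tap(point: Point): Promise<void>;
}

// Stand-in for a vision model call: pixels plus an instruction in, coordinates out.
type Locator = (image: Uint8Array, instruction: string) => Promise<Point>;

// One code path for every platform: no DOM or accessibility tree is read.
async function tapByDescription(
  device: ScreenDevice,
  locate: Locator,
  instruction: string,
): Promise<Point> {
  const image = await device.screenshot();
  const point = await locate(image, instruction);
  await device.tap(point);
  return point;
}

// Fake device that records taps, so the sketch runs without a real browser or phone.
class FakeDevice implements ScreenDevice {
  taps: Point[] = [];

  async screenshot(): Promise<Uint8Array> {
    return new Uint8Array([0, 1, 2]);
  }

  async tap(point: Point): Promise<void> {
    this.taps.push(point);
  }
}

// Fake locator that always reports the target at a fixed position.
const fakeLocate: Locator = async () => ({ x: 120, y: 48 });
```

Under these assumptions, supporting a new platform means implementing `screenshot` and `tap` for it, while the locating logic stays shared. A hybrid design would additionally need a platform-specific element tree reader for each target.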
