WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

arXiv cs.CL Papers

Summary

This paper introduces WebRISE, a benchmark for evaluating MLLM-generated web artifacts using Interaction Contract Graphs (ICGs) to assess requirement-induced states and transitions across five input modalities. Experiments show even the strongest models achieve limited validity and coverage, with video input providing the strongest interaction signal.

arXiv:2606.03220v1 Announce Type: new Abstract: Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.

Cached at: 06/03/26, 09:37 AM

# WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts
Source: [https://arxiv.org/html/2606.03220](https://arxiv.org/html/2606.03220)
Yuxin Meng (1,2,\*), Yuhan Suo (1,2,\*), Junjie Wang (1,\*), Yuhan Sun (3,\*), Yiyao Yu (1), Ruixu Zhang (1), Ruining Hu (4), Yubin Wang (2), Shouwei Ruan (5), Bin Wang (2), Yuxiang Zhang (2,†), Yujiu Yang (1,†)

(1) Tsinghua University, (2) Huawei Noah’s Ark Lab, (3) East China Normal University, (4) Tongji University, (5) Institute of Artificial Intelligence, Beihang University

\* Equal contribution. † Corresponding authors.

[https://iigroup.github.io/WebRISE](https://iigroup.github.io/WebRISE)

###### Abstract

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: $V = 80.8$ yet $T = 15.5$). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2–16× the rate of checkpoint-style evaluation.


Under review.

## 1 Introduction

Multimodal large language models (MLLMs) are increasingly asked to generate executable web artifacts from multimodal specifications, including textual requirements, Markdown structures, sketches, screenshots, and interaction videos (Yin et al., [2024](https://arxiv.org/html/2606.03220#bib.bib33); Si et al., [2025](https://arxiv.org/html/2606.03220#bib.bib14); Chen et al., [2025](https://arxiv.org/html/2606.03220#bib.bib12); Liu et al., [2026](https://arxiv.org/html/2606.03220#bib.bib11)). This shift raises a basic benchmark question: when is a generated webpage usable, rather than merely visually plausible? In real use, a page can fail even when the expected controls are present: a filter may leave the item list unchanged, or a cart update may not propagate to the total price. Evaluating MLLM-generated web artifacts therefore requires testing requirement-implied state transitions and state-consistency constraints, rather than only initial appearance or isolated action outputs.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x1.png)

Figure 1: Overview of WebRISE. Top: representative prior evaluation protocols often rely on modality-fragmented inputs and local evidence, such as appearance, scripts, checkpoints, or open-ended exploration. Bottom: WebRISE evaluates generated web artifacts through a requirement-induced interaction contract: it supports five input modalities (❶), maps explicit and implicit requirements to test items and transitions (❷), defines DOM/visual transition checks (❸), executes them with a contract-guided agent (❹), and records transition-level verdicts with structured evidence (❺).

Recent benchmarks for web, UI, and artifact generation have moved beyond static visual fidelity by incorporating interaction evidence, such as dynamic screenshots and MLLM-as-a-judge checklists (Zhang et al., [2025](https://arxiv.org/html/2606.03220#bib.bib8)), predefined scripts (Zhu et al., [2025](https://arxiv.org/html/2606.03220#bib.bib9)), web-navigation agents (Lu et al., [2026](https://arxiv.org/html/2606.03220#bib.bib10)), real user requirements (Liu et al., [2026](https://arxiv.org/html/2606.03220#bib.bib11)), and interaction videos (Chen et al., [2025](https://arxiv.org/html/2606.03220#bib.bib12)). These efforts establish interaction as a central dimension of web generation evaluation. However, existing protocols still tend to operationalize interaction through local evidence rather than requirement-level state obligations. This creates two limitations. (i) Event-centric evaluation: screenshots, script steps, video trajectories, or expected-result checkpoints can verify whether a local action produces a response, but they do not explicitly define which requirement-induced states and transitions should be covered. (ii) State-consistency gap: a local response may pass even when the page violates cross-component, cross-view, or cross-step constraints, such as filter–pagination synchronization, count updates after deletion, or hidden-state preservation after navigation. In short, existing benchmarks make interaction observable, but not yet fully enumerable or attributable as a requirement-induced state space. [Fig. 1](https://arxiv.org/html/2606.03220#S1.F1) summarizes this contrast.

To address these limitations, we introduce WebRISE, a benchmark that evaluates MLLM-generated web artifacts as *requirement-induced observable state-transition conformance*. WebRISE derives a finite interaction contract from task requirements, consisting of observable UI states, user-intent transitions, and DOM/visual assertions, and tests whether a generated page conforms to this contract under browser execution. It is built on two design choices: *requirement-conditioned state modeling*, which represents each task as an Interaction Contract Graph (ICG), and *conformance-based diagnostic evaluation*, which links transition outcomes back to explicit requirements and implicit state-consistency constraints.

Concretely, WebRISE converts explicit and implicit requirements into Test Data Contracts and test items, then compiles them into an Interaction Contract Graph (ICG). ICG states are requirement-relevant observable UI configurations rather than full DOM snapshots, while transient behaviors such as loading, saving, debounce, and temporary disabled states are verified as transition-level DOM evidence. Each task is instantiated under Text, Markdown, Sketch, Image, and Video inputs, and models generate self-contained executable HTML pages. During evaluation, the ICG specifies *what* to verify, a contract-guided agent decides *how* to execute each transition, and a DOM/visual dual oracle verifies process evidence and user-visible outcomes. The resulting reports are aggregated into state-, transition-, and requirement-level diagnostics, including $S\%$, $T\%$, $Re\%$, $Ri\%$, and $R\%$.

Table 1: Comparison with related web generation benchmarks. Verdict: the mechanism used for pass/fail judgment. Exp.Req / Imp.Req: whether the benchmark includes explicit (user-stated) and implicit (unstated product-level) requirements separately. Input Modality: number of supported modalities with types listed.

We evaluate WebRISE on 442 tasks, 5 input modalities, and 14 representative models, and obtain three main findings. First, interactive web generation remains far from solved: even the strongest model, GPT-5.5, reaches only $T = 65.6\%$ and $R = 66.3\%$ under its best modality, leaving roughly one third of required transitions or requirement checks unsatisfied. Second, multimodal specifications improve interaction quality, with Video being the strongest modality: compared with Text, it improves $T$, $R$, and $R_i$ by 8.8, 8.3, and 10.6 percentage points, respectively. Third, implicit state constraints remain a consistent bottleneck: explicit requirements are easier across models, and hard tasks are enriched with feedback, error, edge-state, and boundary-condition failures. As an additional evaluator sanity check, defect injection on GT-validated pages shows that ICG-based evaluation detects 16/25 injected state-related defects, compared with 8/25 under a broad checkpoint-style WebGen criterion and 1/25 under a strict one.
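As a quick arithmetic check on the defect-injection counts, and assuming the 2–16× range quoted in the abstract is derived from these same 25 injected defects (the text does not state this explicitly), the detection rates and their ratios work out as:

$$
\frac{16}{25} = 64\%, \qquad \frac{8}{25} = 32\%, \qquad \frac{1}{25} = 4\%,
$$

$$
\frac{16/25}{8/25} = 2, \qquad \frac{16/25}{1/25} = 16 .
$$

Under that reading, the lower end of the range corresponds to the broad checkpoint-style criterion and the upper end to the strict one.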

Our contributions are threefold:

- We introduce WebRISE, a benchmark that reframes MLLM-generated web artifact evaluation as requirement-induced observable state-transition conformance, covering 442 tasks, five input modalities, and explicit/implicit requirement contracts.
- We develop a contract-guided evaluation protocol that represents each task with an Interaction Contract Graph, executes transitions with an adaptive browser agent, and verifies process and outcome evidence through DOM/visual oracles.
- We conduct a large-scale evaluation of 14 representative models, revealing that current systems remain far from solving interactive web generation, that Video provides the strongest interaction signal, and that implicit state constraints remain a major bottleneck.

## 2 Related Work

MLLM-generated web artifacts. Multimodal large language models are increasingly moving from UI understanding and static code generation toward executable web artifact generation (Yin et al., [2024](https://arxiv.org/html/2606.03220#bib.bib33)). Early UI-to-code, design-to-code, and sketch-to-code studies mainly evaluate whether models can recover layout, visual structure, and front-end code from textual or visual specifications (Si et al., [2025](https://arxiv.org/html/2606.03220#bib.bib14); Jain et al., [2019](https://arxiv.org/html/2606.03220#bib.bib15); Periasami et al., [2026](https://arxiv.org/html/2606.03220#bib.bib16)). Recent work further expands this setting to automated functional testing (Zhu et al., [2025](https://arxiv.org/html/2606.03220#bib.bib9); Lu et al., [2026](https://arxiv.org/html/2606.03220#bib.bib10)), dynamic visual-interactive evaluation (Zhang et al., [2025](https://arxiv.org/html/2606.03220#bib.bib8)), real user requirements with interpretable metrics (Liu et al., [2026](https://arxiv.org/html/2606.03220#bib.bib11)), interactive webpage reconstruction from video (Chen et al., [2025](https://arxiv.org/html/2606.03220#bib.bib12)), and agentic interactive verification (Xu et al., [2025](https://arxiv.org/html/2606.03220#bib.bib13)).

This shift changes what should be evaluated. For static pages or local components, visual fidelity, structural similarity, and code executability are natural targets. For interactive web artifacts, however, the key question is whether the page responds correctly to user actions and preserves task-implied state constraints. Accordingly, WebRISE evaluates MLLM-generated web artifacts as executable, stateful interfaces rather than merely rendered pages or code.

Interactive web evaluation. Existing web generation benchmarks increasingly evaluate interaction through scripts, agents, visual judges, or demonstrated trajectories. Script-based protocols such as FrontendBench (Zhu et al., [2025](https://arxiv.org/html/2606.03220#bib.bib9)) provide reproducible functional checks but often depend on implementation-specific selectors or entry points. Checkpoint-style protocols such as WebGen-Bench (Lu et al., [2026](https://arxiv.org/html/2606.03220#bib.bib10)) use web-navigation agents to verify expected results, but still focus on local action–result pairs. MLLM-judge and video-based protocols, such as ArtifactsBench (Zhang et al., [2025](https://arxiv.org/html/2606.03220#bib.bib8)) and IWR-Bench (Chen et al., [2025](https://arxiv.org/html/2606.03220#bib.bib12)), assess rendered evidence or trajectory reproduction. Beyond generation benchmarks, agent-based web testing systems such as WebProber (Ye et al., [2025](https://arxiv.org/html/2606.03220#bib.bib17)) and UXAgent (Lu et al., [2025](https://arxiv.org/html/2606.03220#bib.bib18)) explore websites to identify bugs or usability issues. These protocols make interaction observable, but typically operationalize it through scripts, checkpoints, trajectories, visual evidence, or exploration traces. WebRISE instead formulates interaction evaluation as requirement conformance: an ICG defines requirement-linked states, transitions, and assertions, and an adaptive agent executes them on each generated page, supporting diagnosis beyond pass/fail outcomes.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x2.png)

Figure 2: Overview of WebRISE. WebRISE converts multimodal web generation tasks into Interaction Contract Graphs (ICGs), executes each state transition with a contract-guided agent, verifies process and outcome evidence with DOM/visual oracles, and aggregates transition-level verdicts into diagnostic scores.

## 3 WebRISE: Benchmark Design

[Fig. 2](https://arxiv.org/html/2606.03220#S2.F2) summarizes the benchmark pipeline. WebRISE converts task requirements into executable interaction contracts and evaluates generated HTML through browser-based conformance checks.

### 3.1 Task Definition

WebRISE evaluates whether an MLLM can generate an executable web artifact that satisfies the interaction behavior of a user-facing task. For each task $\tau$, we define a requirement set $R_\tau$ and five modality-specific specifications $x_\tau^m$, where

$$m \in \mathcal{M} = \{\text{Text}, \text{Markdown}, \text{Sketch}, \text{Image}, \text{Video}\}. \tag{1}$$

Given $x_\tau^m$, a model $f_\theta$ generates a self-contained HTML artifact:

$$h_{\theta,\tau}^{m} = f_\theta(x_\tau^m). \tag{2}$$

The artifact must be directly executable in a browser and include the required HTML, CSS, and JavaScript without external back-end services or manually prepared runtime state.

For each task, WebRISE derives a requirement-induced interaction contract $G_\tau$ from $R_\tau$. The core evaluation asks whether $h_{\theta,\tau}^{m}$ satisfies $G_\tau$ under browser execution, rather than whether it matches a reference DOM, follows a fixed selector path, or reproduces a single visual snapshot. Since $G_\tau$ is shared across modalities, WebRISE compares how textual, structural, visual, and temporal specifications affect generation of the same required interactive behavior. Detailed modality construction procedures, prompt templates, and Image/Video specification rules are provided in [Sec. A.2](https://arxiv.org/html/2606.03220#A1.SS2).

Ground-truth HTML pages validate contract executability and, when needed, provide Image/Video specifications, but are not treated as unique reference implementations.

### 3.2 Requirement-Induced Interaction Contracts

For each task $\tau$, WebRISE derives an interaction contract from the requirement set $R_\tau$ and represents it as an Interaction Contract Graph (ICG):

$$G_\tau = (S_\tau, T_\tau, \Phi_\tau, M_\tau). \tag{3}$$

Here, $S_\tau$ denotes stable and replayable UI states, $T_\tau$ denotes user-intent-driven transitions, $\Phi_\tau$ denotes observable DOM/visual predicates, and $M_\tau$ maps requirements to test items, transitions, and assertions.

The states in $S_\tau$ are requirement-relevant observable UI configurations, rather than full DOM snapshots. Transient effects such as loading indicators, saving states, toasts, debounce effects, and temporary disabled controls are not modeled as standalone states; they are attached to transitions as process-level predicates. This keeps the state space finite and stable while preserving evidence for intermediate interaction behavior.

Each transition in $T_\tau$ specifies a user-intent state change, describing the desired outcome rather than a selector-level action sequence. Predicates in $\Phi_\tau$ verify the transition through DOM evidence for structural or process-level signals and visual evidence for final user-visible outcomes, allowing the same contract to apply across diverse implementations.

The mapping $M_\tau$ connects transition-level evidence back to the original requirements. Explicit requirements describe user-stated functional affordances, whereas implicit requirements capture product-level constraints such as state synchronization, boundary feedback, pagination reset, loading feedback, and stale-state removal. Consequently, the contract specifies not only which interactions should be executed, but also how their evidence contributes to requirement-level evaluation.
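To make the four components of the ICG concrete, the following is a minimal Python sketch of one possible in-memory representation. The class names, field names, and the `validate` helper are illustrative assumptions and are not taken from the WebRISE release; the predicates $\Phi_\tau$ are stored here as plain strings attached to each transition.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transition:
    """One user-intent transition (field names are hypothetical)."""
    source: str                         # source state
    target: str                         # target state
    goal: str                           # natural-language goal for the agent
    dom_assertions: tuple = ()          # process- or element-level predicates
    visual_postconditions: tuple = ()   # final user-visible outcomes


@dataclass
class ICG:
    """G = (S, T, Phi, M); Phi lives on the transitions as assertions."""
    states: set
    transitions: dict                   # transition id -> Transition
    coverage: dict = field(default_factory=dict)  # requirement id -> [transition ids]

    def validate(self) -> list:
        """Return schema problems; an empty list means the graph is well-formed."""
        problems = []
        for tid, t in self.transitions.items():
            for s in (t.source, t.target):
                if s not in self.states:
                    problems.append(f"{tid}: unknown state {s!r}")
        for rid, tids in self.coverage.items():
            for tid in tids:
                if tid not in self.transitions:
                    problems.append(f"{rid}: unknown transition {tid!r}")
        return problems


# A toy contract: one explicit and one implicit requirement share a transition.
icg = ICG(
    states={"all_items", "filtered"},
    transitions={
        "t1": Transition(
            "all_items", "filtered", "Filter the list by category",
            dom_assertions=("[AFTER] list shows only matching items",),
            visual_postconditions=("item list is shorter",),
        ),
    },
    coverage={"R_exp_1": ["t1"], "R_imp_1": ["t1"]},
)
print(icg.validate())
```

Keeping the goal as natural language, rather than a selector path, mirrors the stated intent that the same contract applies across different implementations of a page.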

### 3.3 Contract Construction Pipeline

WebRISE constructs one interaction contract for each task and applies it to all model outputs across modalities. The pipeline starts from expert-provided task materials and converts them into executable, requirement-attributable interaction contracts through four steps.

Step 1: Expert-informed task collection. We design collection templates specifying the target domain, scenario, and expected web application setting. Anonymous industry practitioners provide domain-grounded task materials, including user-facing requirements, representative interaction goals, and task-relevant data assumptions. These materials serve as raw task sources, rather than executable evaluation specifications.

Step 2: Requirement normalization. We normalize the collected materials into a requirement set $R_\tau$ for each task $\tau$. Each set contains explicit requirements for user-stated functional affordances, such as search, filtering, sorting, dragging, and navigation, and implicit requirements for product-level interaction constraints, such as state synchronization, boundary feedback, pagination reset, loading feedback, and stale-state removal.

Step 3: Test Data Contract and test items. From $R_\tau$, WebRISE derives a Test Data Contract specifying the minimal functional readiness for evaluation, such as initial data, filters, navigation entries, or loadable content, without constraining layout, DOM hierarchy, style, or exact element counts. It derives test items that describe user-triggered behaviors and expected semantic outcomes, rather than CSS selectors, DOM paths, or click sequences.

Step 4: ICG compilation. The Test Data Contract and test items are compiled into the Interaction Contract Graph $G_\tau$. Stable configurations become states, user-triggered behaviors become transitions, and expected outcomes become DOM assertions or visual postconditions. WebRISE also constructs the coverage mapping $M_\tau$, linking requirements to test items, transitions, and assertions.

This pipeline separates domain task authoring from executable evaluation design. Practitioners provide realistic task content, while WebRISE converts it into an interaction contract that defines what should be evaluated; [Sec. 4](https://arxiv.org/html/2606.03220#S4) describes how the contract is executed on generated pages.

### 3.4 Benchmark Statistics and Quality Control

[Fig. 3](https://arxiv.org/html/2606.03220#S3.F3) shows that WebRISE spans diverse web application settings, with detailed construction statistics reported in [Sec. A.1](https://arxiv.org/html/2606.03220#A1.SS1).

After constructing each ICG, we validate it with a ground-truth HTML page generated from the full requirement set. A task is retained only when the ground-truth page, the ICG, and the evaluator form a stable executable loop. We also run schema checks over requirements, test items, states, transitions, assertions, and coverage mappings. Human consistency validation is provided in [Sec. A.3](https://arxiv.org/html/2606.03220#A1.SS3).

![Refer to caption](https://arxiv.org/html/2606.03220v1/x3.png)

Figure 3: Domain and scenario distribution of WebRISE. Tasks cover 8 domains and 35 scenarios, such as Productivity Tools (23.76%) and Social Interaction (16.97%).

## 4 Evaluation Protocol

Given a generated HTML artifact $H$ and its Interaction Contract Graph $G_\tau$, WebRISE evaluates contract conformance under browser execution. The ICG specifies *what* to verify, while a contract-guided agent determines *how* to execute each transition on the generated page.

### 4.1 Protocol Overview

Each transition is represented as

$$t_j = (s_j^{\mathrm{from}}, s_j^{\mathrm{to}}, g_j, P_j, A_j^{\mathrm{dom}}, A_j^{\mathrm{vis}}), \tag{4}$$

where $s_j^{\mathrm{from}}$ and $s_j^{\mathrm{to}}$ are source and target states, $g_j$ is the natural-language agent goal, $P_j$ is the precondition set, and $A_j^{\mathrm{dom}}$, $A_j^{\mathrm{vis}}$ are DOM assertions and visual postconditions. This transition-level formulation supports branching state graphs and localizes evidence to requirement-linked state changes.

[Algorithm 1](https://arxiv.org/html/2606.03220#alg1) summarizes the evaluation loop. A transition is marked as Pass only if the source state is reachable, the agent completes the intended interaction, and all required DOM/visual checks hold. The resulting reports are aggregated into the diagnostic metrics in [Sec. 4.4](https://arxiv.org/html/2606.03220#S4.SS4).

Algorithm 1: Contract-Guided Evaluation

1. Input: page $H$, transitions $\mathcal{T}$, budget $K$, settle delay $\Delta$
2. Load $H$; initialize replay cache $\Pi \leftarrow \emptyset$
3. for $t_j = (s_j^{\mathrm{from}}, s_j^{\mathrm{to}}, g_j, P_j, A_j^{\mathrm{dom}}, A_j^{\mathrm{vis}}) \in \mathcal{T}$ do
4. Restore $s_j^{\mathrm{from}}$ by replaying $\Pi$
5. if restore fails then
6. $o_j \leftarrow \text{Skipped}$; record evidence; continue
7. end if
8. Capture $\mathrm{img}_{\mathrm{pre}}$; check $P_j$
9. if any precondition fails then
10. $o_j \leftarrow \text{Fail}$; record evidence; continue
11. end if
12. Monitor DOM events; run agent on $g_j$ with budget $K$
13. Wait $\Delta$; capture $\mathrm{img}_{\mathrm{post}}$; freeze event log $\mathcal{L}$
14. $r_{\mathrm{dom}} \leftarrow \text{ScoreDOM}(A_j^{\mathrm{dom}}, \mathcal{L})$
15. $r_{\mathrm{vis}} \leftarrow \text{ScoreVisual}(A_j^{\mathrm{vis}}, \mathrm{img}_{\mathrm{pre}}, \mathrm{img}_{\mathrm{post}})$
16. $o_j \leftarrow \text{Aggregate}(\text{agent status}, r_{\mathrm{dom}}, r_{\mathrm{vis}})$
17. if $o_j = \text{Pass}$ then
18. Update $\Pi$ with the trajectory reaching $s_j^{\mathrm{to}}$
19. end if
20. Record evidence $\mathcal{E}_j$
21. end for
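The control flow of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the browser harness is replaced by injected callables (`restore`, `check_preconditions`, `run_agent`, `score_dom`, `score_visual`, all hypothetical names), screenshots, the budget $K$, and the settle delay $\Delta$ are omitted, and the aggregate step is assumed to be a plain conjunction, following the statement that a transition passes only if the agent completes the interaction and all DOM/visual checks hold.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    SKIPPED = "skipped"


def evaluate(transitions, restore, check_preconditions, run_agent,
             score_dom, score_visual):
    """Run each transition in order and return {transition id: Outcome}.

    All callables are hypothetical hooks standing in for the browser harness.
    """
    replay = {}      # replay cache: state -> trajectory known to reach it
    outcomes = {}
    for t in transitions:
        if not restore(t["source"], replay):            # steps 4-7
            outcomes[t["id"]] = Outcome.SKIPPED
            continue
        if not check_preconditions(t):                  # steps 8-11
            outcomes[t["id"]] = Outcome.FAIL
            continue
        agent_ok, trajectory, event_log = run_agent(t["goal"])   # steps 12-13
        passed = agent_ok and score_dom(t, event_log) and score_visual(t)
        outcomes[t["id"]] = Outcome.PASS if passed else Outcome.FAIL
        if passed:                                      # steps 17-19
            replay[t["target"]] = replay.get(t["source"], []) + trajectory
    return outcomes


# Toy run with stubbed hooks: t2 fails its DOM check, t3 starts from an
# unreachable state.
transitions = [
    {"id": "t1", "source": "s0", "target": "s1", "goal": "open the cart"},
    {"id": "t2", "source": "s1", "target": "s2", "goal": "remove an item"},
    {"id": "t3", "source": "s9", "target": "s3", "goal": "apply a coupon"},
]


def restore(state, replay):
    # The initial state s0 is always restorable; others need a cached trajectory.
    return state == "s0" or state in replay


outcomes = evaluate(
    transitions,
    restore=restore,
    check_preconditions=lambda t: True,
    run_agent=lambda goal: (True, [goal], ["click"]),
    score_dom=lambda t, log: t["id"] != "t2",
    score_visual=lambda t: True,
)
for tid, outcome in outcomes.items():
    print(tid, outcome.name)
```

Because only passed transitions extend the replay cache, a failure upstream surfaces downstream as Skipped rather than Fail, which is how the loop separates unreachable states from contract violations.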

### 4.2 Contract-Guided Agent Execution

WebRISE uses an adaptive browser agent rather than a precompiled script. At each step, the page is serialized into an indexed DOM observation containing interaction-relevant controls, state fields, newly appeared elements, scroll context, and editable text selections. Because indices are regenerated after each action, execution depends on the current page state rather than fixed selectors or reference DOM paths. For branching ICGs, source states are restored by replaying previously verified trajectories, which isolates transitions and separates unreachable states from executable contract violations.

### 4.3 DOM/Visual Oracle and Evidence

Each transition is verified with a dual-channel oracle. DOM assertions score process-level or element-level evidence from the event log, with [CHANGE] checking transient evidence during execution and [AFTER] checking the final stable DOM state. Visual postconditions compare pre/post screenshots to verify final user-visible outcomes such as list updates, sorting changes, moved cards, opened panels, or empty states. For auditability, WebRISE records the agent trace, DOM log, screenshots, assertion verdicts, and final transition outcome. Details are provided in [Appendix B](https://arxiv.org/html/2606.03220#A2).
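One way to read the [CHANGE] versus [AFTER] distinction is sketched below. The event log is modeled as a chronological list of DOM snapshots (plain dictionaries), a [CHANGE] assertion passes if its predicate holds in at least one snapshot, and an [AFTER] assertion passes only if it holds in the last one. The log format and the function name are assumptions made for illustration; the actual scoring rules are specified in the paper's appendix.

```python
def score_dom(assertions, event_log):
    """Return True only if every assertion holds.

    assertions: list of (phase, predicate), with phase in {"CHANGE", "AFTER"}.
    event_log:  chronological list of DOM snapshots (plain dicts here).
    """
    if not event_log:
        return False
    for phase, predicate in assertions:
        if phase == "CHANGE":
            # Transient evidence: true at some point during execution.
            ok = any(predicate(snap) for snap in event_log)
        elif phase == "AFTER":
            # Stable evidence: true in the final snapshot.
            ok = predicate(event_log[-1])
        else:
            raise ValueError(f"unknown phase: {phase}")
        if not ok:
            return False
    return True


log = [
    {"spinner": False, "rows": 10},
    {"spinner": True, "rows": 10},   # transient loading indicator
    {"spinner": False, "rows": 3},   # final stable state
]
assertions = [
    ("CHANGE", lambda d: d["spinner"]),     # loading feedback appeared at some point
    ("AFTER", lambda d: d["rows"] == 3),    # the list ends up filtered
]
print(score_dom(assertions, log))
```

A checker that inspected only the final DOM would miss the spinner entirely, which is why transient behavior is treated as process-level evidence on the transition.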

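| Model | Text T | Text R | Text V | MD T | MD R | MD V | Sketch T | Sketch R | Sketch V | Image T | Image R | Image V | Video T | Video R | Video V | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Open-Source* | | | | | | | | | | | | | | | | |
| Qwen3.6-35B-A3B | 26.8 | 30.5 | 78.2 | 15.5 | 19.2 | 80.8 | 41.2 | 45.4 | 77.0 | 46.6 | 49.6 | 71.7 | 49.5 | 52.2 | 72.8 | 50.5 |
| Qwen3.5-122B-A10B | 38.0 | 41.2 | 56.8 | 42.5 | 45.9 | 72.0 | 38.0 | 42.3 | 74.0 | 40.2 | 43.8 | 70.7 | 42.8 | 47.1 | 71.3 | 51.1 |
| Qwen3.5-27B | 36.3 | 40.0 | 59.9 | 41.7 | 45.5 | 72.1 | 38.6 | 42.7 | 76.8 | 42.6 | 46.7 | 70.6 | 43.1 | 46.9 | 71.8 | 51.7 |
| Qwen3.5-397B-A17B | 45.7 | 49.2 | 64.8 | 51.1 | 54.5 | 75.7 | 46.8 | 50.5 | 78.9 | 48.4 | 51.4 | 72.8 | 49.3 | 52.8 | 72.1 | 57.6 |
| Kimi-K2.5 | 48.5 | 51.9 | 68.9 | 57.0 | 59.6 | 73.8 | 47.8 | 50.4 | 79.9 | 56.9 | 59.1 | 72.6 | 58.6 | 60.3 | 72.9 | 61.2 |
| Qwen3.6-27B | 47.9 | 50.9 | 75.3 | 57.5 | 60.1 | 83.0 | 50.4 | 53.3 | 87.2 | 55.2 | 57.8 | 74.1 | 54.2 | 57.2 | 74.1 | 62.5 |
| Kimi-K2.6 | 44.6 | 47.3 | 83.1 | 51.7 | 54.9 | 87.1 | 47.8 | 51.5 | 86.3 | 58.5 | 60.4 | 73.2 | 63.7 | 65.4 | 73.5 | 63.3 |
| *Proprietary* | | | | | | | | | | | | | | | | |
| Claude Opus 4.6 | 43.3 | 45.5 | 56.6 | 54.3 | 56.3 | 73.9 | 52.3 | 55.0 | 72.2 | 57.7 | 59.5 | 70.2 | 52.6 | 54.9 | 70.7 | 58.3 |
| Gemini 3 Flash | 44.7 | 48.2 | 71.9 | 50.0 | 54.1 | 79.3 | 46.1 | 49.3 | 85.4 | 54.1 | 57.5 | 72.4 | 45.6 | 48.5 | 70.8 | 58.5 |
| Claude Opus 4.7 | 48.8 | 50.9 | 68.3 | 54.5 | 56.5 | 76.2 | 49.7 | 52.4 | 77.4 | 57.0 | 58.5 | 70.5 | 65.0 | 66.1 | 72.7 | 61.6 |
| Gemini 3.1 Pro | 50.7 | 53.6 | 69.7 | 58.9 | 61.5 | 79.2 | 52.2 | 54.9 | 84.8 | 54.5 | 57.1 | 72.2 | 52.0 | 54.9 | 71.6 | 61.9 |
| Qwen3.6-Plus | 49.3 | 51.9 | 68.2 | 51.7 | 54.6 | 74.5 | 53.8 | 56.4 | 86.3 | 57.5 | 59.4 | 73.8 | 61.7 | 63.4 | 74.8 | 62.5 |
| GPT-5.4 | 59.7 | 61.4 | 78.4 | 60.5 | 62.2 | 79.8 | 57.8 | 60.3 | 86.6 | 60.0 | 62.1 | 71.5 | 63.1 | 64.8 | 73.7 | 66.8 |
| GPT-5.5 | 60.3 | 62.3 | 85.6 | 64.4 | 66.1 | 83.3 | 60.6 | 62.9 | 86.1 | 61.8 | 63.4 | 74.1 | 65.6 | 66.3 | 73.9 | 69.1 |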

Table 2: Overall model performance on WebRISE across five input modalities. We report transition validity ($T$), overall requirement coverage ($R$), and auxiliary visual quality ($V$); Overall is a compact average of $T$, $R$, and $V$ across modalities. Bold and underline denote the best and second-best results within each model group.
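The Overall column can be sanity-checked with a few lines of Python. The sketch below assumes that "compact average" means the unweighted mean of all fifteen $T$, $R$, and $V$ values for a model; the caption does not spell out the formula, but this reading reproduces the reported Overall values for the two rows used here (GPT-5.5 and Qwen3.6-35B-A3B). The function name is hypothetical.

```python
def compact_overall(scores):
    """Mean of every T, R, and V value across the modalities, rounded to 0.1."""
    values = [v for triple in scores.values() for v in triple]
    return round(sum(values) / len(values), 1)


# (T, R, V) per modality, copied from Table 2.
gpt_5_5 = {
    "Text": (60.3, 62.3, 85.6),
    "MD": (64.4, 66.1, 83.3),
    "Sketch": (60.6, 62.9, 86.1),
    "Image": (61.8, 63.4, 74.1),
    "Video": (65.6, 66.3, 73.9),
}
qwen36_35b = {
    "Text": (26.8, 30.5, 78.2),
    "MD": (15.5, 19.2, 80.8),
    "Sketch": (41.2, 45.4, 77.0),
    "Image": (46.6, 49.6, 71.7),
    "Video": (49.5, 52.2, 72.8),
}

print(compact_overall(gpt_5_5))      # reported Overall: 69.1
print(compact_overall(qwen36_35b))   # reported Overall: 50.5
```

Since $V$ enters the average with the same weight as $T$ and $R$, a model with strong visual scores can post a mid-range Overall despite weak interaction scores, so the per-modality $T$ and $R$ columns are the more direct measure of behavior.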

### 4.4 Diagnostic Metrics

WebRISE reports diagnostics as different projections of the same interaction contract. After evaluation, each transition receives one outcome in $\{\text{Pass}, \text{Fail}, \text{Blocked}, \text{Skipped}\}$. Only Pass is counted as successful, which avoids giving credit to incomplete interactions or unreachable states.

State and transition metrics. Let $S_\tau$ and $T_\tau$ denote the state and transition sets in $G_\tau$. Let $S_\tau^{\mathrm{reach}}$ be the set of reached states, where the initial state is reachable only when its preconditions hold and any other state is reachable only through a passed incoming transition. Let $T_\tau^{\mathrm{pass}}$ be the set of transitions marked as Pass. We define:

$$S\%(\tau) = \frac{|S_\tau^{\mathrm{reach}}|}{|S_\tau|} \times 100, \tag{5}$$

$$T\%(\tau) = \frac{|T_\tau^{\mathrm{pass}}|}{|T_\tau|} \times 100. \tag{6}$$

Here, $S\%$ measures state reachability, while $T\%$ measures transition-level interaction correctness.
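As a worked instance of Eqs. (5) and (6), the sketch below computes both scores for a toy four-state graph. The function and argument names are illustrative; reached states are taken to be the initial state (when its preconditions hold) plus the targets of passed transitions, following the definition above.

```python
def state_and_transition_scores(states, initial, initial_ok, transitions, passed):
    """Return (S%, T%) for one task.

    transitions: {transition id: (source, target)}
    passed:      set of transition ids marked Pass
    """
    reached = {initial} if initial_ok else set()
    # Any other state counts as reached only via a passed incoming transition.
    reached |= {transitions[tid][1] for tid in passed}
    s_pct = 100 * len(reached) / len(states)
    t_pct = 100 * len(passed) / len(transitions)
    return s_pct, t_pct


states = {"s0", "s1", "s2", "s3"}
transitions = {
    "t1": ("s0", "s1"),
    "t2": ("s1", "s2"),
    "t3": ("s1", "s3"),
    "t4": ("s2", "s0"),
}
scores = state_and_transition_scores(
    states, initial="s0", initial_ok=True,
    transitions=transitions, passed={"t1", "t2"},
)
print(scores)   # three of four states reached, two of four transitions passed
```

The two scores can diverge: here a returning transition such as `t4` fails without lowering state reachability, because its target was already reached.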

Requirement coverage. Let $R_\tau^{\mathrm{exp}}$ and $R_\tau^{\mathrm{imp}}$ denote explicit and implicit requirements, with $R_\tau = R_\tau^{\mathrm{exp}} \cup R_\tau^{\mathrm{imp}}$. Using the coverage mapping $M_\tau$, each requirement $r$ is linked to the transitions and assertions that verify it. We set $\mathrm{sat}(r) = 1$ if all mapped checks for $r$ pass, and $0$ otherwise. For any requirement subset $\hat{R} \in \{R_\tau^{\mathrm{exp}}, R_\tau^{\mathrm{imp}}, R_\tau\}$, we define:

$$\mathcal{C}(\hat{R}) = \frac{1}{|\hat{R}|} \sum_{r \in \hat{R}} \mathrm{sat}(r) \times 100. \tag{7}$$

Applying $\mathcal{C}$ to $R_\tau^{\mathrm{exp}}$, $R_\tau^{\mathrm{imp}}$, and $R_\tau$ gives $Re\%$, $Ri\%$, and $R\%$, respectively. $Re\%$ measures user-stated functional affordances, while $Ri\%$ measures implicit state-consistency constraints such as synchronization, boundary feedback, reset behavior, and stale-state removal.
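Eq. (7) translates directly into code. In the sketch below, the names are hypothetical, and a requirement with no mapped checks is treated as unsatisfied; that edge case is an assumption of this example, since the text does not say how empty mappings are handled.

```python
def coverage(requirements, mapping, check_passed):
    """C(R_hat): percentage of requirements whose mapped checks all pass."""
    if not requirements:
        raise ValueError("empty requirement subset")
    satisfied = 0
    for r in requirements:
        checks = mapping.get(r, [])
        # Assumption: a requirement with no mapped checks is not satisfied.
        if checks and all(check_passed[c] for c in checks):
            satisfied += 1
    return 100 * satisfied / len(requirements)


explicit = ["e1", "e2"]
implicit = ["i1", "i2"]
mapping = {"e1": ["t1"], "e2": ["t1", "t2"], "i1": ["t2"], "i2": ["t3"]}
check_passed = {"t1": True, "t2": True, "t3": False}

re_pct = coverage(explicit, mapping, check_passed)
ri_pct = coverage(implicit, mapping, check_passed)
r_pct = coverage(explicit + implicit, mapping, check_passed)
print(re_pct, ri_pct, r_pct)
```

Because $\mathrm{sat}(r)$ is all-or-nothing, a single failed check zeroes out its requirement, so $R\%$ can sit well below $T\%$ when requirements map to many transitions.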

Aggregation. All metrics are computed at the task level and macro-averaged over tasks:

$$\bar{q}(\theta, m) = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} q(\theta, \tau, m), \tag{8}$$

where $q \in \{S\%, T\%, Re\%, Ri\%, R\%\}$. This prevents tasks with more transitions or assertions from dominating the aggregate score.
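The effect of macro-averaging is easiest to see by contrasting it with pooling all transitions together. The numbers below are invented for illustration only: one small task where both transitions pass, and one large task where 5 of 20 pass.

```python
def macro_average(per_task_scores):
    """Eq. (8): unweighted mean of task-level scores."""
    return sum(per_task_scores) / len(per_task_scores)


# (passed transitions, total transitions) per task; illustrative values.
tasks = [(2, 2), (5, 20)]

per_task_t = [100 * p / n for p, n in tasks]   # task-level T%: 100.0 and 25.0
macro = macro_average(per_task_t)

# Pooled (micro) alternative, shown only for contrast: the large task dominates.
micro = 100 * sum(p for p, _ in tasks) / sum(n for _, n in tasks)

print(macro, round(micro, 2))
```

With macro-averaging each task contributes equally, so the result is 62.5; pooling the 22 transitions instead gives about 31.82, driven almost entirely by the larger task.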

## 5 Experiments and Findings

### 5.1 Experimental Setup

We evaluate WebRISE on 14 representative models. The model set includes 7 open-weight models and 7 proprietary models. The open-weight models are Qwen3.5-27B (Team, [2026b](https://arxiv.org/html/2606.03220#bib.bib19)), Qwen3.5-122B, Qwen3.5-397B, Qwen3.6-27B (Qwen Team, [2026](https://arxiv.org/html/2606.03220#bib.bib29)), Qwen3.6-35B-A3B, Kimi K2.5 (Team, [2026a](https://arxiv.org/html/2606.03220#bib.bib20)), and Kimi K2.6 (Moonshot AI, [2026](https://arxiv.org/html/2606.03220#bib.bib30)). The proprietary models are GPT-5.4 (OpenAI, [2026a](https://arxiv.org/html/2606.03220#bib.bib21)), GPT-5.5 (OpenAI, [2026b](https://arxiv.org/html/2606.03220#bib.bib22)), Claude Opus 4.6 (Anthropic, [2026a](https://arxiv.org/html/2606.03220#bib.bib23)), Claude Opus 4.7 (Anthropic, [2026b](https://arxiv.org/html/2606.03220#bib.bib24)), Gemini-3 Flash (Google DeepMind, [2025](https://arxiv.org/html/2606.03220#bib.bib25)), Gemini-3.1 Pro (Google DeepMind, [2026](https://arxiv.org/html/2606.03220#bib.bib26)), and Qwen3.6-Plus.

### 5.2 Overall Model Performance

[Table 2](https://arxiv.org/html/2606.03220#S4.T2) shows that interactive web artifact generation remains far from saturated. Although GPT-5.5 achieves the highest compact Overall score, even its best modality, Video, reaches only $T=65.6$ and $R=66.3$, leaving roughly one third of required transitions or requirement checks unsatisfied.

Proprietary models lead, but open-weight models remain competitive. GPT-5.5 and GPT-5.4 obtain the top two Overall scores, 69.1 and 66.8. However, the gap is not determined solely by model access type. Kimi-K2.6 achieves the best open-weight Overall score (63.3), surpassing several proprietary systems and performing especially well under Image and Video. Qwen3.6-27B also reaches a competitive Overall score (62.5), with strong Markdown and Sketch results. These trends suggest that modality handling and stateful interaction reasoning contribute substantially to model ranking.

Visual quality is not a proxy for interaction correctness. High visual scores can coexist with weak executable behavior: Qwen3.6-35B-A3B obtains a strong Markdown visual score ($V=80.8$), but much lower interaction scores ($T=15.5$, $R=19.2$). This mismatch reinforces the need to evaluate generated web artifacts through state transitions and requirement satisfaction, rather than visual plausibility alone.

### 5.3 Analysis

Table 3: Auxiliary safety and robustness diagnostic results by model. Pass rates are computed over applicable check instances; higher is better.

#### 5.3.1 Safety and Robustness Diagnostics

As an auxiliary diagnostic, we evaluate basic HTML safety and robustness checks. [Table 3](https://arxiv.org/html/2606.03220#S5.T3) shows uniformly low pass rates: even GPT-5.5 reaches only 41.3%, while most models cluster within 25–32%. The flat model ranking and small cross-modality variation suggest that safer HTML generation is not automatically induced by stronger models or richer input specifications.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x4.png)
Figure 4: Visual-score distributions across input modalities. Points denote models and boxes show the distribution.

#### 5.3.2 Modality Effects

[Fig. 4](https://arxiv.org/html/2606.03220#S5.F4) shows that visual quality and interaction performance follow different patterns. Text has the largest cross-model variance, while Sketch obtains high visual scores due to strong spatial constraints from wireframes. However, Image and Video have similar visual-score distributions, whereas Video leads in interaction-oriented metrics in [Sec. C.2](https://arxiv.org/html/2606.03220#A3.SS2). This indicates that Video’s advantage is better explained by temporal interaction evidence than by static visual fidelity, reinforcing that visual quality should remain an auxiliary signal. The visual scoring procedure is described in [Sec. B.5](https://arxiv.org/html/2606.03220#A2.SS5).

![Refer to caption](https://arxiv.org/html/2606.03220v1/x5.png)
Figure 5: Scaling behavior of the Qwen3.5 family across input modalities. Performance is largely flat from 27B to 122B-A10B, but increases sharply at 397B-A17B.

#### 5.3.3 Model Scaling Effects

[Fig. 5](https://arxiv.org/html/2606.03220#S5.F5) shows a non-linear scaling trend within the Qwen3.5 family: performance is largely flat from 27B to 122B-A10B, but improves clearly at 397B-A17B. The gains are strongest under Text and Markdown, where layout, interaction logic, and state behavior must be inferred from weaker specifications. This pattern suggests a scaling knee for stateful web artifact generation, where sufficient model capacity becomes important for jointly modeling layout, interaction logic, and state behavior.

Table 4: Defect injection meta-evaluation. We compare ICG-based evaluation with checkpoint-style WebGen (WG) signals on 25 injected state-related defects. Det. denotes detected defects and DR denotes detection rate. ICG detects defects at $2\times$ the rate of WG under the broad criterion and $16\times$ under the strict criterion.

#### 5.3.4 Defect Injection Meta-Evaluation

To assess evaluator sensitivity, we inject state-related defects into GT-validated pages and rerun the same pipeline. [Table 4](https://arxiv.org/html/2606.03220#S5.T4) shows that ICG-based evaluation detects substantially more defects than checkpoint-style WebGen signals, suggesting that explicit state-transition contracts are more sensitive to state corruptions missed by local checkpoints. The remaining missed cases show that defect-sensitive evaluation is not yet exhaustive.

#### 5.3.5 Failure Attribution

[Fig. 6](https://arxiv.org/html/2606.03220#S5.F6) groups direct failed transitions into four functional error types. GPT-5.5 and Kimi-K2.6 show similar profiles: *State & Logic* dominates, followed by *Feedback & Boundary*. Therefore, many failures occur after required controls or interaction paths are exposed, indicating that the main bottleneck is maintaining correct state updates, result logic, validation behavior, and boundary feedback under user actions.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x6.png)
Figure 6: Failure attribution (GPT-5.5 and Kimi-K2.6).

![Refer to caption](https://arxiv.org/html/2606.03220v1/x7.png)
Figure 7: Case study of WebRISE’s transition-level diagnosis on a shopping-cart interaction. After the only checked item is unchecked, the passing artifact resets the totals to zero and disables checkout. The failing artifact changes the item checkbox state but leaves the price breakdown and checkout availability stale; WebRISE localizes the error with failed DOM and visual assertions.

#### 5.3.6 Case Study

[Fig. 7](https://arxiv.org/html/2606.03220#S5.F7) illustrates transition-level diagnosis on a shopping-cart interaction. The failing artifact accepts the user click but fails to propagate the resulting state change to dependent totals and checkout availability, exposing a state-consistency error rather than a click-execution failure.

## 6 Conclusion

We introduced WebRISE, a benchmark that evaluates MLLM-generated web artifacts through requirement-induced observable state-transition conformance. WebRISE represents each task with an Interaction Contract Graph, enabling implementation-agnostic browser execution and state-, transition-, and requirement-level diagnostics over explicit functions and implicit state-consistency constraints. Experiments on 442 tasks, five input modalities, and 14 models show that current systems remain far from solving interactive web generation: Video provides the strongest interaction signal, while implicit state constraints remain a persistent bottleneck. These results highlight the need to evaluate generated web artifacts by requirement-level state behavior, rather than visual plausibility or isolated action success alone.

## Limitations

WebRISE focuses on self-contained HTML artifacts executed in a controlled browser environment. This enables consistent comparison across models and modalities, but does not cover full production web systems involving back-end services, authentication, external APIs, persistent databases, multi-user concurrency, or long-lived sessions. Accordingly, WebRISE should be interpreted as measuring front-end interaction conformance rather than deployment readiness. A natural extension is to augment Interaction Contract Graphs with sandboxed API contracts, persistent data fixtures, and session-level state transitions.

WebRISE evaluates generated pages against requirement-induced interaction contracts. Although the contracts are validated through ground-truth execution, schema checks, human consistency studies, and defect injection, their coverage is still bounded by the specified requirements, generated test items, and DOM/visual assertions. Therefore, WebRISE provides diagnostic evidence of conformance to the defined interaction contract, rather than an exhaustive characterization of all possible user behaviors. Future work can broaden coverage by expanding contract templates, adding richer defect suites, incorporating multiple evaluator agents, and selectively auditing uncertain cases.

## Ethical Considerations

WebRISE is a diagnostic benchmark, not a deployable system. Contributors and annotators participated under informed consent with aggregated reporting. Because contributors are drawn primarily from a single region, regional product conventions shape what counts as expected interaction, and applications targeting other markets should treat our metrics as a baseline and extend the contract set with locale-specific affordances. LLM-judge scoring is validated against human judgments ($\kappa=0.74$, Appendix [A.3](https://arxiv.org/html/2606.03220#A1.SS3)) and defect injection, but remains susceptible to prompt sensitivity and API version drift; reported scores should be read as stable rank-orderings rather than absolute measurements. We release all judge prompts, configurations, and per-assertion verdicts to support independent re-scoring.

## References

- Anthropic (2026a) Introducing claude opus 4.6. [https://www.anthropic.com/news/claude-opus-4-6](https://www.anthropic.com/news/claude-opus-4-6). Accessed: 2026-05-25.
- Anthropic (2026b) Introducing claude opus 4.7. [https://www.anthropic.com/news/claude-opus-4-7](https://www.anthropic.com/news/claude-opus-4-7). Accessed: 2026-05-25.
- Y. Chen, M. Liu, Y. Shen, Y. Li, T. Huang, X. Fang, T. Zheng, W. Huang, C. Yang, D. Fu, et al. (2025) IWR-bench: can lvlms reconstruct interactive webpage from a user interaction video?. arXiv preprint arXiv:2509.24709.
- Google DeepMind (2025) Gemini 3 flash model card. [https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf). Accessed: 2026-05-25.
- Google DeepMind (2026) Gemini 3.1 pro model card. [https://deepmind.google/models/model-cards/gemini-3-1-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/). Accessed: 2026-05-25.
- V. Jain, P. Agrawal, S. Banga, R. Kapoor, and S. Gulyani (2019) Sketch2Code: transformation of sketches to ui in real-time using deep neural network. arXiv preprint arXiv:1910.08930.
- C. Liu, Y. Fu, W. Yang, Y. Zhang, and T. Xie (2026) WebCoderBench: benchmarking web application generation with comprehensive and interpretable evaluation metrics. arXiv preprint arXiv:2601.02430.
- Y. Lu, B. Yao, H. Gu, J. Huang, Z. J. Wang, Y. Li, J. Gesi, Q. He, T. J. Li, and D. Wang (2025) Uxagent: an llm agent-based usability testing framework for web design. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pp. 1–12.
- Z. Lu, Y. Yang, H. Ren, H. Hou, H. Xiao, K. Wang, W. Shi, A. Zhou, M. Zhan, and H. Li (2026) Webgen-bench: evaluating llms on generating interactive and functional websites from scratch. Advances in Neural Information Processing Systems 38.
- Moonshot AI (2026) Kimi-k2.6. [https://huggingface.co/moonshotai/Kimi-K2.6](https://huggingface.co/moonshotai/Kimi-K2.6). Accessed: 2026-05-25.
- OpenAI (2026a) Introducing gpt-5.4. [https://openai.com/index/introducing-gpt-5-4/](https://openai.com/index/introducing-gpt-5-4/). Accessed: 2026-05-25.
- OpenAI (2026b) Introducing gpt-5.5. [https://openai.com/index/introducing-gpt-5-5/](https://openai.com/index/introducing-gpt-5-5/). Accessed: 2026-05-25.
- A. V. Periasami, J. Wang, and B. Dhingra (2026) Vision2Code: a multi-domain benchmark for evaluating image-to-code generation. arXiv preprint arXiv:2605.11307.
- Qwen Team (2026) Qwen3.6. [https://github.com/QwenLM/Qwen3.6](https://github.com/QwenLM/Qwen3.6). Accessed: 2026-05-25.
- C. Si, Y. Zhang, R. Li, Z. Yang, R. Liu, and D. Yang (2025) Design2code: benchmarking multimodal code generation for automated front-end engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3956–3974.
- K. Team (2026a) Kimi K2.5: visual agentic intelligence. CoRR abs/2602.02276.
- Q. Team (2026b) Qwen3.5: accelerating productivity with native multimodal agents. [https://qwen.ai/blog?id=qwen3.5](https://qwen.ai/blog?id=qwen3.5).
- H. Tran, L. Nashold, R. Krishnan, and A. Bigeard (2026) Vibe code bench: evaluating ai models on end-to-end web application development. arXiv preprint arXiv:2603.04601.
- J. Xiao, Y. Wan, Y. Huo, Z. Wang, X. Xu, W. Wang, Z. Xu, Y. Wang, and M. R. Lyu (2025) Interaction2Code: benchmarking mllm-based interactive webpage code generation from interactive prototyping. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE).
- M. Xu, Z. Yang, W. Hong, L. Pan, X. Fan, Y. Wang, X. Gu, B. Xu, and J. Tang (2025) Webvia: a web-based vision-language agentic framework for interactive and verifiable ui-to-code generation. arXiv preprint arXiv:2511.06251.
- N. Ye, X. Yu, R. Xu, T. Peng, and Z. Yu (2025) AI agents for web testing: a case study in the wild. arXiv preprint arXiv:2509.05197.
- S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen (2024) A survey on multimodal large language models. National Science Review 11(12), pp. nwae403.
- C. Zhang, Y. Li, C. Xu, J. Liu, A. Liu, C. Zhou, K. Deng, D. Wu, G. Huang, K. Li, et al. (2025) Artifactsbench: bridging the visual-interactive gap in llm code generation evaluation. arXiv preprint arXiv:2507.04952.
- H. Zhu, Y. Zhang, B. Zhao, J. Ding, S. Liu, T. Liu, D. Wang, Y. Liu, and Z. Li (2025) Frontendbench: a benchmark for evaluating llms on front-end development via automatic evaluation. arXiv preprint arXiv:2506.13832.

## Appendix

## Appendix A Additional Benchmark Details

### A.1 Benchmark Statistics

[Table 5](https://arxiv.org/html/2606.03220#A1.T5) reports additional construction statistics of WebRISE. The benchmark contains 442 tasks across 8 domains and 35 scenarios, instantiated under five input modalities. At the interaction-contract level, it includes 5,081 states, 5,495 transitions, and 5,271 requirement checks, covering both explicit user-stated requirements and implicit product-level constraints.

### A.2 Input Modality Construction

WebRISE instantiates each task under five input modalities to simulate different specification conditions in practical web artifact generation. The task $\tau$ and its interaction contract $G_{\tau}$ are fixed across modalities, while the input specification $x_{\tau}^{m}$ varies. [Table 6](https://arxiv.org/html/2606.03220#A1.T6) summarizes the information provided by each modality and its intended evaluation role.

### A.3 Human Consistency Validation

We conduct human consistency validation to examine whether the constructed interaction contracts and automatic evaluators align with human judgements\. The validation covers two aspects: \(i\) the requirement\-to\-ICG construction and agent\-based functional evaluation, and \(ii\) the modality\-specific visual evaluation\. This study is used only as a consistency check for benchmark construction and evaluator reliability; it is not used to tune model outputs or change the main evaluation results\.

Annotation setup. We sample 300 interaction cases from WebRISE, stratified across domains, input modalities, and task difficulty levels. Each interaction case contains the original task requirement, the corresponding test item or ICG transition, the generated page execution trace, and the automatic verdict. Human annotators judge whether the transition correctly reflects the intended requirement and whether the generated page satisfies the expected functional interaction. For the visual validation, we sample 300 generated HTML pages across the five input modalities. Annotators evaluate visual quality according to the modality-specific criterion: single-page visual quality for Text, reference-page similarity for Image and Video, sketch similarity for Sketch, and Markdown-structure consistency for Markdown.

Annotator disclosure and privacy. The annotators were informed about how the benchmark data were collected and how their annotations would be used in this research. The annotation process does not require releasing private personal information. For privacy reasons, we do not disclose additional identifying information about individual participants, such as names, employers, or detailed personal profiles. All reported results are aggregated.

Metrics. We report accuracy, mean absolute error (MAE), Spearman correlation, Pearson correlation, and Cohen’s $\kappa$. Accuracy and Cohen’s $\kappa$ measure agreement on binary pass/fail judgements. MAE and correlation metrics are computed over normalized scores when graded judgements are available. For each validation setting, we compare the automatic result against the human-majority judgement, and also report human–human agreement as a reference.
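For readers unfamiliar with Cohen’s $\kappa$, the sketch below computes accuracy and $\kappa = (p_o - p_e)/(1 - p_e)$ for binary pass/fail labels, where $p_o$ is observed agreement and $p_e$ is chance agreement. The label vectors are invented for illustration and are not drawn from the validation study; the function name is an assumption.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters assigning binary (0/1) labels to the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1 = sum(a) / n  # fraction of items rater A marks as pass
    p_b1 = sum(b) / n  # fraction of items rater B marks as pass
    # Chance agreement: both say pass, or both say fail, independently.
    p_expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical verdicts: 1 = pass, 0 = fail.
automatic = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
human_majority = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

accuracy = sum(x == y for x, y in zip(automatic, human_majority)) / len(automatic)
kappa = cohens_kappa(automatic, human_majority)
```

In this toy example the raters agree on 8 of 10 items, but $\kappa$ is noticeably lower than accuracy because a sizeable share of that agreement is expected by chance.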

Table 5: Benchmark construction statistics of WebRISE. The table summarizes task coverage, modality instantiation, interaction-contract scale, and requirement-check composition.

Table 6: Input modalities in WebRISE. All modalities share the same task and interaction contract, but expose different specification signals to the model.

Table 7: Human consistency validation for interaction-contract construction and functional evaluation. The automatic requirement-to-ICG construction and agent-based evaluation are compared with human-majority judgements, with human–human agreement reported as a reference.

Interaction consistency. As shown in [Table 7](https://arxiv.org/html/2606.03220#A1.T7), the automatic requirement-to-ICG construction achieves 0.86 accuracy and a Cohen’s $\kappa$ of 0.78 against the human majority. The agent-based functional evaluator achieves 0.84 accuracy, 0.86 Spearman correlation, 0.84 Pearson correlation, and a Cohen’s $\kappa$ of 0.74. These scores are close to the human–human agreement, suggesting that both the constructed interaction contracts and the automatic functional evaluation provide stable signals for requirement-level interaction correctness.

Table 8: Human consistency validation for modality-specific visual evaluation. The visual evaluator is compared with human-majority judgements under modality-specific criteria, including single-page visual quality for Text, reference-page similarity for Image/Video, sketch similarity for Sketch, and structure consistency for Markdown.

Visual consistency. [Table 8](https://arxiv.org/html/2606.03220#A1.T8) reports the consistency of the visual evaluator. The overall visual evaluator obtains 0.81 accuracy, 0.80 Spearman correlation, 0.78 Pearson correlation, and a Cohen’s $\kappa$ of 0.69 against the human majority. The agreement is slightly lower than the functional interaction evaluation, which is expected because visual assessment involves more subjective judgement. Nevertheless, the results indicate that the visual evaluator provides a stable auxiliary signal for modality-specific layout quality, visual consistency, and reference alignment.

## Appendix B Additional Evaluation Protocol Details

### B.1 Agent Observation and Action Space

WebRISE provides the evaluation agent with a compact, action-oriented view of the live webpage rather than the full HTML document. At each interaction step, the browser state is converted into an indexed DOM observation that exposes only interaction-relevant elements and state fields. Each actionable element receives an ephemeral `index`, which is local to the current observation and regenerated after the next browser action. This design allows the agent to act on the current page state without relying on persistent CSS selectors, fixed DOM paths, or reference-specific implementation details.

Indexed DOM observation. For each serialized element, WebRISE records its tag, accessibility role, visible text, key attributes, and interaction states. Typical fields include `placeholder`, `value`, `href`, `type`, `checked`, `selected`, `expanded`, `pressed`, `disabled`, `aria-disabled`, and `pointer-events`. For structured or stateful widgets, the observation additionally records option lists, slider values, scroll offsets, and cursor or selection ranges for editable regions. These fields support fine-grained interactions such as selecting text spans, operating custom dropdowns, restoring scroll context, and performing drag-and-drop transitions.

Non-standard components. Generated pages often implement interactive elements with custom DOM structures rather than native controls. Therefore, WebRISE includes not only native buttons, links, inputs, selects, and text areas, but also elements with interactive ARIA roles, non-negative `tabindex`, event listeners, pointer or text cursors, `contenteditable`, or hover-revealed subtrees. Newly appeared elements are marked through cross-step DOM diffing, and hidden or non-interactable elements are explicitly annotated. This makes the agent interface robust to diverse MLLM-generated implementations while avoiding the cost and brittleness of exposing the full DOM.
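The inclusion criteria above can be sketched as a simple filter that assigns fresh indices on every observation. This is an illustrative approximation, not the benchmark's implementation: the element dictionaries, the field names (`listeners`, `cursor`), and the particular set of ARIA roles are assumptions, and hover-revealed subtrees and DOM diffing are omitted.

```python
NATIVE_TAGS = {"button", "a", "input", "select", "textarea"}
# Assumed subset of interactive ARIA roles, for illustration only.
INTERACTIVE_ROLES = {"button", "link", "checkbox", "tab", "menuitem",
                     "switch", "slider", "combobox", "option"}


def is_interactive(el: dict) -> bool:
    """Return True if a serialized element should be exposed to the agent."""
    if el.get("tag") in NATIVE_TAGS:
        return True
    if el.get("role") in INTERACTIVE_ROLES:
        return True
    tabindex = el.get("tabindex")
    if tabindex is not None and int(tabindex) >= 0:
        return True
    if el.get("listeners"):                      # e.g. ["click"]
        return True
    if el.get("cursor") in {"pointer", "text"}:  # computed CSS cursor
        return True
    return bool(el.get("contenteditable"))


def index_observation(elements: list) -> dict:
    """Assign ephemeral indices 0..n-1 to interactive elements only."""
    interactive = [el for el in elements if is_interactive(el)]
    return dict(enumerate(interactive))


# Hypothetical page snapshot.
elements = [
    {"tag": "button", "text": "Save"},
    {"tag": "div", "text": "Static banner"},
    {"tag": "div", "role": "checkbox", "text": "Select item"},
    {"tag": "span", "tabindex": "-1", "text": "Not reachable by Tab"},
    {"tag": "div", "cursor": "pointer", "text": "Custom dropdown"},
]
observation = index_observation(elements)
```

Because indices are recomputed from scratch on each call, they carry no meaning across steps, which mirrors the ephemeral-index design described in this section.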

Action space. The agent action space covers common web operations and interaction-heavy behaviors. It includes pointer actions, keyboard and text actions, form-control actions, spatial actions, and navigation/lifecycle actions: `Click`, `Hover`, `Type`, `Clear`, `PressKey`, `SelectOption`, `ToggleCheck`, `SetSliderValue`, `Scroll`, `DragAndDrop`, `UploadFile`, `CanvasClickAt`, `Back`, `Refresh`, `WaitFor`, and `Done`. We additionally support `SelectText` for selecting contiguous text spans inside `input`, `textarea`, and `contenteditable` regions. These actions allow WebRISE to evaluate interactions that cannot be expressed by simple click/type scripts, including anchored text editing, drag-and-drop reordering, file upload, slider control, canvas selection, and browser navigation recovery.

### B.2 DOM and Visual Assertion Scoring

WebRISE scores each transition with two complementary assertion channels. DOM assertions operate on structured browser evidence, including the initial DOM snapshot, the final DOM snapshot, and the event log collected during agent execution. Visual postconditions operate on pre- and post-interaction screenshots. This separation lets WebRISE capture transient process evidence and element-level states through DOM signals, while using visual evidence for final user-visible outcomes.

DOM assertion scoring. Each DOM assertion is prefixed with a temporal operator. `[CHANGE]` requires the condition to hold at some point during the execution timeline, and is used for transient signals such as loading, saving, progress, debounce, confirmation feedback, or temporary disabled states. `[AFTER]` requires the condition to hold in the final stable DOM state, and is used for persistent outcomes such as selected filters, disabled controls, removed items, restored buttons, or updated ARIA states.
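The difference between the two temporal operators can be sketched as follows. This is a deterministic simplification under stated assumptions: the timeline is modeled as an ordered list of state dictionaries and the condition as a Python predicate, whereas the benchmark scores assertions with a judge model over DOM snapshots and event logs. The function and field names are hypothetical.

```python
from typing import Any, Callable, Mapping, Sequence

Snapshot = Mapping[str, Any]


def holds(operator: str,
          predicate: Callable[[Snapshot], bool],
          timeline: Sequence[Snapshot]) -> bool:
    """Evaluate a condition under a temporal operator.

    CHANGE: true if the condition holds at any point in the timeline.
    AFTER:  true if the condition holds in the final snapshot.
    """
    if not timeline:
        raise ValueError("timeline must contain at least one snapshot")
    if operator == "CHANGE":
        return any(predicate(s) for s in timeline)
    if operator == "AFTER":
        return bool(predicate(timeline[-1]))
    raise ValueError(f"unknown temporal operator: {operator}")


# Hypothetical timeline for clicking a Save button.
timeline = [
    {"saving": False, "button_disabled": False},  # before the click
    {"saving": True, "button_disabled": True},    # transient saving state
    {"saving": False, "button_disabled": False},  # final stable state
]

saw_saving = holds("CHANGE", lambda s: s["saving"], timeline)
still_saving = holds("AFTER", lambda s: s["saving"], timeline)
button_restored = holds("AFTER", lambda s: not s["button_disabled"], timeline)
```

A check that only inspected the final snapshot would miss the transient saving indicator entirely, which is why the two operators are kept separate.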

To reduce free-form interpretation, WebRISE applies deterministic priority rules for common state predicates. For non-interactivity, the scorer first checks `pointer-events: none`, then native `disabled` or `aria-disabled="true"`, and then state-indicative class tokens such as `disabled`, `inactive`, `locked`, or `readonly`. For selection or activation, the scorer prioritizes `aria-selected`, `aria-pressed`, and `aria-checked`, followed by class tokens such as `selected`, `active`, `highlighted`, or `current`. For expansion, it uses `aria-expanded` and visibility changes in the corresponding container subtree.
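As an illustration of the non-interactivity priority order, the sketch below returns the first matching evidence source for an element. The element representation (`style`, `attrs`, `classes` keys) and the returned labels are assumptions made for this example.

```python
from typing import Optional

NON_INTERACTIVE_CLASS_TOKENS = {"disabled", "inactive", "locked", "readonly"}


def non_interactive_evidence(element: dict) -> Optional[str]:
    """Return the highest-priority evidence that an element is non-interactive.

    Priority: pointer-events: none, then native disabled or
    aria-disabled="true", then state-indicative class tokens.
    Returns None when no rule matches.
    """
    style = element.get("style", {})
    attrs = element.get("attrs", {})
    classes = set(element.get("classes", []))

    if style.get("pointer-events") == "none":
        return "pointer-events"
    if "disabled" in attrs or attrs.get("aria-disabled") == "true":
        return "disabled-attribute"
    if classes & NON_INTERACTIVE_CLASS_TOKENS:
        return "class-token"
    return None


# Hypothetical checkout buttons from three different generated pages.
css_blocked = {"style": {"pointer-events": "none"}, "classes": ["disabled"]}
aria_blocked = {"attrs": {"aria-disabled": "true"}}
class_blocked = {"classes": ["btn", "locked"]}
enabled = {"classes": ["btn", "primary"]}

results = [non_interactive_evidence(e)
           for e in (css_blocked, aria_blocked, class_blocked, enabled)]
```

Fixing the order means that three pages disabling a button in three different ways are all recognized, while the recorded evidence source stays reproducible across runs.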

Element localization uses visible text, role, `aria-label`, placeholder, attributes, and child-structure summaries. When multiple candidates match the target and the evidence is insufficient to disambiguate them, the scorer returns `Uncertain` rather than selecting a target arbitrarily. Only `Yes` is treated as passing when aggregating assertion-, transition-, and requirement-level scores.

Visual postcondition scoring. Visual postconditions compare the screenshots before and after a transition. They are written as behavioral conditions rather than pixel-level constraints, so different implementations can pass if they satisfy the same user-visible semantics. Typical postconditions include list updates, sorting changes, panel expansion, drag-and-drop placement, empty-state display, stale-state removal, and visible value updates.

The visual scorer uses before/after differences to judge the requested semantic change. For conditional assertions, it first determines which branch applies from the screenshots and evaluates only that branch. If relevant content is clipped by the viewport or a scrollable container, the scorer relies only on fully visible evidence. For search or filter assertions, an empty result may pass when the filter is visibly active and the page shows a valid empty state. Ambiguous or unsupported visual evidence is marked as `Uncertain`, and does not count as a passing assertion.

### B.3 Transition Outcomes and Evidence

Each evaluated transition receives one of four outcomes. `Pass` indicates that the source state is reachable, the agent completes the intended interaction, and all required DOM/visual checks pass. `Fail` indicates that the transition is executable but at least one required assertion or postcondition is violated. `Blocked` indicates that the agent cannot complete the interaction within the budget, typically because the required affordance is absent, hidden, or non-functional. `Skipped` indicates that the source state cannot be restored, usually because a prerequisite transition failed or the replay path is unavailable. This taxonomy separates contract violations from execution failures and prevents a single upstream defect from being counted repeatedly across downstream transitions.
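The four-way taxonomy can be read as a short decision procedure, sketched below. The function signature is hypothetical, and treating an `Uncertain` verdict as a failed check is an assumption based on the rule that only `Yes` counts as passing.

```python
from enum import Enum
from typing import Sequence


class Outcome(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    BLOCKED = "Blocked"
    SKIPPED = "Skipped"


def classify_transition(source_restored: bool,
                        interaction_completed: bool,
                        verdicts: Sequence[str]) -> Outcome:
    """Map execution facts and assertion verdicts to a transition outcome."""
    if not source_restored:
        # Prerequisite failed or replay path unavailable.
        return Outcome.SKIPPED
    if not interaction_completed:
        # Affordance absent, hidden, or non-functional within the budget.
        return Outcome.BLOCKED
    if all(v == "Yes" for v in verdicts):
        return Outcome.PASS
    # Assumption: any "No" or "Uncertain" verdict makes the transition fail.
    return Outcome.FAIL


# Hypothetical transitions from a shopping-cart task.
outcomes = [
    classify_transition(True, True, ["Yes", "Yes"]),
    classify_transition(True, True, ["Yes", "No"]),
    classify_transition(True, False, []),
    classify_transition(False, False, []),
]
```

Checking restoration before execution, and execution before assertions, is what keeps a single upstream defect from being reported as several downstream contract violations.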

For auditability, WebRISE stores a structured evidence bundle for every transition. The bundle includes the transition identifier, source and target state descriptors, the natural-language agent goal, pre- and post-interaction screenshots, the agent action trace, the DOM event log, initial and final DOM snapshots, per-assertion verdicts, the final transition outcome, and the replay path when state replay is used. Each per-assertion record stores the verdict, supporting evidence fragments, and scorer version. The evidence bundle allows each reported error to be traced to the relevant phase, such as source-state restoration, agent execution, DOM assertion scoring, visual postcondition scoring, or replay. It also supports manual auditing, phase-level error analysis, and defect-injection meta-evaluation.

### B.4 Additional Metric Details

In addition to the main metrics, WebRISE records test-item-level and assertion-level signals for diagnostic analysis. These signals are not used as primary leaderboard metrics, but help localize errors between user-facing behaviors, transition checks, and individual evidence channels.

Test-item coverage. A test item corresponds to a user-triggered behavior and its expected semantic outcome. Using the coverage mapping $M_{\tau}$, each test item is linked to the transitions and assertions that verify it. We mark a test item as satisfied only when all mapped transitions and required assertions pass:

$$TI\%(\tau)=\frac{1}{|I_{\tau}|}\sum_{i\in I_{\tau}}\mathrm{sat}(i)\times 100, \tag{9}$$

where $I_{\tau}$ is the set of test items for task $\tau$ and $\mathrm{sat}(i)\in\{0,1\}$. Because test items are closer to user-facing behaviors than raw transitions, $TI\%$ is mainly used for qualitative error analysis.
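A minimal sketch of Eq. (9) follows. The layout of the coverage mapping (a dictionary from test item to its linked transitions and assertions) and all identifiers are assumptions made for illustration; the paper does not specify the data format of $M_{\tau}$.

```python
def ti_coverage(mapping: dict, transition_passed: dict, assertion_verdicts: dict) -> float:
    """Percentage of test items whose mapped transitions and assertions all pass."""
    if not mapping:
        raise ValueError("a task needs at least one test item")
    satisfied = 0
    for links in mapping.values():
        transitions_ok = all(transition_passed[t] for t in links["transitions"])
        # Only a "Yes" verdict counts as a passing assertion.
        assertions_ok = all(assertion_verdicts[a] == "Yes" for a in links["assertions"])
        if transitions_ok and assertions_ok:
            satisfied += 1
    return 100.0 * satisfied / len(mapping)


# Hypothetical coverage mapping for a shopping-cart task.
mapping = {
    "add_item": {"transitions": ["t1"], "assertions": ["a1", "a2"]},
    "uncheck_item": {"transitions": ["t2"], "assertions": ["a3"]},
}
transition_passed = {"t1": True, "t2": True}
assertion_verdicts = {"a1": "Yes", "a2": "Yes", "a3": "Uncertain"}

ti_pct = ti_coverage(mapping, transition_passed, assertion_verdicts)
```

The second item is not satisfied because one of its assertions is `Uncertain`, so the task scores 50.0.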

Assertion-level verdicts. Each DOM assertion and visual postcondition receives a verdict in $\{\textsc{Yes},\textsc{No},\textsc{Uncertain}\}$. Only `Yes` is treated as passing when aggregating assertion-, transition-, test-item-, and requirement-level scores. `No` indicates contradicted evidence, while `Uncertain` indicates insufficient or ambiguous evidence. This conservative rule prevents ambiguous observations from inflating final scores.

Aggregation convention. Unless otherwise specified, model- and modality-level scores are computed by macro-averaging task-level scores. This gives each task equal weight and prevents tasks with more transitions, assertions, or requirements from dominating aggregate results. Assertion-level and test-item-level metrics are used for debugging, case studies, and failure attribution, while the main paper focuses on state reachability, transition validity, and explicit/implicit requirement coverage.

Compact overall score. For leaderboard readability, we report an auxiliary Overall score:

$$O(\theta)=\frac{1}{|\mathcal{M}|}\sum_{m\in\mathcal{M}}\frac{T(\theta,m)+R(\theta,m)+V(\theta,m)}{3}, \tag{10}$$

where $V$ is the modality-specific auxiliary visual score. Overall is used only as a compact summary; the primary analysis relies on the diagnostic interaction and requirement metrics, especially $T\%$, $R_{e}\%$, $R_{i}\%$, and $R\%$.
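The Overall score in Eq. (10) is a mean over modalities of the per-modality mean of $T$, $R$, and $V$. The sketch below shows the computation; the only real numbers used are the Markdown scores reported for Qwen3.6-35B-A3B in Sec. 5.2, and the second modality in the two-modality example is invented.

```python
def overall(scores: dict) -> float:
    """Eq. (10): average over modalities of (T + R + V) / 3.

    `scores` maps a modality name to a (T, R, V) tuple.
    """
    if not scores:
        raise ValueError("at least one modality is required")
    per_modality = [(t + r + v) / 3 for t, r, v in scores.values()]
    return sum(per_modality) / len(per_modality)


# Markdown scores for Qwen3.6-35B-A3B as reported in Sec. 5.2.
markdown_only = {"Markdown": (15.5, 19.2, 80.8)}
markdown_term = overall(markdown_only)

# Hypothetical second modality added purely to show the averaging step.
two_modalities = {"Markdown": (15.5, 19.2, 80.8), "Video": (30.0, 30.0, 60.0)}
combined = overall(two_modalities)
```

The Markdown term comes out at roughly 38.5 even though $T$ and $R$ are below 20, which illustrates why Overall is treated only as a compact summary: a high visual score can lift it well above the interaction metrics.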

### B\.5Visual Quality Evaluation Details

WebRISEreports visual quality as an auxiliary signal, complementary to executable interaction metrics\. The visual evaluator combines three components: layout structure, color accessibility, and perceptual aesthetics, with modality\-specific aggregation\.

Layout and color. The layout module performs coarse-grained block modeling over the rendered page, measuring alignment, structural clarity, and floating-element artifacts. When a visual reference is available, it also measures cross-page structural consistency using row-level signatures and grid-distribution similarity. The color module checks text contrast against WCAG thresholds ($\geq$ 4.5:1 for normal text and $\geq$ 3:1 for large text), and for Image/Video additionally compares palette and contrast-profile similarity to the reference page.
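The contrast thresholds above refer to the standard WCAG contrast ratio. The sketch below implements that public formula (sRGB relative luminance, then $(L_1+0.05)/(L_2+0.05)$); it is not the paper's color module, whose implementation is not described here, and the function names are illustrative.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    r, g, b = (_linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Ratio of the lighter to the darker luminance, each offset by 0.05."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_threshold(fg, bg, large_text: bool = False) -> bool:
    """4.5:1 for normal text, 3:1 for large text, as stated in the text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)


white = (255, 255, 255)
print(round(contrast_ratio((0, 0, 0), white), 2))            # black on white
print(passes_threshold((118, 118, 118), white))              # gray #767676
print(passes_threshold((119, 119, 119), white))              # gray #777777
print(passes_threshold((119, 119, 119), white, large_text=True))
```

A one-step change in gray level can flip the normal-text verdict, which is why a rule-based check is used here instead of a perceptual judgment.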

Aesthetics. A VLM-based scorer evaluates screenshots along high-level perceptual dimensions, including whitespace balance, recurring-element consistency, hierarchy clarity, and overall polish. This complements rule-based layout and color checks with visual judgments that are difficult to encode deterministically.

Modality-specific aggregation. For Text, aesthetics is the primary signal, with layout and color used as auxiliary checks. For Markdown and Sketch, structural similarity to the reference specification receives the largest weight, supplemented by aesthetics. For Image and Video, layout fidelity and color reproduction relative to the reference page are primary, with aesthetics as a secondary signal. All visual scores are macro-averaged across tasks, and full model-by-modality visual scores are reported in [Table 9](https://arxiv.org/html/2606.03220#A2.T9).

Table 9: Auxiliary visual-quality scores by model and input modality. Scores are reported on a 0–100 scale and macro-averaged across tasks.

Table 10: Judge-model robustness validation on 100 sampled GT/defect-injected HTML pairs.

## Appendix C Additional Experimental Details and Results

### C.1 Evaluation Judge Configuration

We use GPT\-5\-mini for transition\-level DOM assertion and visual postcondition scoring, and Gemini\-3\-Flash\-Preview for auxiliary visual\-quality scoring\. The same judge configuration is applied to all evaluated models, tasks, and modalities\.

To verify that the lighter transition-level judge does not reduce defect sensitivity, we compare GPT-5-mini with GPT-5.4 on 100 sampled GT/defect-injected HTML pairs. Each pair contains a GT-validated page that passes the ICG-based evaluation and a corresponding defect-injected variant that introduces a controlled interaction fault. As shown in [Table 10](https://arxiv.org/html/2606.03220#A2.T10), GPT-5-mini remains close to GPT-5.4 on this sampled control set.

### C.2 Additional Modality Analysis

Table 11: Average modality-level performance on WebRISE across all evaluated models and tasks. $T$ denotes transition validity; $R_e$ and $R_i$ denote explicit and implicit requirement coverage; $\Delta=R_e-R_i$ is the explicit–implicit gap; $R$ denotes overall requirement coverage; $V$ is the auxiliary visual score; and Overall is the mean of $T$, $R$, and $V$. Bold and underlined values indicate the best and second-best results in each column.

[Table 11](https://arxiv.org/html/2606.03220#A3.T11) reports modality-level averages across all evaluated models and tasks. Video achieves the strongest interaction-oriented performance, leading in transition validity ($T$), implicit requirement coverage ($R_i$), and overall requirement coverage ($R$), while reducing the explicit–implicit gap to 7.7 points. This suggests that temporal demonstrations are especially helpful for recovering state changes and implicit product-level behavior. Image obtains the highest explicit requirement coverage ($R_e=62.8$) and closely follows Video on $T$ and $R$, indicating that high-fidelity visual grounding helps models recover visible components and initial interface state. By contrast, Sketch obtains the highest auxiliary visual score ($V=81.4$), but lags behind Image and Video on interaction and requirement metrics. This indicates that visual organization alone is not a reliable proxy for executable interaction correctness.

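| Model | Text $S$ | Text $T$ | Text $R_e$ | Text $R_i$ | Text $R$ | Text $V$ | MD $S$ | MD $T$ | MD $R_e$ | MD $R_i$ | MD $R$ | MD $V$ | Sketch $S$ | Sketch $T$ | Sketch $R_e$ | Sketch $R_i$ | Sketch $R$ | Sketch $V$ | Image $S$ | Image $T$ | Image $R_e$ | Image $R_i$ | Image $R$ | Image $V$ | Video $S$ | Video $T$ | Video $R_e$ | Video $R_i$ | Video $R$ | Video $V$ | Overall |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Open-Source | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| Qwen3.6-35B-A3B | 31.8 | 26.8 | 36.6 | 25.9 | 30.5 | 78.2 | 20.0 | 15.5 | 22.3 | 16.7 | 19.2 | 80.8 | 47.1 | 41.2 | 54.5 | 38.1 | 45.4 | 77.0 | 51.8 | 46.6 | 56.7 | 43.8 | 49.6 | 71.7 | 53.3 | 49.5 | 57.7 | 47.9 | 52.2 | 72.8 | 50.5 |
| Qwen3.5-122B-A10B | 42.8 | 38.0 | 48.9 | 35.2 | 41.2 | 56.8 | 47.5 | 42.5 | 54.1 | 39.3 | 45.9 | 72.0 | 43.4 | 38.0 | 49.7 | 36.2 | 42.3 | 74.0 | 45.4 | 40.2 | 50.9 | 38.1 | 43.8 | 70.7 | 47.0 | 42.8 | 51.7 | 43.5 | 47.1 | 71.3 | 51.1 |
| Qwen3.5-27B | 41.4 | 36.3 | 47.3 | 34.3 | 40.0 | 59.9 | 46.9 | 41.7 | 53.5 | 38.8 | 45.5 | 72.1 | 44.6 | 38.6 | 50.7 | 36.5 | 42.7 | 76.8 | 47.7 | 42.6 | 53.6 | 41.2 | 46.7 | 70.6 | 47.2 | 43.1 | 51.0 | 43.4 | 46.9 | 71.8 | 51.7 |
| Qwen3.5-397B-A17B | 51.2 | 45.7 | 57.2 | 42.8 | 49.2 | 64.8 | 56.2 | 51.1 | 62.3 | 48.2 | 54.5 | 75.7 | 52.5 | 46.8 | 60.1 | 42.8 | 50.5 | 78.9 | 53.2 | 48.4 | 57.7 | 46.3 | 51.4 | 72.8 | 53.3 | 49.3 | 56.8 | 49.4 | 52.8 | 72.1 | 57.6 |
| Qwen3.6-27B | 52.7 | 47.9 | 58.4 | 44.8 | 50.9 | 75.3 | 62.2 | 57.5 | 67.3 | 54.3 | 60.1 | 83.0 | 55.6 | 50.4 | 60.9 | 47.2 | 53.3 | 87.2 | 60.3 | 55.2 | 64.8 | 52.0 | 57.8 | 74.1 | 58.5 | 54.2 | 61.4 | 53.4 | 57.2 | 74.1 | 62.5 |
| Kimi-K2.5 | 53.5 | 48.5 | 59.4 | 46.1 | 51.9 | 68.9 | 61.9 | 57.0 | 67.3 | 53.5 | 59.6 | 73.8 | 52.8 | 47.8 | 58.3 | 44.0 | 50.4 | 79.9 | 61.3 | 56.9 | 65.2 | 54.0 | 59.1 | 72.6 | 62.2 | 58.6 | 65.0 | 56.5 | 60.3 | 72.9 | 61.2 |
| Kimi-K2.6 | 49.4 | 44.6 | 54.2 | 41.8 | 47.3 | 83.1 | 56.5 | 51.7 | 62.9 | 48.4 | 54.9 | 87.1 | 53.0 | 47.8 | 58.8 | 45.6 | 51.5 | 86.3 | 63.2 | 58.5 | 66.6 | 55.4 | 60.4 | 73.2 | 67.1 | 63.7 | 68.4 | 62.6 | 65.4 | 73.5 | 63.3 |
| Proprietary | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
| Gemini 3 Flash | 49.7 | 44.7 | 56.2 | 41.5 | 48.2 | 71.9 | 55.3 | 50.0 | 63.2 | 47.0 | 54.1 | 79.3 | 51.2 | 46.1 | 57.7 | 42.7 | 49.3 | 85.4 | 59.5 | 54.1 | 64.7 | 51.5 | 57.5 | 72.4 | 49.9 | 45.6 | 53.5 | 44.2 | 48.5 | 70.8 | 58.5 |
| Claude Opus 4.6 | 47.9 | 43.3 | 53.1 | 39.5 | 45.5 | 56.6 | 58.8 | 54.3 | 63.3 | 50.6 | 56.3 | 73.9 | 57.5 | 52.3 | 63.6 | 48.0 | 55.0 | 72.2 | 62.1 | 57.7 | 65.9 | 54.2 | 59.5 | 70.2 | 55.7 | 52.6 | 58.4 | 51.7 | 54.9 | 70.7 | 58.3 |
| Gemini 3.1 Pro | 55.6 | 50.7 | 61.1 | 47.5 | 53.6 | 69.7 | 63.6 | 58.9 | 69.5 | 54.9 | 61.5 | 79.2 | 56.8 | 52.2 | 62.5 | 48.8 | 54.9 | 84.8 | 59.1 | 54.5 | 63.3 | 51.9 | 57.1 | 72.2 | 55.8 | 52.0 | 58.9 | 51.5 | 54.9 | 71.6 | 61.9 |
| Qwen3.6-Plus | 54.2 | 49.3 | 58.6 | 46.6 | 51.9 | 68.2 | 56.7 | 51.7 | 62.6 | 48.0 | 54.6 | 74.5 | 58.8 | 53.8 | 63.8 | 50.6 | 56.4 | 86.3 | 61.7 | 57.5 | 66.0 | 54.0 | 59.4 | 73.8 | 65.1 | 61.7 | 68.3 | 58.9 | 63.4 | 74.8 | 62.5 |
| Claude Opus 4.7 | 53.4 | 48.8 | 57.6 | 45.8 | 50.9 | 68.3 | 58.6 | 54.5 | 63.1 | 51.2 | 56.5 | 76.2 | 54.3 | 49.7 | 59.3 | 46.9 | 52.4 | 77.4 | 61.3 | 57.0 | 64.5 | 53.9 | 58.5 | 70.5 | 67.9 | 65.0 | 70.0 | 62.8 | 66.1 | 72.7 | 61.6 |
| GPT-5.4 | 64.6 | 59.7 | 70.3 | 54.3 | 61.4 | 78.4 | 65.2 | 60.5 | 70.5 | 55.4 | 62.2 | 79.8 | 62.7 | 57.8 | 70.2 | 52.4 | 60.3 | 86.6 | 64.5 | 60.0 | 68.7 | 56.6 | 62.1 | 71.5 | 66.1 | 63.1 | 68.4 | 61.6 | 64.8 | 73.7 | 66.8 |
| GPT-5.5 | 65.1 | 60.3 | 71.1 | 55.3 | 62.3 | 85.6 | 69.1 | 64.4 | 73.6 | 59.8 | 66.1 | 83.3 | 65.3 | 60.6 | 71.6 | 56.0 | 62.9 | 86.1 | 66.4 | 61.8 | 69.8 | 58.0 | 63.4 | 74.1 | 68.4 | 65.6 | 69.4 | 63.5 | 66.3 | 73.9 | 69.1 |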

Table 12: Full model $\times$ modality results with state reachability ($S$), transition validity ($T$), explicit ($R_e$) and implicit ($R_i$) requirement coverage breakdown, and modality-specific visual scores.

Table 13: Performance on the R-based Hard50 and Easy50 splits by input modality. Hard50 and Easy50 are selected as the 50 tasks with the lowest and highest model-averaged overall requirement coverage ($R$), respectively. Video leads on both splits, with a larger advantage on Hard50, especially for implicit requirement coverage ($R_i$).

### C.3 Difficulty and Failure Attribution

Failure-type taxonomy. To analyze where functional failures occur along the interaction implementation chain, we group direct failed transitions into four functional error types. Availability captures whether the page provides the required entry point, control, or interaction flow for completing the task. Execution captures whether a user action takes effect when the relevant control or input area is present. State & Logic captures whether the page correctly updates state, data rules, target content, visual status, and context after an action. Feedback & Boundary captures whether the page correctly handles validation, disabled states, loading, errors, confirmations, and empty states.

To understand whether low scores arise from uniformly harder tasks or from qualitatively different failure modes, we analyze the R\-based Hard50 and Easy50 splits from both performance and failure\-attribution perspectives\.

[Table 13](https://arxiv.org/html/2606.03220#A3.T13) compares the R-based Hard50 and Easy50 splits by input modality. The performance gap is large across all modalities, confirming that Hard50 captures genuinely difficult interaction tasks rather than small metric fluctuations. Video remains the strongest modality on both splits, but its margin is much larger on Hard50: compared with Image, Video improves $T$, $R_i$, and $R$ by 4.9, 6.1, and 4.8 points on Hard50, but only by 1.6, 1.7, and 1.4 points on Easy50. This suggests that dynamic interaction evidence is especially useful when tasks require non-trivial state transitions and implicit behavior recovery.
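The split construction described in the Table 13 caption (rank tasks by model-averaged $R$, take the lowest and highest 50) can be sketched as below. The function name, the input layout, and the toy scores are hypothetical, and tie-breaking by input order is an assumption, since the text does not specify it.

```python
from statistics import mean


def difficulty_splits(
    r_by_task: dict[str, list[float]], k: int = 50
) -> tuple[list[str], list[str]]:
    """Return (hard, easy): the k tasks with the lowest and the k tasks with
    the highest model-averaged requirement coverage R.

    r_by_task maps a task ID to its R scores, one per evaluated model.
    """
    if 2 * k > len(r_by_task):
        raise ValueError("splits would overlap: need at least 2 * k tasks")
    avg = {task: mean(scores) for task, scores in r_by_task.items()}
    ranked = sorted(avg, key=avg.get)  # ascending, so hardest tasks come first
    return ranked[:k], ranked[-k:]


# Toy data: four tasks scored by two models, with k=1 instead of 50.
toy = {
    "t1": [20.0, 30.0],
    "t2": [80.0, 90.0],
    "t3": [50.0, 60.0],
    "t4": [10.0, 15.0],
}
hard, easy = difficulty_splits(toy, k=1)
print(hard, easy)  # ['t4'] ['t2']
```

Because the ranking averages over all models, the splits reflect task difficulty and not the weakness of any single model.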

[Fig. 8](https://arxiv.org/html/2606.03220#A3.F8) further shows that the two splits expose different failure profiles. State and logic errors dominate both Hard50 and Easy50, indicating that stateful result logic remains the central bottleneck. However, Hard50 contains higher shares of availability failures and feedback/boundary failures, suggesting that difficult tasks often fail before or around the interaction boundary: required affordances may be missing, states may be unreachable, or edge-state feedback may be incomplete. By contrast, Easy50 failures are more concentrated in state and logic errors, meaning that models often expose a basic interaction path but still fail to maintain the correct result logic or state consistency.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x8.png)

Figure 8: Failure-family attribution on the R-based Hard50 and Easy50 splits. State and logic errors dominate both splits, while Hard50 shows larger shares of availability and feedback/boundary failures.

### C.4 Full Model $\times$ Modality Results

[Table 12](https://arxiv.org/html/2606.03220#A3.T12) reports the full model-by-modality results.

Table 14: ICG-only error patterns among cases where WebGen marks all test items as Yes. Counts are computed over 13 defect-injected cases detected only by ICG.

Table 15: High-frequency safety check details for GPT-5.5, sorted by pass rate. The lowest-pass checks mainly involve input constraints, unsafe DOM rendering, repeated-trigger guards, and sensitive-form protections.

Table 16: Safety rule-level breakdown for GPT-5.5. The weakest rule families are asynchronous interaction robustness, DOM rendering safety, and request security.

### C.5 Defect Injection Details

We further inspect the 13 defect-injected cases where WebGen marks all test items as Yes, but ICG still detects the injected defect. As shown in [Table 14](https://arxiv.org/html/2606.03220#A3.T14), these cases are not dominated by visibly missing controls or rendering failures. Instead, they involve longer-range behavioral constraints, including accumulated history preservation, cross-feature non-interference, navigation-time state retention, action gating, and pre/post state consistency. This explains why checkpoint-style evaluation can miss them: it often verifies whether the local target appears completed, whereas ICG follows transition chains and checks requirement-linked postconditions and state invariants. These ICG-only cases therefore show that explicit state-transition contracts provide complementary coverage for hidden state errors and cross-feature side effects beyond local checkpoint judgments.
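The difference between a local checkpoint and a transition-level contract can be illustrated with a toy history-preservation defect. Everything in this sketch is hypothetical: `BuggyNotesPage` stands in for a defect-injected page, and the two check functions are simplified stand-ins, not the WebGen or WebRISE evaluators.

```python
class BuggyNotesPage:
    """Toy page model with an injected defect: adding a note drops earlier ones."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def add_note(self, text: str) -> None:
        self.notes = [text]  # defect: accumulated history is overwritten


def checkpoint_check(page: BuggyNotesPage, new_note: str) -> bool:
    """Local check: does the target of the latest action appear on the page?"""
    return new_note in page.notes


def contract_check(before: list[str], page: BuggyNotesPage, new_note: str) -> bool:
    """Transition-level check: the postcondition must hold AND the
    history-preservation invariant must hold relative to the source state."""
    postcondition = new_note in page.notes
    invariant = all(note in page.notes for note in before)
    return postcondition and invariant


page = BuggyNotesPage()
page.add_note("first")
source_state = list(page.notes)  # snapshot of the source state
page.add_note("second")

print(checkpoint_check(page, "second"))              # True: local target is present
print(contract_check(source_state, page, "second"))  # False: "first" was lost
```

The local check passes because the new note is visible, while the contract fails because it compares the target state against the source state it was reached from.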

### C.6 Safety Evaluation Details

We provide rule-level safety diagnostics for GPT-5.5, the strongest model in the main interaction evaluation. These diagnostics are auxiliary to WebRISE’s interaction metrics and are intended to reveal common engineering-level weaknesses in generated HTML artifacts.

As shown in [Table 16](https://arxiv.org/html/2606.03220#A3.T16), the weakest rule families are asynchronous interaction robustness, DOM rendering safety, and request security. The low pass rates for R7, R6, and R1 indicate that generated pages often miss repeated-trigger guards, safe DOM rendering practices, and basic protections for sensitive requests. In contrast, navigation security obtains a high pass rate, but covers far fewer applicable checks and should not be interpreted as broad safety reliability.

[Table 15](https://arxiv.org/html/2606.03220#A3.T15) further shows that the most frequent low-pass checks involve missing input constraints, unsafe DOM rendering, repeated-click guards, and sensitive-form protections. These results suggest that even strong MLLMs may generate functional and visually plausible webpages while omitting basic front-end safety and robustness safeguards.

### C.7 Case Study

This section presents representative qualitative cases for the failure types used in our failure attribution analysis. Each case shows the input modality, a passing artifact, a failing artifact, the executed transition, and the failed evidence. Together, these examples show how WebRISE evaluates each transition from the source state to the target state and records where the expected behavior breaks.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x9.png)

Figure 9: Execution failure in a messaging interface. The transition requires filtering the conversation list, selecting the visible conversations, and batch deleting them. The failing artifact keeps the selected conversations visible after deletion.

Case 1: Execution failure. This case tests whether a generated messaging interface can execute a batch operation after filtering and selecting visible conversations. The expected behavior is that the selected conversations disappear after the batch-delete action, while unmatched conversations remain in the restored full list. Although the failing artifact displays the search and selection flow, the selected conversations are still visible after deletion. This indicates an execution failure: the page exposes a plausible operation path, but the underlying delete action is not successfully applied to the selected items.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x10.png)

Figure 10: Feedback and boundary failure in a feed-loading interaction. The transition requires scrolling to the bottom, triggering next-page loading, and displaying loading feedback during data fetching. The failing artifact does not show the required loading placeholder.

Case 2: Feedback & Boundary failure. This case focuses on process feedback during an infinite-scroll interaction. After the user scrolls to the bottom, the page should indicate that the next page of content is being fetched, for example through a skeleton screen or loading placeholder. The failing artifact reaches the scroll boundary but provides no observable loading state, and the evidence also shows no newly appended posts. This failure shows that the main interaction entry point may exist, while the boundary-state feedback required for a realistic web interaction is still missing.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x11.png)

Figure 11: State-and-logic failure in a course waitlist interaction. After Carol’s waitlist entry is cancelled, Dave should remain on the CS201 waitlist and move from position #2 to #1. The failing artifact removes Carol but leaves Dave’s queue position unchanged.

Case 3: State & Logic failure – inconsistent state update. This case evaluates whether a course registration page correctly updates dependent waitlist state. The transition first enrolls Alice and Bob, adds Carol and Dave to the CS201 waitlist, cancels Carol’s entry, and then opens the waitlist view. The failing artifact correctly removes Carol, but Dave remains marked as #2 instead of being promoted to #1. The error is an incomplete state update: one part of the state changes, while the dependent queue order is left stale.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x12.png)

Figure 12: State-and-logic failure in a layer-list interaction. The transition requires opening the layer panel and hiding the topmost layer. The failing artifact gives weak or transient hidden-state evidence but leaves the corresponding canvas element visible.

Case 4: State & Logic failure – cross-view inconsistency. This case tests synchronization between a layer list and the visible canvas. After the topmost layer is hidden from the layer panel, the corresponding object should no longer appear on the canvas, and the layer list should reflect the hidden state. The failing artifact provides only uncertain final-state evidence in the layer list and still displays the hidden layer on the canvas. This exposes a cross-view state inconsistency: the control-side state and the rendered canvas state are not synchronized.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x13.png)

Figure 13: State preservation failure in a draft-recovery workflow. After typing text, attaching an image, refreshing the page, and reopening the editor, the draft should be restored. The failing artifact loses both the entered text and the attached image.

Case 5: State & Logic failure – state not preserved. This case examines whether a social post editor preserves draft content across an unexpected refresh. The transition requires opening the editor, entering the text “draft recovery test”, attaching an image, refreshing the page, and reopening the editor. The failing artifact reopens the editor but shows the placeholder text and no image preview, meaning that neither the text nor the attachment is restored. This demonstrates a persistence failure: the interaction is locally available, but the generated page does not preserve user-created state across the page lifecycle.

![Refer to caption](https://arxiv.org/html/2606.03220v1/x14.png)

Figure 14: State preservation failure in an image-editing workflow. The transition requires applying a 90-degree rotation, entering crop mode, selecting an aspect ratio, and applying the crop while preserving the prior rotation. The failing artifact applies the crop but resets the rotation state.

Case 6: State & Logic failure – operation state reset. This case evaluates whether an image editor preserves earlier editing state when a later operation is applied. The transition first rotates the image by 90 degrees and then performs a crop with a selected aspect ratio. The failing artifact applies the crop, but the final result no longer preserves the prior 90-degree rotation state. This is a state-preservation error across sequential editing operations: the later crop operation incorrectly resets an earlier transformation state.

## Appendix D Prompt Templates

This section lists the prompt templates used in WebRISE, including templates for test data contract generation, test item generation, Interaction Contract Graph construction, contract-guided agent execution, DOM assertion scoring, and visual postcondition scoring.

Test Data Contract Generation

You are a frontend test specification designer. Given a requirement list, produce a minimal test data contract describing what the page must be functionally ready to do on first load. No reference implementation is provided; derive the contract from the requirements alone.

Rules:

1. Describe functional readiness, not UI structure, visual layout, DOM hierarchy, or exact element counts.
2. For multi-page, multi-view, tabbed, wizard, or navigation-based apps, explicitly name the initial page, view, route, or mode shown on first load.
3. Do not prescribe positions, component hierarchy, styling, specific mock values, asset sources, or exact numbers of items.
4. Include only conditions needed for the first test action to be possible.
5. Do not expose implicit requirements. Remove contract text that reveals behavior beyond the explicit requirements.
6. If a default initial view is not specified but the app requires one, choose a reasonable primary workflow view and state it functionally.

Output: Return exactly one JSON object: `{"test_data_contract": "functional preconditions describing page readiness"}`

Figure 15: Prompt for deriving initial functional readiness from requirements.

Test Item Generation

You are a frontend test specification designer. Produce a test item list covering every testable behavior in the given requirement list. No reference implementation is provided; derive the test items from the requirements alone.

Rules:

1. Generate one test item per distinct testable behavior. Every explicit and implicit requirement ID must appear in at least one item’s `req_ids`.
2. Triggers and expected results must be implementation-neutral. A trigger describes user intent, not a click sequence or pure observation; an expected result describes the semantic outcome.
3. Combine tightly coupled behaviors that share the same trigger and artifact, but split named cases with distinct outcomes.
4. Use only primary requirement IDs in `req_ids`, normally at most two per item. If an expected result verifies a requirement, include that requirement ID.
5. Do not invent behaviors. Failure, error, boundary, and follow-up cases must be grounded in named requirements.
6. Implicit requirements may refine explicitly stated behaviors, but must not introduce new scenarios by themselves.
7. Do not create standalone negative-capability items; fold unavailable actions into the expected result of the state-changing item that causes them.
8. Selection-gated controls are modeled as one behavior: the trigger selects the required content and invokes the control, while the expected result covers the gated availability.
9. Guard or prevention behavior may be a separate item only when it has a distinct user trigger and a clearly observable prevention outcome.
10. For toggles or bidirectional behavior, describe switching from the current state to the alternative without assuming a default.
11. Do not split countdown, cooldown, or expiry flows across separate items; keep the complete timed user intent in one item.

Output: Return exactly one JSON object: `{"test_items": [{"item_id": "TI-1", "req_ids": [...], "description": "...", "trigger": "...", "expected_result": "..."}]}`

Figure 16: Prompt for converting requirements into implementation-neutral test items.

ICG Generation

You are a frontend interaction test case designer. Generate a test specification with `states` and `transitions`. Each transition specifies a source state, a target state, a self-contained `agent_task`, mapped test item IDs, `dom_assertions`, and/or visual `postconditions`.

State rules:

1. States are stable, replayable page checkpoints. Use `S0`, `S1`, etc.; `S0` is the initial state.
2. State descriptions are short, implementation-neutral page snapshots. Do not model transient UI such as spinners, toasts, timers, or animations as states.
3. Define a new state whenever visible content, selected controls, open panels, input values, displayed artifacts, or persistent UI state differs materially.
4. Preserve state continuity: unchanged visible aspects from the source state should carry into the target state’s description.
5. A self-loop is valid only when the after-state is observably identical to the before-state.

Transition rules:

1. Use sequential IDs `T1`, `T2`, etc. Each transition declares `from`, `to`, `agent_task`, `mapped_test_items`, and at least one non-empty assertion list.
2. `preconditions` are allowed only on `T1`; later transitions must not contain preconditions.
3. Use `dom_assertions` for DOM mutation evidence, temporal evidence, and element-level state. Prefix each DOM assertion with exactly `[CHANGE]` or `[AFTER]`.
4. Use visual `postconditions` for outcomes judged from before/after screenshots. Do not prefix postconditions.
5. The `agent_task` must describe the user’s goal from the current `from` state. It must be actionable, self-contained, and not a low-level selector or click sequence.
6. The `agent_task` must not refer to previous transitions; if prior context is needed, describe it as a persistent property of the current source state.
7. Do not create observation-only or self-check transitions. Static checks should be attached to the transition that establishes the checked state.
8. No hitchhiking: every mapped test item must be directly caused by this transition’s `agent_task` and verifiable from this transition’s final state.
9. Every input test item must be covered by at least one transition.
10. Every mapped test item must have at least one direct `dom_assertion` or visual `postcondition` in the same transition.
11. Independent features should fan out from the same source state rather than being falsely chained; serial chains are used only when the later transition genuinely requires the prior target state.
12. Compound transitions may combine two or three operations only when they share the same artifact and are intended to test state interference or clobbering.
13. Do not invent UI controls, defaults, failure paths, labels, data values, or data assumptions not supported by the Test Data Contract or test items.
14. Empty-state tests must explicitly clear or remove pre-existing content when the Test Data Contract does not guarantee emptiness.

Output: Return exactly one JSON object with top-level fields `states` and `transitions`. Do not include markdown fences or extra prose.

Figure 17: Prompt for generating the state-transition Interaction Contract Graph.

Agent Execution

You are a robot browsing the web to execute a web-testing task. In each iteration, you receive an indexed DOM observation where elements are prefixed with `[N]`, and newly appeared elements may be marked with `*[N]`. Choose exactly one action using indices from the latest observation.

Action grammar:

- `Click [N]`
- `Click [N]; count`
- `Dismiss`
- `DoubleClick [N]`
- `RightClick [N]`
- `LongPress [N]; ms`
- `Hover [N]`
- `Input [N]; text`
- `InputDate [N]; YYYY-MM-DD`
- `Clear [N]`
- `Blur [N]`
- `Select [N]; option label`
- `Check [N]` / `Uncheck [N]`
- `Press [N]; key`
- `Press [N]; key; count`
- `Scroll [N or WINDOW]; up|down|top|bottom|left|right`
- `Drag [N]; [M]`
- `Drag [N]; offset_x=px,offset_y=px`
- `DragRange [N]; target value`
- `ClickAt; x=px y=px`
- `Upload [N]; file_path`
- `Upload [N]; file_path|file_path`
- `SelectText [N]; text to select`
- `Wait; ms`
- `Refresh`
- `GoBack`
- `Reset`
- `Done`

Execution constraints: Use only provided test assets for upload, do not bypass a required interaction path with a similar final state, and emit `Done` only after the requested user action has fully executed.

Output: Return exactly one JSON object: `{"thought": "brief reasoning", "action": "ONE Action"}`

Figure 18: Prompt used by the browser agent to execute one transition.

DOM Assertion Scoring

You are a strict UI test evaluator. Determine whether DOM assertions are satisfied based on structured DOM event evidence collected during a web interaction. The evidence contains the action performed, initial and final DOM snapshots, mutation events, changed attributes, added and removed nodes, and interactive-element summaries.

Assertion semantics:

1. `[CHANGE]` means the condition appeared at any point in the full timeline, including initial snapshot, mutation events, intermediate summaries, or final snapshot.
2. `[AFTER]` means the condition must hold in the final stable state. Timeline evidence may help locate the target, but the final state must satisfy the assertion.

Judging rules:

1. Locate elements by semantic role, tag, ID, class, text, attribute changes, and interactive-element summaries.
2. Hidden or `not-visible` text counts as DOM evidence but not as proof that the text is visibly present.
3. For disabled or non-interactive state, prioritize `pointer-events: none`, native `disabled` or `aria-disabled`, then state-indicative class tokens.
4. For selected, active, highlighted, expanded, checked, or pressed states, use ARIA fields first and class tokens as secondary evidence.
5. Added and removed nodes may prove transient feedback such as loading, saving, progress, confirmation, or disappearance.
6. Judge by semantic equivalence rather than exact wording, but be strict on factual contradictions.
7. Prefer `UNCERTAIN` when evidence is incomplete, except when absence from final interactive elements directly supports a non-interactive or absent assertion.
8. For debounce or delayed-update assertions, accept evidence of a single delayed update after input settles rather than requiring keystroke-level events.
9. Do not treat hidden template text as evidence that a visible status or control is active.

Output: Return exactly one JSON object: `{"evaluations": [{"think": "...", "result": "YES|NO|UNCERTAIN"}]}`

Figure 19: Prompt for judging DOM assertions from mutation evidence.

Visual Postcondition Scoring

You are a strict UI test evaluator. Compare two screenshots: Image 1 is before the interaction, and Image 2 is after the interaction. Determine whether each assertion holds on the current page.

Judging rules:

1. Judge by semantic equivalence rather than exact wording; be strict on factual correctness but lenient on terminology.
2. For conditional assertions, determine which branch applies from the screenshots and evaluate only that branch.
3. If content is clipped by the viewport or a scrollable container, evaluate only fully visible items.
4. Accept small numeric changes hidden by rounding or abbreviation when the structural outcome is otherwise correct.
5. For search or filter assertions, an empty result may pass if the filter is visibly active and the page shows a valid empty state.
6. For body or full-text search, do not require every visible result row to display the matching keyword if the active query and changed result set support the outcome.
7. For visually ambiguous natural-image flips, do not answer `UNCERTAIN` solely because the flip is hard to distinguish when other requested edits are clearly visible.
8. Use before/after differences to judge reordering, expansion, collapsed panels, drag placement, list updates, and stale-state removal.

Output: Return exactly one JSON object: `{"evaluations": [{"think": "...", "result": "YES|NO|UNCERTAIN"}]}`

Figure 20: Prompt for judging postconditions from before/after screenshots.
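
Several of the structural rules in the ICG Generation prompt can be checked mechanically before any browser execution. The sketch below is one possible validator, not part of the WebRISE release: the function name is hypothetical, and the `id` key on states and transitions is an assumed field name, since the prompt fixes the ID format (`S0`, `T1`, ...) but not the key that stores it. Other field names (`from`, `to`, `mapped_test_items`, `preconditions`, `dom_assertions`, `postconditions`) follow the prompt.

```python
def validate_icg(icg: dict, test_item_ids: set[str]) -> list[str]:
    """Return a list of rule violations; an empty list means all checks passed."""
    errors: list[str] = []
    state_ids = {s.get("id") for s in icg.get("states", [])}  # "id" key is assumed
    if "S0" not in state_ids:
        errors.append("missing initial state S0")

    covered: set[str] = set()
    for n, t in enumerate(icg.get("transitions", []), start=1):
        tid = t.get("id", "?")
        if tid != f"T{n}":
            errors.append(f"{tid}: expected sequential ID T{n}")
        for end in ("from", "to"):
            if t.get(end) not in state_ids:
                errors.append(f"{tid}: unknown {end} state {t.get(end)!r}")
        if n > 1 and t.get("preconditions"):
            errors.append(f"{tid}: preconditions are allowed only on T1")
        dom = t.get("dom_assertions", [])
        post = t.get("postconditions", [])
        if not dom and not post:
            errors.append(f"{tid}: needs at least one non-empty assertion list")
        for assertion in dom:
            if not assertion.startswith(("[CHANGE]", "[AFTER]")):
                errors.append(f"{tid}: DOM assertion lacks [CHANGE]/[AFTER] prefix")
        covered.update(t.get("mapped_test_items", []))

    for item in sorted(test_item_ids - covered):
        errors.append(f"test item {item} is not covered by any transition")
    return errors


# Hypothetical ICG with three deliberate violations in T2 and in coverage.
icg = {
    "states": [{"id": "S0"}, {"id": "S1"}],
    "transitions": [
        {"id": "T1", "from": "S0", "to": "S1", "agent_task": "Add an item",
         "mapped_test_items": ["TI-1"],
         "dom_assertions": ["[AFTER] the list shows the new item"]},
        {"id": "T2", "from": "S1", "to": "S0", "agent_task": "Remove the item",
         "mapped_test_items": [], "preconditions": ["the list has one item"],
         "dom_assertions": ["the list is empty"]},
    ],
}
for problem in validate_icg(icg, {"TI-1", "TI-2"}):
    print(problem)
```

Semantic rules such as "no hitchhiking" or state continuity cannot be verified this way and would still need human or model review.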
## Appendix E Code and Data Availability

Upon acceptance, we will release the code and data for WebRISE under the MIT license. The release will include task specifications, requirement annotations, Interaction Contract Graphs, evaluation scripts, prompt templates, and aggregated results for reproducing the main experiments. We will exclude information that may identify individual contributors or annotators for privacy reasons.
