WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Abstract

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) consisting of observable states, user-intent transitions, and DOM/visual assertions, enabling implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist. Defect injection shows that ICG-based scoring detects state errors at 2 to 16 times the rate of checkpoint-style evaluation.