**GEBench: Benchmarking Image Generation Models as GUI Environments**
February 9, 2026
Authors: Haodong Li, Jingwei Wu, Quan Sun, Guopeng Li, Juanxi Tian, Huanyu Zhang, Yanlin Lai, Ruichuan An, Hongbo Peng, Yuhong Dai, Chenxi Li, Chunmei Qing, Jia Wang, Ziyang Meng, Zheng Ge, Xiangyu Zhang, Daxin Jiang
cs.AI
Abstract
Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general-domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly to maintain temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.
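
To make the five-dimensional structure of GE-Score concrete, the following is a minimal Python sketch of how per-sample ratings on the five named dimensions might be stored and summarized. Only the dimension names come from the abstract. The `GEScore` class, the `summarize` helper, the 0-to-5 rating scale, and the unweighted mean used in `overall` are all assumptions made for illustration; the abstract does not state how the dimensions are rated or aggregated, and this is not the API of the GEBench repository.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class GEScore:
    """Ratings for one generated GUI sample on the five GE-Score dimensions.

    Hypothetical container: the 0-to-5 scale is assumed, not taken from the paper.
    """

    goal_achievement: float
    interaction_logic: float
    content_consistency: float
    ui_plausibility: float
    visual_quality: float

    def overall(self) -> float:
        # Assumed aggregation: a plain unweighted mean of the five dimensions.
        return mean(getattr(self, f.name) for f in fields(self))


def summarize(scores: list[GEScore]) -> dict[str, float]:
    """Average each dimension across samples so weak dimensions stand out."""
    return {
        f.name: mean(getattr(s, f.name) for s in scores)
        for f in fields(GEScore)
    }
```

Keeping the dimensions separate until the final step, rather than collapsing each sample to one number early, is what would let an evaluation compare, for example, Goal Achievement on single-step samples against multi-step trajectories.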