GEBench: Benchmarking Image Generation Models as GUI Environments

February 9, 2026
Authors: Haodong Li, Jingwei Wu, Quan Sun, Guopeng Li, Juanxi Tian, Huanyu Zhang, Yanlin Lai, Ruichuan An, Hongbo Peng, Yuhong Dai, Chenxi Li, Chunmei Qing, Jia Wang, Ziyang Meng, Zheng Ge, Xiangyu Zhang, Daxin Jiang
cs.AI

Abstract

Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.