PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

Abstract

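Unified multimodal models have demonstrated remarkable capabilities in generating natural images and supporting multimodal reasoning, yet their potential to support computer-use planning tasks, which are closely tied to everyday life, has not been fully explored. Image generation and editing in computer-use tasks require capabilities such as spatial reasoning and procedural understanding, and it remains unclear whether unified multimodal models are able to complete these tasks. To this end, we propose PlanViz, a new benchmark dedicated to evaluating image generation and editing in computer-use tasks. To meet this evaluation goal, we focus on sub-tasks that occur frequently in daily life and require planning steps, and design three new sub-tasks: route planning, workflow diagramming, and web and interface display. We address the challenge of ensuring data quality through human-annotated questions, reference images, and a quality control process. To address the difficulty of comprehensive and precise evaluation, we propose PlanScore, a task-adaptive scoring scheme that helps characterize the correctness, visual quality, and efficiency of generated images. The experimental results reveal key limitations in this research area and directions for future work.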

English

Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential to support computer-use planning tasks, which are closely related to our daily lives, remains underexplored. Image generation and editing in computer-use tasks require capabilities such as spatial reasoning and procedural understanding, and it is still unknown whether UMMs possess the capabilities needed to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To achieve this evaluation goal, we focus on sub-tasks that arise frequently in daily life and require planning steps. Specifically, we design three new sub-tasks: route planning, work diagramming, and web & UI displaying. To ensure data quality, we curate human-annotated questions and reference images and apply a quality control process. To enable comprehensive and accurate evaluation, we propose a task-adaptive score, PlanScore, which helps assess the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.