PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

Abstract

Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential to support computer-use planning tasks, which are closely tied to everyday life, remains underexplored. Image generation and editing in computer-use tasks require capabilities such as spatial reasoning and procedural understanding, and it is still unknown whether UMMs have the capabilities needed to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To this end, we focus on sub-tasks that arise frequently in daily life and require planning steps. Specifically, we design three new sub-tasks: route planning, work diagramming, and web & UI displaying. To address the challenge of ensuring data quality, we curate human-annotated questions and reference images and apply a quality control process. To address the challenge of comprehensive and accurate evaluation, we propose a task-adaptive score, PlanScore. The score helps assess the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.

