CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

Abstract

Instruction-based multimodal image manipulation has progressed rapidly in recent years. However, existing evaluation methods lack a systematic, human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Model (MLLM) scoring. We also introduce CREval-Bench, a comprehensive benchmark designed specifically for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Using this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open-source and closed-source models. The results show that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. CREval thus provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.
