

CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

March 27, 2026
Authors: Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, Hongxun Yao
cs.AI

Abstract

Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic, human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Model (MLLM) scoring. Alongside the pipeline, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open-source and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. CREval thus provides a reliable foundation for evaluating image editing models on complex and creative manipulation tasks, and it highlights key challenges and opportunities for future research.
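
The abstract describes the core mechanism only at a high level: instead of asking an MLLM for one opaque holistic score, editing quality is decomposed into verification questions whose answers are aggregated per dimension. The sketch below illustrates the general shape of such a QA-based scoring loop. It is not the paper's implementation: `EvalQuery`, `ask_mllm`, `qa_score`, and the fraction-of-"yes"-answers aggregation are all illustrative assumptions.

```python
# A minimal sketch, NOT CREval's released code: one way a QA-based
# evaluation loop could work. EvalQuery, ask_mllm, and the
# fraction-of-"yes"-answers aggregation are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class EvalQuery:
    dimension: str  # e.g. one of the benchmark's nine creative dimensions
    question: str   # a binary (yes/no) verification question


def ask_mllm(image_path: str, question: str) -> str:
    """Hypothetical wrapper around any multimodal LLM API.

    Replace the body with a real client call that sends the edited
    image plus the question and returns the model's text answer.
    """
    raise NotImplementedError("plug in an MLLM backend here")


def qa_score(image_path: str, queries: list[EvalQuery]) -> dict[str, float]:
    """Score an edited image per dimension as the fraction of 'yes' answers."""
    answers: dict[str, list[int]] = {}
    for q in queries:
        reply = ask_mllm(image_path, q.question).strip().lower()
        answers.setdefault(q.dimension, []).append(1 if reply.startswith("yes") else 0)
    return {dim: sum(vals) / len(vals) for dim, vals in answers.items()}


if __name__ == "__main__":
    queries = [
        EvalQuery("instruction_following", "Does the edited image show the requested change?"),
        EvalQuery("consistency", "Are unedited regions unchanged from the source image?"),
    ]
    # qa_score("edited.png", queries)  # requires a real ask_mllm backend
```

Because each question is answered independently, the per-dimension scores stay interpretable: a low score points to the specific failed checks rather than to an unexplained aggregate number.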