

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

March 18, 2025
Authors: Xinyu Fang, Zhijian Chen, Kai Lan, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, Dahua Lin
cs.AI

Abstract

Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code are released at https://github.com/open-compass/Creation-MMBench.
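The abstract describes an evaluation protocol in which each test case carries its own criteria and a judge scores both general response quality and factual consistency with the image. The Python sketch below illustrates that dual-criterion, per-instance setup; the names used here (TestCase, build_judge_prompt, call_judge) and the 1-10 scoring scale are illustrative assumptions, not the interface of the evaluation code released on GitHub.

# Hypothetical sketch of instance-specific, criteria-guided judging as described
# in the abstract. All names below are illustrative placeholders, not the
# repository's actual API.

from dataclasses import dataclass, field
from typing import List


@dataclass
class TestCase:
    """One image-grounded creative task paired with its own evaluation criteria."""
    image_path: str
    instruction: str
    quality_criteria: List[str] = field(default_factory=list)   # general response quality
    factual_criteria: List[str] = field(default_factory=list)   # consistency with the image


def build_judge_prompt(case: TestCase, response: str) -> str:
    """Compose a judging prompt that scores a response on both axes."""
    quality = "\n".join(f"- {c}" for c in case.quality_criteria)
    factual = "\n".join(f"- {c}" for c in case.factual_criteria)
    return (
        f"Task instruction:\n{case.instruction}\n\n"
        f"Candidate response:\n{response}\n\n"
        "Judge the response against the image and the criteria below.\n"
        f"General quality criteria:\n{quality}\n\n"
        f"Visual factuality criteria:\n{factual}\n\n"
        "Return two scores from 1 to 10: quality and factuality."
    )


def call_judge(prompt: str, image_path: str) -> dict:
    """Placeholder for a call to a multimodal judge model.
    Replace with a real API or local model call; here it returns dummy scores."""
    return {"quality": 0, "factuality": 0}


if __name__ == "__main__":
    case = TestCase(
        image_path="example.jpg",
        instruction="Write a short travel-journal entry inspired by this photo.",
        quality_criteria=["Engaging, coherent narrative", "Matches the requested style"],
        factual_criteria=["Mentions only objects actually visible in the photo"],
    )
    prompt = build_judge_prompt(case, response="A draft journal entry ...")
    print(call_judge(prompt, case.image_path))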
