Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
March 18, 2025
Authors: Xinyu Fang, Zhijian Chen, Kai Lan, Shengyuan Ding, Yingji Liang, Xiangyu Zhao, Farong Wen, Zicheng Zhang, Guofeng Zhang, Haodong Duan, Kai Chen, Dahua Lin
cs.AI
Abstract
Creativity is a fundamental aspect of intelligence, involving the ability to
generate novel and appropriate solutions across diverse contexts. While Large
Language Models (LLMs) have been extensively evaluated for their creative
capabilities, the assessment of Multimodal Large Language Models (MLLMs) in
this domain remains largely unexplored. To address this gap, we introduce
Creation-MMBench, a multimodal benchmark specifically designed to evaluate the
creative capabilities of MLLMs in real-world, image-based tasks. The benchmark
comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous
evaluation, we define instance-specific evaluation criteria for each test case,
guiding the assessment of both general response quality and factual consistency
with visual inputs. Experimental results reveal that current open-source MLLMs
significantly underperform compared to proprietary models in creative tasks.
Furthermore, our analysis demonstrates that visual fine-tuning can negatively
impact the base LLM's creative abilities. Creation-MMBench provides valuable
insights for advancing MLLM creativity and establishes a foundation for future
improvements in multimodal generative intelligence. Full data and evaluation
code are released at https://github.com/open-compass/Creation-MMBench.