MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
October 13, 2024
Authors: Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo
cs.AI
Abstract
The advent of large Vision-Language Models (VLMs) has significantly advanced
multimodal understanding, enabling more sophisticated and accurate integration
of visual and textual information across various tasks, including image and
video captioning, visual question answering, and cross-modal retrieval. Despite
VLMs' superior capabilities, researchers lack a comprehensive understanding of
their compositionality -- the ability to understand and produce novel
combinations of known visual and textual components. Prior benchmarks provide
only a relatively rough compositionality evaluation from the perspectives of
objects, relations, and attributes while neglecting deeper reasoning about
object interactions, counting, and complex compositions. However,
compositionality is a critical ability that facilitates coherent reasoning and
understanding across modalities for VLMs. To address this limitation, we
propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively
and accurately evaluating VLMs' compositionality. Our proposed benchmark serves
as a complement to these earlier works. With MMCOMPOSITION, we can quantify and
explore the compositionality of the mainstream VLMs. Surprisingly, we find
GPT-4o's compositionality inferior to the best open-source model, and we
analyze the underlying reasons. Our experimental analysis reveals the
limitations of VLMs in fine-grained compositional perception and reasoning, and
points to areas for improvement in VLM design and training. Resources available
at: https://hanghuacs.github.io/MMComposition/