MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
October 13, 2024
Authors: Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo
cs.AI
Abstract
The advent of large Vision-Language Models (VLMs) has significantly advanced
multimodal understanding, enabling more sophisticated and accurate integration
of visual and textual information across various tasks, including image and
video captioning, visual question answering, and cross-modal retrieval. Despite
VLMs' superior capabilities, researchers lack a comprehensive understanding of
their compositionality -- the ability to understand and produce novel
combinations of known visual and textual components. Prior benchmarks provide
only a relatively rough compositionality evaluation from the perspectives of
objects, relations, and attributes while neglecting deeper reasoning about
object interactions, counting, and complex compositions. However,
compositionality is a critical ability that facilitates coherent reasoning and
understanding across modalities for VLMs. To address this limitation, we
propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively
and accurately evaluating VLMs' compositionality. Our proposed benchmark serves
as a complement to these earlier works. With MMCOMPOSITION, we can quantify and
explore the compositionality of the mainstream VLMs. Surprisingly, we find
GPT-4o's compositionality inferior to the best open-source model, and we
analyze the underlying reasons. Our experimental analysis reveals the
limitations of VLMs in fine-grained compositional perception and reasoning, and
points to areas for improvement in VLM design and training. Resources available
at: https://hanghuacs.github.io/MMComposition/