

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

October 13, 2024
作者: Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo
cs.AI

Abstract

The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs' superior capabilities, researchers lack a comprehensive understanding of their compositionality -- the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively rough compositionality evaluation from the perspectives of objects, relations, and attributes, while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs' compositionality. Our proposed benchmark serves as a complement to these earlier works. With MMCOMPOSITION, we can quantify and explore the compositionality of mainstream VLMs. Surprisingly, we find GPT-4o's compositionality inferior to that of the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training. Resources available at: https://hanghuacs.github.io/MMComposition/
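As a hedged illustration only (not the paper's actual evaluation code, and the category names and record format below are assumptions, not MMCOMPOSITION's real schema), evaluation on a human-annotated benchmark of this kind typically reduces to per-category accuracy over question-answer pairs, which is what makes fine-grained weaknesses such as counting visible:

```python
# Sketch of per-category accuracy scoring for a multiple-choice
# compositionality benchmark. The (category, prediction, gold) record
# layout is a hypothetical stand-in for the benchmark's real format.
from collections import defaultdict

def score_by_category(samples):
    """Return {category: accuracy} from (category, predicted, gold) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in samples:
        total[category] += 1
        if predicted == gold:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}

# Toy records: two attribute questions (one right) and two counting
# questions (both right).
samples = [
    ("attribute", "A", "A"),
    ("attribute", "B", "C"),
    ("counting", "2", "2"),
    ("counting", "3", "3"),
]
print(score_by_category(samples))  # → {'attribute': 0.5, 'counting': 1.0}
```

Breaking accuracy out per category, rather than reporting a single aggregate score, is what lets an analysis like the paper's separate coarse object/attribute recognition from deeper compositional reasoning.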

