MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Abstract

The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite these capabilities, researchers lack a comprehensive understanding of VLMs' compositionality: the ability to understand and produce novel combinations of known visual and textual components. Compositionality is a critical ability for VLMs because it facilitates coherent reasoning and understanding across modalities. Yet prior benchmarks provide only a relatively coarse evaluation of it from the perspectives of objects, relations, and attributes, while neglecting deeper reasoning about object interactions, counting, and complex compositions. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs' compositionality. Our proposed benchmark complements these earlier works. With MMCOMPOSITION, we can quantify and explore the compositionality of mainstream VLMs. Surprisingly, we find that GPT-4o's compositionality is inferior to that of the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training. Resources are available at: https://hanghuacs.github.io/MMComposition/
