ABC: Achieving Better Control of Multimodal Embeddings using VLMs

Abstract

Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. These tasks necessitate a multimodal embedding model, which outputs embeddings that combine visual and natural language input. Existing CLIP-based approaches embed images and text independently and fuse the result. We find that this results in weak interactions between modalities and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of multimodal embeddings by offering high-quality representations and flexible natural language control. Our model and datasets are available at our project page.
