COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?

Abstract

Compositional reasoning is a hallmark of human visual intelligence; yet despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. Using Cola as a testbed, we explore modeling designs to adapt pre-trained vision-language models to reason compositionally about multiple attributes attached to multiple objects. We explore 6 finetuning strategies on 2 seminal vision-language models, using 3 finetuning datasets and 2 test benchmarks (Cola and CREPE). Surprisingly, our optimal finetuning strategy improves a 151M parameter CLIP, which disjointly encodes image and language during pretraining, to perform as well as a 241M parameter FLAVA, which uses a multi-modal transformer encoder during pretraining to attend over both vision and language modalities. This optimal finetuning strategy is a lightweight multi-modal adapter that jointly attends over both image and language features generated by the pretrained model. We show this works better than common strategies such as prompt tuning and finetuning, or tuning a comparable number of unimodal layers.
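
To make the idea of jointly attending over image and language features concrete, the following is a toy sketch only, not the architecture from the paper. It assumes the pretrained encoders have already produced small feature vectors for image patches and caption tokens, uses a single attention pass with identity projections (no learned weights), and scores an image-caption pair by a dot product of the pooled outputs. The function names `joint_self_attention`, `adapter_score`, and `retrieve` are hypothetical and chosen purely for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def joint_self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections, so there is nothing
    to train here. A real adapter would learn these projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def adapter_score(image_feats, text_feats):
    """Attend over the concatenated image and text features, then score the pair."""
    fused = joint_self_attention(image_feats + text_feats)
    n_img = len(image_feats)
    return dot(mean_pool(fused[:n_img]), mean_pool(fused[n_img:]))


def retrieve(text_feats, candidate_images):
    """Text-to-image retrieval: index of the highest-scoring candidate image."""
    scores = [adapter_score(img, text_feats) for img in candidate_images]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up 2-d features: the caption aligns with image_a, not with image_b.
caption = [[1.0, 0.0]]
image_a = [[1.0, 0.0], [1.0, 0.0]]
image_b = [[0.0, 1.0], [0.0, 1.0]]
best = retrieve(caption, [image_b, image_a])
```

The point of the sketch is that every image feature can attend to every text feature (and vice versa) before the pair is scored, in contrast to scoring two independently encoded vectors. In the toy data above, the caption is retrieved against two candidates and the aligned image wins.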
