Perception Encoder: The best visual embeddings are not at the output of the network

Abstract

We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining it with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
