Scaling Language-Free Visual Representation Learning

Abstract

Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or to differences in the training data?" We study this question by training both visual SSL and CLIP models on the same MetaCLIP data and using VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe that visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.
