Image Captioners Are Scalable Vision Learners Too

Abstract

Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data is commonly considered an inferior pretraining strategy. In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. We further analyze the effect of the model architecture and scale, as well as the pretraining data, on the representation quality, and find that captioning exhibits the same or better scaling behavior along these axes. Overall, our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.
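The two pretraining objectives compared in the abstract can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact loss formulations, so a symmetric InfoNCE-style loss stands in for contrastive pretraining and mean next-token cross-entropy stands in for captioning. The function names, the temperature value, and the input layout are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Assumption: img_emb[i] and txt_emb[i] are L2-normalized vectors that
    belong to the same image-text pair; all other pairs are negatives.
    """
    n = len(img_emb)
    # Pairwise similarity matrix, scaled by the (assumed) temperature.
    sim = [
        [sum(a * b for a, b in zip(img_emb[i], txt_emb[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each row should put its mass on the diagonal entry.
    img_to_txt = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image: the same, taken over columns.
    txt_to_img = -sum(
        log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (img_to_txt + txt_to_img)


def captioning_loss(token_logits, caption_ids):
    """Mean next-token cross-entropy for a single caption.

    Assumption: token_logits[t] holds the decoder's vocabulary logits at
    step t (conditioned on the image and the earlier tokens), and
    caption_ids[t] is the ground-truth token id at that step.
    """
    nll = [
        -log_softmax(logits)[tok]
        for logits, tok in zip(token_logits, caption_ids)
    ]
    return sum(nll) / len(nll)
```

In this sketch the contrastive loss needs only one embedding per image and per text, whereas the captioning loss scores every token of the caption given the image. In an encoder-decoder setup, that per-token signal is what trains the vision encoder.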
