

Image Captioners Are Scalable Vision Learners Too

June 13, 2023
Authors: Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer
cs.AI

Abstract

Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data is commonly considered an inferior pretraining strategy. In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. We further analyze the effect of the model architecture and scale, as well as the pretraining data on the representation quality, and find that captioning exhibits the same or better scaling behavior along these axes. Overall our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.
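To make the comparison concrete, below is a minimal sketch of the two pretraining objectives discussed in the abstract: a CLIP-style contrastive (InfoNCE) loss over pooled image and text embeddings, and a captioning loss in which a small transformer decoder cross-attends to image features and predicts caption tokens autoregressively. This is not the paper's code; PyTorch, all module sizes, names, and the toy data are illustrative assumptions.

```python
# Illustrative sketch only: contrastive vs. captioning pretraining objectives.
# The tiny linear "vision encoder" stands in for a real backbone (e.g. a ViT).
import torch
import torch.nn as nn
import torch.nn.functional as F

D, V, T, B = 256, 1000, 16, 8  # embed dim, vocab size, caption length, batch size

# Stand-in vision backbone producing one feature vector per image.
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, D))

def contrastive_loss(images, token_ids, text_embed, temperature=0.07):
    """CLIP-style InfoNCE: match image i with caption i within the batch."""
    img = F.normalize(vision_encoder(images), dim=-1)               # (B, D)
    txt = F.normalize(text_embed(token_ids).mean(dim=1), dim=-1)    # (B, D), mean-pooled
    logits = img @ txt.t() / temperature                            # (B, B) similarity matrix
    labels = torch.arange(images.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

class TinyCaptioner(nn.Module):
    """Encoder-decoder captioner: the decoder cross-attends to image features."""
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(V, D)
        layer = nn.TransformerDecoderLayer(d_model=D, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D, V)

    def forward(self, images, token_ids):
        memory = vision_encoder(images).unsqueeze(1)                 # (B, 1, D) image "tokens"
        tgt = self.text_embed(token_ids[:, :-1])                     # teacher forcing
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)               # (B, T-1, D)
        logits = self.lm_head(out)                                   # (B, T-1, V)
        # Captioning objective: next-token cross-entropy, no in-batch negatives needed.
        return F.cross_entropy(logits.reshape(-1, V), token_ids[:, 1:].reshape(-1))

# Toy batch: random "images" and random caption token ids.
images = torch.randn(B, 3, 32, 32)
token_ids = torch.randint(0, V, (B, T))

captioner = TinyCaptioner()
print("contrastive loss:", contrastive_loss(images, token_ids, captioner.text_embed).item())
print("captioning loss: ", captioner(images, token_ids).item())
```

The key structural difference the paper exploits is visible here: the contrastive objective only needs a pooled image embedding and in-batch negatives, while the captioning objective trains the same vision encoder through a generative decoder, which the abstract reports yields classification-competitive and vision-and-language-stronger representations.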