Scaling Language-Free Visual Representation Learning

April 1, 2025
Authors: David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, Saining Xie
cs.AI

Abstract

Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.
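The comparison in the abstract hinges on the difference between the two pretraining signals. The sketch below is a rough illustration only, not the paper's training code: it contrasts a CLIP-style symmetric image-text contrastive loss with a generic language-free SSL objective (here, a self-distillation-style view-alignment loss). The function names, the temperature value, and the particular SSL objective are assumptions made for illustration. The point is simply that the CLIP loss requires paired captions, while the SSL loss is computed from images alone, which is what makes training both families on the same MetaCLIP data a controlled test of language supervision.

```python
# Minimal sketch (not the paper's code) of the two pretraining signals compared
# in the abstract. Names and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric image-text contrastive (InfoNCE) loss.

    image_emb, text_emb: (N, D) embeddings of N paired images and captions.
    Requires language supervision: every image needs a matching caption.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each image should match its own caption, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def image_only_ssl_loss(student_emb: torch.Tensor,
                        teacher_emb: torch.Tensor) -> torch.Tensor:
    """Generic language-free SSL signal: align embeddings of two augmented
    views of the same image (self-distillation style). No text is involved.
    teacher_emb is assumed to come from a stop-gradient / EMA branch.
    """
    student_emb = F.normalize(student_emb, dim=-1)
    teacher_emb = F.normalize(teacher_emb, dim=-1)
    # Equivalent to a normalized MSE between the two views.
    return (2 - 2 * (student_emb * teacher_emb).sum(dim=-1)).mean()
```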
