

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

April 24, 2024
作者: Sachin Mehta, Maxwell Horton, Fartash Faghri, Mohammad Hossein Sekhavat, Mahyar Najibi, Mehrdad Farajtabar, Oncel Tuzel, Mohammad Rastegari
cs.AI

Abstract

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in the contrastive loss between image and text pairs poses computational challenges. This paper presents CatLIP, a novel weakly supervised pre-training method for vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in the contrastive loss, achieving a remarkable 2.7× acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality. Our source code along with pre-trained model weights and training recipes is available at https://github.com/apple/corenet.
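The abstract's central contrast — a B×B pairwise similarity loss versus a per-image classification loss — can be sketched as follows. This is a minimal NumPy illustration of the two loss shapes only, not the authors' implementation; the label-construction details (e.g. extracting a noun vocabulary from captions) are assumptions based on the abstract's description.

```python
import numpy as np

def _log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE: requires the full B x B similarity
    matrix between every image and every text embedding in the batch."""
    logits = (img_emb @ txt_emb.T) / temperature      # (B, B) pairwise sims
    idx = np.arange(img_emb.shape[0])                 # matched pairs = diagonal
    loss_i2t = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def multilabel_classification_loss(img_logits, targets, eps=1e-12):
    """Classification reframing in the spirit of CatLIP: each image gets a
    multi-hot target over a fixed vocabulary (hypothetically, nouns drawn
    from its caption), trained with per-class binary cross-entropy.
    There is no B x B matrix, so cost grows linearly with batch size."""
    p = 1.0 / (1.0 + np.exp(-img_logits))             # (B, V) sigmoid probs
    return -(targets * np.log(p + eps)
             + (1.0 - targets) * np.log(1.0 - p + eps)).mean()
```

Note the scaling difference the sketch makes visible: doubling the batch quadruples the contrastive similarity matrix, while the classification loss stays linear in batch size, which is consistent with the training speedup the paper reports.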

