

TiC-CLIP: Continual Training of CLIP Models

October 24, 2023
作者: Saurabh Garg, Mehrdad Farajtabar, Hadi Pouransari, Raviteja Vemulapalli, Sachin Mehta, Oncel Tuzel, Vaishaal Shankar, Fartash Faghri
cs.AI

Abstract

Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps, with over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first use our benchmarks to curate various dynamic evaluations to measure the temporal robustness of existing models. We show that OpenAI's CLIP (trained on data up to 2020) loses approximately 8% zero-shot accuracy on our curated retrieval task from 2021--2022 compared with more recently trained models in the OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach, which continues training from the last checkpoint and replays old data, reduces compute by 2.5× compared to the standard practice of retraining from scratch.
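As an illustration of the rehearsal-based approach described in the abstract, below is a minimal sketch, not the authors' released code, of resuming training from the latest checkpoint on the newest data while replaying a random sample of older data. The function and parameter names (`continual_update`, `replay_ratio`, the CLIP-style contrastive loss) are assumptions made for illustration; `model` stands for any dual encoder that returns image and text embeddings.

```python
# Hypothetical sketch of rehearsal-based continual training for a CLIP-style model.
import random

import torch
import torch.nn.functional as F
from torch.utils.data import ConcatDataset, DataLoader, Subset


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over the in-batch image/text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2


def continual_update(model, optimizer, new_dataset, old_datasets,
                     replay_ratio=1.0, batch_size=1024, steps=10_000,
                     device="cuda"):
    """Resume training from the model's current weights on the newest data,
    replaying a sample of older data instead of retraining from scratch."""
    # Build a replay buffer from earlier time steps; replay_ratio controls how
    # much old data is mixed in relative to the size of the new data.
    old_pool = ConcatDataset(old_datasets)
    replay_size = min(int(replay_ratio * len(new_dataset)), len(old_pool))
    replay_idx = random.sample(range(len(old_pool)), replay_size)
    mixed = ConcatDataset([new_dataset, Subset(old_pool, replay_idx)])

    loader = DataLoader(mixed, batch_size=batch_size, shuffle=True,
                        num_workers=8, drop_last=True)
    model.train().to(device)

    step = 0
    while step < steps:
        for images, texts in loader:
            images, texts = images.to(device), texts.to(device)
            image_emb, text_emb = model(images, texts)  # CLIP-style dual encoder
            loss = clip_contrastive_loss(image_emb, text_emb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= steps:
                return model
    return model
```

The design intent, as described in the abstract, is that warm-starting from the last checkpoint avoids paying the full cost of training from scratch at each time step, while mixing in replayed old data counteracts forgetting of earlier distributions.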