TiC-CLIP: Continual Training of CLIP Models

Abstract

Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive cost of constant retraining, it is imperative to train these models continually. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps, with over 12.7B timestamped image-text pairs spanning 9 years (2014–2022). We first use our benchmarks to curate various dynamic evaluations that measure the temporal robustness of existing models. We show that OpenAI's CLIP (trained on data up to 2020) loses approximately 8% zero-shot accuracy on our curated retrieval task from 2021–2022 compared with more recently trained models in the OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach, which continues training from the last checkpoint and replays old data, reduces compute by 2.5× compared with the standard practice of retraining from scratch.
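
As a rough illustration of the rehearsal idea summarized in the abstract (continue from the last checkpoint and replay old data), the following is a minimal sketch and not the paper's implementation. The names `build_training_pool`, `continual_train`, `train_fn`, and the `replay_fraction` knob are hypothetical, and the uniform sampling of old data is an assumption made only for this example.

```python
import random


def build_training_pool(new_data, old_data_by_step, replay_fraction=0.5, seed=0):
    """Mix the newest time step's data with a replayed sample of older data.

    replay_fraction (hypothetical knob) is the target share of the pool that
    comes from earlier time steps; it must satisfy 0 <= replay_fraction < 1.
    """
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Flatten everything seen at earlier time steps into one replay buffer.
    old = [x for step in old_data_by_step for x in step]
    if not old or replay_fraction == 0:
        return list(new_data)
    # Number of old samples needed so they make up replay_fraction of the
    # pool, capped by how much old data actually exists.
    wanted = round(len(new_data) * replay_fraction / (1 - replay_fraction))
    n_replay = min(len(old), wanted)
    # Assumption: old samples are drawn uniformly without replacement.
    return list(new_data) + rng.sample(old, n_replay)


def continual_train(steps, train_fn, init_state=None, replay_fraction=0.5):
    """Train over time-ordered data, warm-starting from the last checkpoint.

    train_fn(state, pool) -> new_state is a caller-supplied stand-in for one
    round of training; state plays the role of the model checkpoint.
    """
    state = init_state
    seen = []
    for new_data in steps:
        pool = build_training_pool(new_data, seen, replay_fraction)
        state = train_fn(state, pool)  # continue from the previous state
        seen.append(list(new_data))
    return state
```

The contrast with retraining from scratch is that `state` is carried across time steps rather than reset, so each round only has to process the new data plus a replayed subset of the old data.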
