TiC-CLIP: Continual Training of CLIP Models
October 24, 2023
Authors: Saurabh Garg, Mehrdad Farajtabar, Hadi Pouransari, Raviteja Vemulapalli, Sachin Mehta, Oncel Tuzel, Vaishaal Shankar, Fartash Faghri
cs.AI
Abstract
Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps, with over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first use our benchmarks to curate various dynamic evaluations to measure the temporal robustness of existing models. We show that OpenAI's CLIP (trained on data up to 2020) loses approximately 8% zero-shot accuracy on our curated retrieval task from 2021--2022 compared with more recently trained models in the OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by 2.5x compared to the standard practice of retraining from scratch.
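
The rehearsal-based approach described in the last sentence (warm-starting from the previous checkpoint and mixing replayed old data with the new time step's data) can be pictured with a short training-loop sketch. The code below is a minimal illustration in PyTorch, not the authors' released implementation; the `clip_loss` placeholder, the dataset format, and the default 1:1 replay mixture are assumptions made for the example.

```python
# Minimal sketch of rehearsal-based continual training (illustrative only).
# Idea: instead of retraining from scratch at time step t, resume from the
# checkpoint trained on data up to t-1 and train on a buffer that mixes the
# new data with replayed samples from earlier time steps.
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset


def clip_loss(model, images, texts):
    # Placeholder for the image-text contrastive loss used to train CLIP.
    raise NotImplementedError


def continual_update(model, optimizer, new_data, old_data, ckpt_path,
                     replay_ratio=1.0, epochs=1, batch_size=256, device="cpu"):
    """One continual-training round. `replay_ratio` controls how much old
    data is replayed relative to the new data (1.0 = equal amounts)."""
    # Warm start: load the previous checkpoint instead of reinitializing.
    state = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])

    # Subsample the old data to the requested replay budget.
    n_replay = min(len(old_data), int(replay_ratio * len(new_data)))
    replay_idx = torch.randperm(len(old_data))[:n_replay].tolist()
    mixture = ConcatDataset([new_data, Subset(old_data, replay_idx)])
    loader = DataLoader(mixture, batch_size=batch_size, shuffle=True)

    model.train()
    for _ in range(epochs):
        for images, texts in loader:
            optimizer.zero_grad()
            loss = clip_loss(model, images.to(device), texts.to(device))
            loss.backward()
            optimizer.step()

    # Save the new checkpoint so the next time step can warm-start from it.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, ckpt_path)
    return model
```

Warm-starting plus replay avoids repeating a full training run over all accumulated data every time new data arrives, which is the source of the reported compute savings relative to retraining from scratch.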