
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

April 4, 2024
作者: Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H. S. Torr, Adel Bibi, Samuel Albanie, Matthias Bethge
cs.AI

Abstract

Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted during "zero-shot" evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and when testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the "Let it Wag!" benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
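The log-linear scaling trend described above can be made concrete with a small sketch: if downstream accuracy grows linearly while pretraining concept frequency grows exponentially, then accuracy is a linear function of log-frequency, and each 10x increase in data buys only a constant accuracy gain. The frequencies and accuracies below are synthetic, illustrative numbers, not the paper's measurements.

```python
import numpy as np

# Synthetic illustration of a log-linear scaling trend: accuracy
# improves by a constant amount for every 10x increase in how often
# a concept appears in the pretraining data. Numbers are made up.
concept_frequency = np.array([1e2, 1e3, 1e4, 1e5, 1e6])  # pretraining occurrences
accuracy = np.array([0.22, 0.35, 0.48, 0.61, 0.74])      # illustrative accuracies

# Fit accuracy = slope * log10(frequency) + intercept
slope, intercept = np.polyfit(np.log10(concept_frequency), accuracy, deg=1)

# Under this trend, each order of magnitude more data adds `slope` to
# accuracy: a sample-inefficient, exponential data requirement.
print(f"gain per 10x data: {slope:.3f}, intercept: {intercept:.3f}")
```

Reading the fit the other way makes the paper's point: closing a fixed accuracy gap requires multiplying, not adding, pretraining examples of the concept.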
