

No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

April 4, 2024
作者: Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H. S. Torr, Adel Bibi, Samuel Albanie, Matthias Bethge
cs.AI

Abstract

Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted for during "zero-shot" evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the "Let it Wag!" benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
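The log-linear scaling trend described above means that each multiplicative (e.g. 10x) increase in a concept's pretraining frequency buys only a fixed additive gain in downstream "zero-shot" performance. A minimal sketch of what fitting such a trend looks like, using purely synthetic data (the frequencies, slope, and intercept below are illustrative assumptions, not numbers from the paper):

```python
import numpy as np

# Hypothetical concept frequencies spanning several orders of magnitude,
# from 10 to 1,000,000 pretraining occurrences.
freqs = np.logspace(1, 6, 20)

# Under a log-linear trend, accuracy grows linearly in log10(frequency).
# Slope/intercept here are made up for illustration.
true_slope, true_intercept = 8.0, 5.0
accuracy = true_slope * np.log10(freqs) + true_intercept

# Recover the trend by fitting a line in log-frequency space.
slope, intercept = np.polyfit(np.log10(freqs), accuracy, deg=1)

# The slope is the accuracy gained per 10x more pretraining data:
# linear improvement demands exponentially more examples.
print(f"fitted slope: {slope:.2f}, intercept: {intercept:.2f}")
```

In practice the paper fits such trends to measured concept frequencies in CC-3M, CC-12M, YFCC-15M, LAION-400M, and LAION-Aesthetics; this sketch only illustrates the functional form.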
