"제로샷"은 기하급수적인 데이터 없이 불가능: 사전 학습 개념 빈도가 멀티모달 모델 성능을 결정한다

Abstract

Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such models, because it is not known to what extent their pretraining datasets already contain the downstream concepts targeted during "zero-shot" evaluation. In this work, we ask: how is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and when testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models perform poorly across the board. We contribute this long-tail test set as the "Let it Wag!" benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
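To make the log-linear trend concrete, the following is a minimal sketch of what "exponentially more data for linear improvement" means in practice. It assumes a relationship of the form accuracy = a + b * log10(frequency) and fits it by ordinary least squares. The function names `fit_log_linear` and `frequency_needed` are hypothetical, and the numbers are invented for illustration only; they are not results from the study.

```python
import math


def fit_log_linear(frequencies, accuracies):
    """Least-squares fit of accuracy = a + b * log10(frequency)."""
    xs = [math.log10(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def frequency_needed(target_accuracy, a, b):
    """Invert the fitted line: concept frequency required for a target accuracy."""
    return 10 ** ((target_accuracy - a) / b)


# Made-up numbers for illustration only (assumption, not data from the paper):
# each tenfold increase in concept frequency adds the same accuracy increment.
concept_frequencies = [10, 100, 1_000, 10_000, 100_000]
zero_shot_accuracy = [0.10, 0.25, 0.40, 0.55, 0.70]

intercept, slope = fit_log_linear(concept_frequencies, zero_shot_accuracy)
required = frequency_needed(0.85, intercept, slope)
```

On these invented numbers, every tenfold increase in concept frequency buys the same fixed gain of 0.15 in accuracy, so reaching the next increment requires ten times as many pretraining examples of the concept as the previous one did.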

"제로샷"은 기하급수적인 데이터 없이 불가능: 사전 학습 개념 빈도가 멀티모달 모델 성능을 결정한다
