CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy

Abstract

The recent work CLIPA presents an inverse scaling law for CLIP training: the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. This finding makes it possible to train high-performance CLIP models with significantly less computation. Building on this work, we present CLIPA-v2 with two key contributions. Technically, we find that this inverse scaling law also applies in the finetuning stage, enabling a further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training. With a budget of only $10,000, our CLIP model achieves a zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% while reducing the computational cost by ~39X. Moreover, with an additional investment of $4,000, we can further raise the zero-shot ImageNet accuracy to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.
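To make the link between token sequence length and compute concrete, here is a minimal sketch. It assumes the image token count is reduced by lowering the input resolution, which is only one possible strategy; the abstract does not say which mechanism is used. The function name `vit_patch_tokens` and the three resolutions are illustrative assumptions, not values or code from the CLIPA repository. The patch size of 14 follows the "/14" in the H/14 model name.

```python
def vit_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Count the patch tokens a ViT produces for a square image.

    The class token is not counted. Assumes non-overlapping square patches.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    # Hypothetical resolutions, chosen only to illustrate the trend.
    baseline = vit_patch_tokens(224)
    for size in (84, 224, 336):
        tokens = vit_patch_tokens(size)
        print(f"{size}px -> {tokens} tokens ({tokens / baseline:.2f}x of 224px)")
```

Under these assumptions, the token count grows with the square of the resolution, so a modest reduction in input size removes most of the sequence the encoder has to process.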
