An Inverse Scaling Law for CLIP Training

Abstract

CLIP, the first foundation model that connects images and text, has enabled many recent breakthroughs in computer vision. However, its associated training cost is prohibitively high, imposing a significant barrier to its widespread exploration. In this paper, we present a surprising finding that there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we showcase that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to successfully train CLIP using only academic resources. For example, on an A100 eight-GPU server, our CLIP models achieve zero-shot top-1 ImageNet accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.