Should VLMs be Pre-trained with Image Data?

Abstract

Pre-trained LLMs that are further trained with image data perform well on vision-language tasks. While adding images during a second training phase effectively unlocks this capability, it is unclear how much of a gain or loss this two-step pipeline gives over VLMs that integrate images earlier into the training process. To investigate this, we train models spanning various datasets, scales, image-text ratios, and amounts of pre-training done before introducing vision tokens. We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks while maintaining strong performance on text-only evaluations. Averaged over 6 diverse tasks, we find that for a 1B-parameter model, introducing visual tokens 80% of the way through pre-training yields a 2% average improvement over introducing visual tokens to a fully pre-trained model.

