Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

Abstract

The creation of high-quality human-labeled image-caption datasets is a significant bottleneck in the development of Visual-Language Models (VLMs). We propose a novel approach that leverages the strengths of Large Language Models (LLMs) and image generation models to create synthetic image-text pairs for efficient and effective VLM training. Our method pretrains a text-to-image model to synthesize image embeddings from captions generated by an LLM. These synthetic pairs are then used to train a VLM. Extensive experiments demonstrate that a VLM trained with synthetic data achieves comparable performance on image captioning while requiring only a fraction of the data used by models trained solely on human-annotated data. In particular, we outperform the baseline by 17% through augmentation with a synthetic dataset. Furthermore, we show that synthesizing in the image embedding space is 25% faster than synthesizing in the pixel space. This research introduces a promising technique for generating large-scale, customizable image datasets, leading to enhanced VLM performance and wider applicability across various domains, with improved data efficiency and resource utilization.
