Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency

Abstract

Current vision-language generative models rely on large corpora of paired image-text data to reach their best performance and generalization. However, collecting such data automatically (e.g., via large-scale web scraping) yields low-quality pairs with weak image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce ITIT (InTegrating Image Text), a training paradigm grounded in cycle consistency that allows vision-language training on unpaired image and text data. ITIT consists of a joint image-text encoder with disjoint image and text decoders, which together enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT uses a small set of paired image-text data to ensure that its output matches the input reasonably well in both directions. Simultaneously, the model is trained on much larger datasets containing only images or only texts. This is achieved by enforcing cycle consistency between the original unpaired samples and their cycle-generated counterparts. For instance, the model generates a caption for a given input image, uses that caption to generate an output image, and enforces similarity between the input and output images. Our experiments show that ITIT trained with unpaired datasets exhibits scaling behavior similar to that of training with high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models while using orders of magnitude less paired image-text data (only 3M pairs).
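The training objective described above can be illustrated with a minimal sketch. This is not the ITIT implementation: the abstract gives no loss formulas, so everything here is an assumption made for illustration. Images and texts are both stood in for by small float vectors, the two generators are hypothetical toy functions (`toy_image_to_text`, `toy_text_to_image`), mean squared error is a placeholder for whatever similarity measure the model actually uses, and `cycle_weight` is a hypothetical balancing parameter.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Model = Callable[[Sequence[float]], Vector]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error; a placeholder similarity measure (assumption)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def paired_loss(image: Sequence[float], text: Sequence[float],
                image_to_text: Model, text_to_image: Model) -> float:
    """Supervised loss on a paired sample, applied in both directions."""
    return mse(image_to_text(image), text) + mse(text_to_image(text), image)


def image_cycle_loss(image: Sequence[float],
                     image_to_text: Model, text_to_image: Model) -> float:
    """Image -> caption -> image; compare the result with the input image."""
    caption = image_to_text(image)
    reconstruction = text_to_image(caption)
    return mse(image, reconstruction)


def text_cycle_loss(text: Sequence[float],
                    text_to_image: Model, image_to_text: Model) -> float:
    """Text -> image -> text; compare the result with the input text."""
    generated_image = text_to_image(text)
    reconstruction = image_to_text(generated_image)
    return mse(text, reconstruction)


def total_loss(paired: Sequence[Tuple[Sequence[float], Sequence[float]]],
               unpaired_images: Sequence[Sequence[float]],
               unpaired_texts: Sequence[Sequence[float]],
               image_to_text: Model, text_to_image: Model,
               cycle_weight: float = 1.0) -> float:
    """Small paired set plus larger unpaired sets, combined into one loss."""
    supervised = sum(
        paired_loss(img, txt, image_to_text, text_to_image)
        for img, txt in paired
    )
    cycle = sum(
        image_cycle_loss(img, image_to_text, text_to_image)
        for img in unpaired_images
    ) + sum(
        text_cycle_loss(txt, text_to_image, image_to_text)
        for txt in unpaired_texts
    )
    return supervised + cycle_weight * cycle


# Hypothetical toy generators: exact inverses of each other.
def toy_image_to_text(image: Sequence[float]) -> Vector:
    return [2.0 * x for x in image]


def toy_text_to_image(text: Sequence[float]) -> Vector:
    return [0.5 * x for x in text]


if __name__ == "__main__":
    loss = total_loss(
        paired=[([1.0, 2.0], [2.0, 4.0])],
        unpaired_images=[[3.0, 1.0]],
        unpaired_texts=[[4.0, 8.0]],
        image_to_text=toy_image_to_text,
        text_to_image=toy_text_to_image,
    )
    print(loss)
```

Because the two toy generators invert each other, both cycle terms vanish. Replacing either generator with an imperfect one makes the cycle loss positive, which is the signal that unpaired data contributes in this kind of setup. In a real model the text side is a discrete token sequence, so the cycle would have to be handled differently from this continuous toy.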
