Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
March 12, 2024
作者: Sahand Sharifzadeh, Christos Kaplanis, Shreya Pathak, Dharshan Kumaran, Anastasija Ilic, Jovana Mitrovic, Charles Blundell, Andrea Banino
cs.AI
Abstract
The creation of high-quality human-labeled image-caption datasets presents a
significant bottleneck in the development of Visual-Language Models (VLMs). We
propose a novel approach that leverages the strengths of Large Language Models
(LLMs) and image generation models to create synthetic image-text pairs for
efficient and effective VLM training. Our method employs a pretrained
text-to-image model to synthesize image embeddings starting from captions
generated by an LLM. These synthetic pairs are then used to train a VLM.
Extensive experiments demonstrate that the VLM trained with synthetic data
exhibits comparable performance on image captioning, while requiring a fraction
of the data used by models trained solely on human-annotated data. In
particular, we outperform the baseline by 17% through augmentation with a
synthetic dataset. Furthermore, we show that synthesizing in the image
embedding space is 25% faster than in the pixel space. This research introduces
a promising technique for generating large-scale, customizable image datasets,
leading to enhanced VLM performance and wider applicability across various
domains, all with improved data efficiency and resource utilization.
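The abstract describes a three-stage pipeline: an LLM generates captions, a pretrained text-to-image model maps those captions to image embeddings (stopping before pixel-space decoding, which is what makes synthesis faster), and the resulting synthetic pairs augment VLM training. The sketch below is a minimal, hypothetical illustration of that data flow using toy placeholder functions and stdlib-only code; it is an assumption about the structure, not the authors' implementation.

```python
# Minimal sketch (assumption, not the authors' code) of a Synth^2-style
# synthetic-data pipeline: LLM captions -> text-to-image embeddings -> pairs
# that augment VLM training. All function names and values are placeholders.

import random
from dataclasses import dataclass

EMBED_DIM = 16  # toy embedding size for illustration


@dataclass
class SyntheticPair:
    caption: str
    image_embedding: list[float]


def generate_captions_with_llm(class_names: list[str], n: int) -> list[str]:
    """Stand-in for prompting an LLM to write diverse captions."""
    templates = ["a photo of a {}", "a close-up of a {}", "a {} in the wild"]
    return [random.choice(templates).format(random.choice(class_names))
            for _ in range(n)]


def text_to_image_embedding(caption: str) -> list[float]:
    """Stand-in for a pretrained text-to-image generator stopped at the
    image-embedding stage (no pixel decoding), the step the paper reports
    as cheaper than pixel-space synthesis."""
    rng = random.Random(caption)  # deterministic per caption, for illustration
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def build_synthetic_dataset(class_names: list[str], n: int) -> list[SyntheticPair]:
    """Assemble (caption, image-embedding) pairs to mix into VLM training."""
    captions = generate_captions_with_llm(class_names, n)
    return [SyntheticPair(c, text_to_image_embedding(c)) for c in captions]


if __name__ == "__main__":
    synthetic = build_synthetic_dataset(["dog", "bicycle", "volcano"], n=5)
    # In the paper's setup, pairs like these would be combined with
    # human-annotated data when training the VLM's captioning objective.
    for pair in synthetic:
        print(pair.caption, pair.image_embedding[:3])
```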