CapsFusion: Rethinking Image-Text Data at Scale
October 31, 2023
Authors: Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, Jingjing Liu
cs.AI
Abstract
Large multimodal models demonstrate remarkable generalist ability to perform
diverse multimodal tasks in a zero-shot manner. Large-scale web-based
image-text pairs contribute fundamentally to this success, but suffer from
excessive noise. Recent studies use alternative captions synthesized by
captioning models and have achieved notable benchmark performance. However, our
experiments reveal significant Scalability Deficiency and World Knowledge Loss
issues in models trained with synthetic captions, which have been largely
obscured by their initial benchmark success. Upon closer examination, we
identify the root cause as the overly-simplified language structure and lack of
knowledge details in existing synthetic captions. To provide higher-quality and
more scalable multimodal pretraining data, we propose CapsFusion, an advanced
framework that leverages large language models to consolidate and refine
information from both web-based image-text pairs and synthetic captions.
Extensive experiments show that CapsFusion captions exhibit remarkable
all-round superiority over existing captions in terms of model performance
(e.g., 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps), sample
efficiency (requiring 11-16 times less computation than baselines), world
knowledge depth, and scalability. These effectiveness, efficiency and
scalability advantages position CapsFusion as a promising candidate for future
scaling of LMM training.
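
The core idea, fusing a noisy web caption (rich in world knowledge) with a clean synthetic caption (accurate but simplistic) via an LLM, can be sketched as prompt construction. The prompt wording and function name below are illustrative assumptions, not the paper's actual implementation:

```python
def build_fusion_prompt(raw_caption: str, synthetic_caption: str) -> str:
    """Hypothetical prompt asking an LLM to consolidate a raw web caption
    with a model-generated synthetic caption (illustrative, not the
    paper's exact prompt)."""
    return (
        "Merge the two captions below into one fluent, informative caption. "
        "Keep real-world entities and knowledge details from the raw web "
        "caption, and the accurate visual description from the synthetic "
        "caption. Drop noise and irrelevant text.\n"
        f"Raw web caption: {raw_caption}\n"
        f"Synthetic caption: {synthetic_caption}\n"
        "Fused caption:"
    )

# Illustrative example: the raw caption names the entity, the synthetic
# caption describes the scene; the fused caption should retain both.
prompt = build_fusion_prompt(
    "Emma Stone at the 2017 Oscars red carpet",
    "a woman in a gold dress standing in front of a crowd",
)
print(prompt)
```

The resulting prompt would then be sent to a large language model, which produces the fused caption used as pretraining data.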