CapsFusion: Rethinking Image-Text Data at Scale
October 31, 2023
Authors: Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, Jingjing Liu
cs.AI
Abstract
Large multimodal models demonstrate remarkable generalist ability to perform
diverse multimodal tasks in a zero-shot manner. Large-scale web-based
image-text pairs contribute fundamentally to this success, but suffer from
excessive noise. Recent studies use alternative captions synthesized by
captioning models and have achieved notable benchmark performance. However, our
experiments reveal significant Scalability Deficiency and World Knowledge Loss
issues in models trained with synthetic captions, which have been largely
obscured by their initial benchmark success. Upon closer examination, we
identify the root cause as the overly-simplified language structure and lack of
knowledge details in existing synthetic captions. To provide higher-quality and
more scalable multimodal pretraining data, we propose CapsFusion, an advanced
framework that leverages large language models to consolidate and refine
information from both web-based image-text pairs and synthetic captions.
Extensive experiments show that CapsFusion captions exhibit remarkable
all-round superiority over existing captions in terms of model performance
(e.g., 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps), sample
efficiency (requiring 11-16 times less computation than baselines), world
knowledge depth, and scalability. These effectiveness, efficiency and
scalability advantages position CapsFusion as a promising candidate for future
scaling of LMM training.
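
The core idea, fusing a noisy web caption (rich in world knowledge) with a clean synthetic caption (accurate but simplistic) via an LLM, can be sketched as prompt construction. The prompt wording and function name below are illustrative assumptions, not the paper's actual implementation:

```python
def build_fusion_prompt(raw_caption: str, synthetic_caption: str) -> str:
    """Hypothetical prompt asking an LLM to consolidate a raw web caption
    with a model-generated synthetic caption (illustrative, not the
    paper's exact prompt)."""
    return (
        "Merge the two captions below into one fluent, informative caption. "
        "Keep real-world entities and knowledge details from the raw web "
        "caption, and the accurate visual description from the synthetic "
        "caption. Drop noise and irrelevant text.\n"
        f"Raw web caption: {raw_caption}\n"
        f"Synthetic caption: {synthetic_caption}\n"
        "Fused caption:"
    )

# Illustrative example: the raw caption names the entity, the synthetic
# caption describes the scene; the fused caption should retain both.
prompt = build_fusion_prompt(
    "Emma Stone at the 2017 Oscars red carpet",
    "a woman in a gold dress standing in front of a crowd",
)
print(prompt)
```

The resulting prompt would then be sent to a large language model, which produces the fused caption used as pretraining data.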