CapsFusion: Rethinking Image-Text Data at Scale

October 31, 2023
Authors: Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, Jingjing Liu
cs.AI

Abstract

Large multimodal models demonstrate a remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success, but suffer from excessive noise. Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However, our experiments reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions, which have been largely obscured by their initial benchmark success. Upon closer examination, we identify the root cause as the overly simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. Extensive experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance (e.g., improvements of 18.8 and 18.3 in CIDEr score on COCO and NoCaps, respectively), sample efficiency (requiring 11–16 times less computation than baselines), world knowledge depth, and scalability. These effectiveness, efficiency, and scalability advantages position CapsFusion as a promising candidate for future scaling of LMM training.
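At its core, the approach pairs each noisy web caption (rich in real-world detail, poor in structure) with its synthetic counterpart (well-formed, knowledge-poor) and asks an LLM to merge them into one refined caption. Below is a minimal sketch of that fusion step, assuming a generic chat-LLM endpoint; the OpenAI Python client, the model name, and the prompt wording here are illustrative stand-ins, not the paper's exact setup.

```python
# Minimal sketch of a CapsFusion-style caption-fusion step, assuming a
# generic chat-LLM endpoint. The OpenAI client, model name, and prompt
# wording are illustrative placeholders, not the paper's exact pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FUSION_PROMPT = (
    "Merge and refine the information from the two given sentences. "
    "Sentence 1 is a raw web caption: it carries detailed real-world "
    "knowledge but may be noisy or ungrammatical. Sentence 2 is a "
    "synthetic caption: it is fluent but lacks knowledge details. "
    "Produce a single fluent caption that keeps the real-world details.\n"
    "Sentence 1: {raw}\n"
    "Sentence 2: {synthetic}"
)

def fuse_captions(raw_caption: str, synthetic_caption: str) -> str:
    """Consolidate a noisy web caption and a knowledge-poor synthetic
    caption into one refined pretraining caption via an LLM."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any instruction-following LLM could serve
        messages=[{
            "role": "user",
            "content": FUSION_PROMPT.format(
                raw=raw_caption, synthetic=synthetic_caption
            ),
        }],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    # Hypothetical example pair: noisy alt-text vs. a generic synthetic caption.
    raw = "Leonardo DiCaprio at premiere of Titanic, LA 1997 stock photo"
    synthetic = "a man in a black suit standing on a red carpet"
    print(fuse_captions(raw, synthetic))
```

The point of the fusion prompt is the division of labor the abstract describes: the web caption contributes world knowledge (names, places, events) that captioning models strip out, while the synthetic caption contributes clean visual grounding and grammar; the LLM arbitrates between the two.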