Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
October 3, 2024
Authors: Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch, Meng Cao, Yinfei Yang
cs.AI
Abstract
Recent advancements in multimodal models highlight the value of rewritten
captions for improving performance, yet key challenges remain. For example,
while synthetic captions often provide superior quality and image-text
alignment, it is not clear whether they can fully replace AltTexts: the role of
synthetic captions and their interaction with original web-crawled AltTexts in
pre-training is still not well understood. Moreover, different multimodal
foundation models may have unique preferences for specific caption formats, but
efforts to identify the optimal captions for each model remain limited. In this
work, we propose a novel, controllable, and scalable captioning pipeline
designed to generate diverse caption formats tailored to various multimodal
models. By examining caption formats ranging from Short Synthetic Captions
(SSC) to Dense Synthetic Captions (DSC+) as case studies, we systematically
explore their effects and
interactions with AltTexts across models such as CLIP, multimodal LLMs, and
diffusion models. Our findings reveal that a hybrid approach that keeps both
synthetic captions and AltTexts can outperform the use of synthetic captions
alone, improving both alignment and performance, with each model demonstrating
preferences for particular caption formats. This comprehensive analysis
provides valuable insights into optimizing captioning strategies, thereby
advancing the pre-training of multimodal foundation models.
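The hybrid strategy the abstract describes, keeping both web-crawled AltTexts and synthetic captions rather than replacing one with the other, can be pictured as a per-sample caption sampler at pre-training time. The sketch below is illustrative only: the field names (`alt_text`, `ssc`, `dsc`) and the mixing weights are assumptions for demonstration, not details taken from the paper.

```python
import random

def choose_caption(sample, weights=(0.5, 0.25, 0.25), rng=random):
    """Pick one caption per training example from a hybrid pool.

    sample:  dict with keys "alt_text", "ssc", "dsc" (illustrative names
             for the original AltText, Short Synthetic Caption, and Dense
             Synthetic Caption of one image).
    weights: sampling probabilities for (AltText, SSC, DSC+); the 50/25/25
             split here is a placeholder, not the paper's recipe.
    """
    keys = ("alt_text", "ssc", "dsc")
    return sample[rng.choices(keys, weights=weights, k=1)[0]]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for a reproducible demo
    sample = {
        "alt_text": "dog.jpg 1024x768",
        "ssc": "A dog running on grass.",
        "dsc": "A golden retriever runs across a sunlit lawn, ears flying.",
    }
    # Over many draws, captions appear roughly in proportion to the weights.
    draws = [choose_caption(sample, rng=rng) for _ in range(10000)]
    for key, text in sample.items():
        print(key, draws.count(text))
```

Because the choice is made per example (rather than per dataset), every image contributes both its noisy-but-diverse AltText and its cleaner synthetic captions over the course of training, which is one simple way to realize the "keep both" finding; per-model weight tuning would reflect the format preferences the abstract reports.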