What matters when building vision-language models?

May 3, 2024
Authors: Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
cs.AI

Abstract

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
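
Since the model is released in base, instructed, and chat variants, below is a minimal sketch of how one might load and prompt such a checkpoint with Hugging Face transformers (which added Idefics2 support in v4.40). The repo id HuggingFaceM4/idefics2-8b and the image URL are assumptions not stated in the abstract; check the Hub for the exact variant names.

```python
# Minimal sketch: one image-plus-text generation turn with an Idefics2 checkpoint.
# Assumptions: transformers >= 4.40, repo id "HuggingFaceM4/idefics2-8b",
# and a placeholder image URL -- adjust all three to the actual release.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from transformers.image_utils import load_image

MODEL_ID = "HuggingFaceM4/idefics2-8b"  # assumed repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# A single user turn containing an image slot followed by a text question.
image = load_image("https://example.com/cat.png")  # placeholder URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]

# Render the chat template, bind the image, and generate a short answer.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```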
