What matters when building vision-language models?

May 3, 2024
Authors: Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
cs.AI

Abstract

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.
