DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
September 25, 2023
Authors: Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He
cs.AI
Abstract
Most of the existing multi-modal models, hindered by their incapacity to
adeptly manage interleaved image-and-text inputs in multi-image, multi-round
dialogues, face substantial constraints in resource allocation for training and
data accessibility, impacting their adaptability and scalability across varied
interaction realms. To address this, we present the DeepSpeed-VisualChat
framework, designed to optimize Large Language Models (LLMs) by incorporating
multi-modal capabilities, with a focus on enhancing the proficiency of Large
Vision and Language Models in handling interleaved inputs. Our framework is
notable for (1) its open-source support for multi-round and multi-image
dialogues, (2) introducing an innovative multi-modal causal attention
mechanism, and (3) utilizing data blending techniques on existing datasets to
assure seamless interactions in multi-round, multi-image conversations.
Compared to existing frameworks, DeepSpeed-VisualChat shows superior
scalability up to 70B parameter language model size, representing a significant
advancement in multi-modal language models and setting a solid foundation for
future explorations.
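The abstract names a multi-modal causal attention mechanism without spelling out its masking pattern. Below is a minimal, hypothetical sketch (not the paper's implementation) of how such a mask could be constructed, under the assumption that image tokens attend only to tokens of their own image while text tokens keep the ordinary causal pattern over all preceding tokens; the function name `build_mmca_mask` and the `modality_ids` encoding are illustrative choices, not part of DeepSpeed-VisualChat's API.

```python
# Hypothetical sketch of a multi-modal causal attention mask.
# Assumption: image tokens attend only within their own image, while
# text tokens attend causally to all preceding tokens (text or image).
import torch


def build_mmca_mask(modality_ids: torch.Tensor) -> torch.Tensor:
    """modality_ids: (seq_len,) tensor with 0 for text tokens and k > 0
    for tokens belonging to the k-th image. Returns a boolean
    (seq_len, seq_len) mask where True means "may attend"."""
    seq_len = modality_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    is_image = modality_ids > 0
    same_image = modality_ids.unsqueeze(0) == modality_ids.unsqueeze(1)

    # Text rows keep the ordinary causal pattern; image rows are
    # restricted to (causally ordered) tokens of the same image.
    return torch.where(is_image.unsqueeze(1), causal & same_image, causal)


# Example sequence: [img1, img1, text, text, img2, img2, text]
modality_ids = torch.tensor([1, 1, 0, 0, 2, 2, 0])
print(build_mmca_mask(modality_ids).int())
```

The resulting mask can be added (as a large negative bias where False) to the attention logits of a standard decoder layer; this is only meant to make the interleaved image-and-text setting concrete, not to reproduce the paper's exact design.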