DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
September 25, 2023
Authors: Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He
cs.AI
Abstract
Most existing multi-modal models are hindered by their inability to handle
interleaved image-and-text inputs in multi-image, multi-round dialogues, and
they face substantial constraints in training resource allocation and data
accessibility, which limits their adaptability and scalability across varied
interaction domains. To address this, we present the DeepSpeed-VisualChat
framework, designed to optimize Large Language Models (LLMs) by incorporating
multi-modal capabilities, with a focus on enhancing the proficiency of large
vision-and-language models in handling interleaved inputs. Our framework is
notable for (1) open-source support for multi-round and multi-image
dialogues, (2) an innovative multi-modal causal attention mechanism, and
(3) data blending techniques applied to existing datasets to ensure seamless
interactions in multi-round, multi-image conversations. Compared to existing
frameworks, DeepSpeed-VisualChat shows superior scalability, supporting
language models of up to 70B parameters, representing a significant
advancement in multi-modal language models and laying a solid foundation for
future exploration.
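
The multi-modal causal attention mechanism is only named in the abstract, not defined. The sketch below illustrates one plausible reading of it for interleaved image-and-text sequences: text tokens attend causally to all earlier tokens (text or image), while image tokens attend only to tokens of their own image. The function name, the `modality_ids` encoding, and the within-image attention rule are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
import torch


def build_multimodal_causal_mask(modality_ids: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask (True = query may attend to key).

    modality_ids: 1-D LongTensor of length L. 0 marks a text token; a positive
    integer k marks a token belonging to the k-th image in the sequence.

    Assumed rules (one interpretation of multi-modal causal attention):
      * text queries attend causally to every earlier token, text or image;
      * image queries attend only to tokens of the same image.
    """
    L = modality_ids.shape[0]
    causal = torch.tril(torch.ones(L, L)).bool()            # standard causal mask
    is_text = modality_ids == 0

    # (i, j) pairs where both tokens are image tokens from the same image.
    same_image = (
        (modality_ids.unsqueeze(1) == modality_ids.unsqueeze(0))
        & ~is_text.unsqueeze(1)
        & ~is_text.unsqueeze(0)
    )

    text_rows = causal & is_text.unsqueeze(1)   # rows whose query is a text token
    image_rows = same_image                     # rows whose query is an image token
    return text_rows | image_rows


# Example: [img1, img1, txt, txt, img2, img2, txt]
ids = torch.tensor([1, 1, 0, 0, 2, 2, 0])
print(build_multimodal_causal_mask(ids).int())
```

In this toy example the final text token attends to both images and all earlier text, whereas each image's tokens attend only within that image; the resulting mask would be applied to the attention logits before the softmax in a standard transformer layer.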