VITA: Towards Open-Source Interactive Omni Multimodal LLM
August 9, 2024
Authors: Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, Ran He, Rongrong Ji, Yunsheng Wu, Caifeng Shan, Xing Sun
cs.AI
Abstract
The remarkable multimodal capabilities and interactive experience of GPT-4o
underscore their necessity in practical applications, yet open-source models
rarely excel in both areas. In this paper, we introduce VITA, the first-ever
open-source Multimodal Large Language Model (MLLM) adept at simultaneous
processing and analysis of Video, Image, Text, and Audio modalities, while
also offering an advanced multimodal interactive experience. Starting from
Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary and
then apply bilingual instruction tuning. We further endow the language model
with visual and audio capabilities through two-stage multi-task learning of
multimodal alignment and instruction tuning. VITA demonstrates robust
foundational capabilities of multilingual, vision, and audio understanding, as
evidenced by its strong performance across a range of both unimodal and
multimodal benchmarks. Beyond foundational capabilities, we have made
considerable progress in enhancing the natural multimodal human-computer
interaction experience. To the best of our knowledge, we are the first to
exploit non-awakening interaction and audio interrupt in an MLLM. VITA is the
first step for the open-source community to explore the seamless integration of
multimodal understanding and interaction. While much work remains before
VITA approaches its closed-source counterparts, we hope that its role as a
pioneer can serve as a cornerstone for subsequent research. Project Page:
https://vita-home.github.io.