VITA: Towards Open-Source Interactive Omni Multimodal LLM
August 9, 2024
Authors: Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, Ran He, Rongrong Ji, Yunsheng Wu, Caifeng Shan, Xing Sun
cs.AI
Abstract
The remarkable multimodal capabilities and interactive experience of GPT-4o
underscore their necessity in practical applications, yet open-source models
rarely excel in both areas. In this paper, we introduce VITA, the first-ever
open-source Multimodal Large Language Model (MLLM) adept at simultaneous
processing and analysis of Video, Image, Text, and Audio modalities, while also
offering an advanced multimodal interactive experience. Starting from
Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary
followed by bilingual instruction tuning. We further endow the language model
with visual and audio capabilities through two-stage multi-task learning of
multimodal alignment and instruction tuning. VITA demonstrates robust
foundational capabilities of multilingual, vision, and audio understanding, as
evidenced by its strong performance across a range of both unimodal and
multimodal benchmarks. Beyond foundational capabilities, we have made
considerable progress in enhancing the natural multimodal human-computer
interaction experience. To the best of our knowledge, we are the first to
exploit non-awakening interaction and audio interrupt in MLLMs. VITA is the
first step for the open-source community to explore the seamless integration of
multimodal understanding and interaction. While much work remains before VITA
approaches its closed-source counterparts, we hope that its
role as a pioneer can serve as a cornerstone for subsequent research. Project
Page: https://vita-home.github.io