VITA: Towards Open-Source Interactive Omni Multimodal LLM

Abstract

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore how necessary both are in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first open-source Multimodal Large Language Model (MLLM) adept at simultaneously processing and analyzing Video, Image, Text, and Audio modalities while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as the language foundation, we expand its Chinese vocabulary and then perform bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in an MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While much work remains before VITA approaches its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: https://vita-home.github.io.
