AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

Abstract

We present the Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e., text, image, video, audio, and IMU motion sensor) and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs, including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a manually collected multimodal instruction set that covers diverse topics and tasks beyond simple QA. We conduct a comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
