Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

Abstract

Although instruction-tuned large language models (LLMs) have shown remarkable capabilities across a range of NLP tasks, their effectiveness on data modalities beyond text has not been fully studied. In this work, we propose Macaw-LLM, a novel multi-modal LLM that integrates visual, audio, and textual information. Macaw-LLM consists of three main components: a modality module that encodes multi-modal data, a cognitive module that harnesses pretrained LLMs, and an alignment module that harmonizes the diverse representations. Our alignment module bridges multi-modal features to textual features, simplifying the adaptation from the modality modules to the cognitive module. In addition, we construct a large-scale multi-modal instruction dataset in the form of multi-turn dialogue, comprising 69K image instances and 50K video instances. We have made our data, code, and model publicly available, which we hope will pave the way for future research on multi-modal LLMs and extend the ability of LLMs to handle diverse data modalities and address complex real-world scenarios.
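
The three-component layout described above can be illustrated with a toy sketch. The abstract does not specify how the alignment module works, so everything here is an assumption made for illustration: the function names (`encode_modality`, `align`, `build_llm_input`), the random stand-in encoder, and the choice to project features to the text embedding width and then soft-attend over a token embedding table. It is not the authors' implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Toy parameter or feature matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def encode_modality(num_tokens, feat_dim, rng):
    """Hypothetical modality module: a real system would run an image,
    audio, or video encoder here; this stand-in returns random features."""
    return random_matrix(num_tokens, feat_dim, rng)


def align(features, projection, token_embeddings):
    """Hypothetical alignment module (an assumption, not the paper's spec):
    project features to the text embedding width, then replace each one
    with an attention-weighted mix of the LLM's token embeddings."""
    projected = matmul(features, projection)                 # (n, d_text)
    transposed = [list(col) for col in zip(*token_embeddings)]  # (d_text, vocab)
    scores = matmul(projected, transposed)                   # (n, vocab)
    scale = math.sqrt(len(token_embeddings[0]))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, token_embeddings)                 # (n, d_text)


def build_llm_input(aligned_by_modality, text_embeddings):
    """Input to the cognitive module (the LLM): aligned multi-modal
    vectors placed in front of the instruction's text embeddings."""
    sequence = []
    for aligned in aligned_by_modality:
        sequence.extend(aligned)
    sequence.extend(text_embeddings)
    return sequence
```

Because each aligned vector in this sketch is a convex combination of token embeddings, it lies in the same space as the text inputs, which is one simple way to read "bridging multi-modal features to textual features".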

