

Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

June 15, 2023
Authors: Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, Zhaopeng Tu
cs.AI

Abstract

Although instruction-tuned large language models (LLMs) have exhibited remarkable capabilities across various NLP tasks, their effectiveness on data modalities beyond text has not been fully studied. In this work, we propose Macaw-LLM, a novel multi-modal LLM that seamlessly integrates visual, audio, and textual information. Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing the diverse representations. Our novel alignment module seamlessly bridges multi-modal features to textual features, simplifying the adaptation from the modality modules to the cognitive module. In addition, we construct a large-scale multi-modal instruction dataset organized as multi-turn dialogues, including 69K image instances and 50K video instances. We have made our data, code, and model publicly available, which we hope will pave the way for future research on multi-modal LLMs and expand the capabilities of LLMs to handle diverse data modalities and complex real-world scenarios.
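To make the alignment idea concrete, below is a minimal PyTorch sketch of one plausible way a module like this could bridge modality features to an LLM's textual embedding space: encoder outputs are projected to the LLM's hidden size and expressed as soft mixtures over the LLM's token-embedding table. This is an illustrative sketch under stated assumptions, not the authors' released implementation; names such as `AlignmentModule`, `d_modality`, and `d_llm`, and the specific attention-to-embedding-table mechanism, are assumptions for demonstration.

```python
# Hypothetical sketch of an alignment module (not the official Macaw-LLM code).
# Modality encoder features are projected into the LLM's embedding dimension,
# then attended against the LLM's token-embedding table, producing pseudo-token
# embeddings the cognitive module (the pretrained LLM) can consume with text.

import torch
import torch.nn as nn


class AlignmentModule(nn.Module):
    """Maps encoder features (image/audio/video) into the LLM embedding space."""

    def __init__(self, d_modality: int, d_llm: int, llm_embed: nn.Embedding):
        super().__init__()
        self.proj = nn.Linear(d_modality, d_llm)  # modality dim -> LLM dim
        self.llm_embed = llm_embed                # the LLM's token embeddings

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq, d_modality) from a modality encoder
        q = self.proj(feats)                      # (batch, seq, d_llm)
        e = self.llm_embed.weight                 # (vocab, d_llm)
        attn = torch.softmax(q @ e.T / e.shape[-1] ** 0.5, dim=-1)
        return attn @ e                           # soft mixture of LLM token embeddings


# Usage: concatenate the aligned modality tokens with text embeddings before
# feeding the combined sequence to the LLM. Sizes below are assumptions
# (e.g. a LLaMA-7B-like vocabulary/hidden size, CLIP-like patch features).
llm_embed = nn.Embedding(32000, 4096)
align = AlignmentModule(d_modality=768, d_llm=4096, llm_embed=llm_embed)
image_feats = torch.randn(2, 196, 768)            # dummy visual patch features
aligned = align(image_feats)                      # (2, 196, 4096), LLM-ready
```

Grounding modality features directly in the LLM's own embedding table, rather than injecting an arbitrary projected vector, is one way to keep the adapted inputs close to the distribution the pretrained cognitive module already understands.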