

AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

September 27, 2023
Authors: Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi, Anuj Kumar
cs.AI

Abstract

We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a multimodal instruction set manually collected to cover diverse topics and tasks beyond simple QAs. We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
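The abstract describes an aligner module that maps the outputs of modality-specific encoders into the LLM's joint text space. As an illustration only, here is a minimal PyTorch sketch of one way such an aligner could be wired up: learned-query cross-attention pools the encoder features into a fixed number of "soft tokens", which are then prepended to the text embeddings of a frozen LLM. The class name ModalityAligner, the dimensions, and the token count are assumptions for the sketch, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code): project frozen modality-encoder
# features into the LLM token-embedding space as a fixed number of soft tokens.
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Pools modality-encoder features into `num_tokens` LLM-space tokens."""
    def __init__(self, encoder_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        # Hypothetical design: linear projection plus learned queries that
        # cross-attend over the encoder sequence.
        self.proj = nn.Linear(encoder_dim, llm_dim)
        self.queries = nn.Parameter(torch.randn(num_tokens, llm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, encoder_feats: torch.Tensor) -> torch.Tensor:
        # encoder_feats: (batch, seq_len, encoder_dim) from a frozen encoder
        kv = self.proj(encoder_feats)                              # (B, S, llm_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)  # (B, T, llm_dim)
        tokens, _ = self.attn(q, kv, kv)                           # (B, T, llm_dim)
        return tokens

# Usage: prepend the aligned tokens to the text-prompt embeddings of a frozen LLM.
aligner = ModalityAligner(encoder_dim=1024, llm_dim=4096, num_tokens=32)
image_feats = torch.randn(2, 257, 1024)   # e.g. patch features from a frozen image encoder
text_embeds = torch.randn(2, 16, 4096)    # embeddings of the text prompt
llm_inputs = torch.cat([aligner(image_feats), text_embeds], dim=1)  # (2, 48, 4096)
```

In this reading, only the aligner (and, per the abstract, later instruction-tuned components) is trained, while the modality encoders and the base LLM's text-reasoning abilities are reused.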