
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

September 27, 2023
Authors: Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi, Anuj Kumar
cs.AI

Abstract

We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a multimodal instruction set manually collected to cover diverse topics and tasks beyond simple QAs. We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
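The abstract describes a pre-trained aligner module that maps modality-specific encoder outputs into the LLM's joint textual space. The sketch below illustrates that general idea in PyTorch; the class name `ModalityAligner`, the resampler-style cross-attention design, and all dimensions and hyperparameters are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Hypothetical aligner: projects features from a frozen modality encoder
    (image, video, audio, IMU, ...) into the LLM's token-embedding space so
    they can be prepended to the text prompt as "soft" tokens."""

    def __init__(self, encoder_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        # Learned queries pool the variable-length encoder sequence into a
        # fixed number of aligned tokens (a plain linear projection also works).
        self.queries = nn.Parameter(torch.randn(num_tokens, llm_dim))
        self.to_llm = nn.Linear(encoder_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, encoder_feats: torch.Tensor) -> torch.Tensor:
        # encoder_feats: (batch, seq_len, encoder_dim) from a frozen encoder
        kv = self.to_llm(encoder_feats)                        # (B, S, llm_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        aligned, _ = self.attn(q, kv, kv)                      # (B, num_tokens, llm_dim)
        return aligned  # concatenated with text embeddings before the LLM
```

Concatenating these aligned tokens with the text-token embeddings lets the language model attend to non-text inputs without changing its tokenizer, which is the sense in which modality signals are "converted to the joint textual space" in the abstract.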