AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

Abstract

We present the Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (text, image, video, audio, and IMU motion sensor data) and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs, including LLaMA-2 (70B), and converts modality-specific signals into the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model on a manually collected multimodal instruction set that covers diverse topics and tasks beyond simple question answering. We conduct a comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
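As a rough illustration of what an aligner module might do, the sketch below projects modality encoder features into the dimensionality of an LLM's token embeddings and prepends them to the text embeddings. The abstract does not specify the aligner's form, so everything here is an assumption made for illustration: the single linear projection, the random (untrained) weights, the toy dimensions, and the names `make_linear_aligner` and `build_llm_input` are all hypothetical and are not taken from the AnyMAL implementation.

```python
import random


def make_linear_aligner(in_dim, out_dim, seed=0):
    """Return a function that linearly projects an in_dim vector to out_dim.

    Assumption: the weights are random placeholders. In a real system they
    would be learned during aligner pre-training.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def project(feature):
        if len(feature) != in_dim:
            raise ValueError(f"expected {in_dim} values, got {len(feature)}")
        # One dot product per output dimension (a bias-free linear layer).
        return [sum(w * x for w, x in zip(row, feature)) for row in weights]

    return project


def build_llm_input(modality_features, text_embeddings, aligner):
    """Project each modality feature and prepend the results to the text embeddings."""
    projected = [aligner(f) for f in modality_features]
    return projected + text_embeddings


# Hypothetical sizes: 4-dim encoder features, 6-dim LLM token embeddings.
ENCODER_DIM, LLM_DIM = 4, 6
aligner = make_linear_aligner(ENCODER_DIM, LLM_DIM)

image_features = [[0.5, -1.0, 0.25, 2.0], [1.0, 0.0, -0.5, 0.3]]  # 2 "image tokens"
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]              # 3 text tokens

sequence = build_llm_input(image_features, text_embeddings, aligner)
print(len(sequence), len(sequence[0]))  # 5 6
```

The point of the sketch is only that, once projected, modality features have the same shape as token embeddings, so the language model can consume them as part of one sequence.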
