LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

Abstract

In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, departing from the autoregressive paradigm that dominates current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and an MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results. First, LLaDA-V demonstrates promising multimodal performance even though its language model is weaker on purely textual tasks than counterparts such as LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive with LLaMA3-V across multimodal tasks and shows better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting that its architecture is effective for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and code: https://ml-gsai.github.io/LLaDA-V-demo/.
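To make the connector idea concrete, the following is a minimal sketch of how visual features might be projected into a language embedding space and joined with text embeddings. It is an illustration under stated assumptions, not the LLaDA-V implementation: the two-layer MLP with a GELU activation, the toy dimensions, and the names `MLPConnector` and `build_multimodal_sequence` are all hypothetical choices that the abstract does not confirm, and the vision encoder and diffusion language model are replaced by random vectors.

```python
import math
import random


def gelu(x: float) -> float:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (weight has shape out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Create a randomly initialised weight matrix and a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


class MLPConnector:
    """Hypothetical two-layer MLP mapping vision features to the text embedding size."""

    def __init__(self, vision_dim, text_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, text_dim, rng)
        self.w2, self.b2 = init_linear(text_dim, text_dim, rng)

    def __call__(self, visual_features):
        projected = []
        for feat in visual_features:
            hidden = [gelu(h) for h in linear(feat, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


def build_multimodal_sequence(visual_features, text_embeddings, connector):
    """Project the visual features, then prepend them to the text embeddings."""
    return connector(visual_features) + text_embeddings


# Toy sizes (assumptions): 3 image patches of width 6, 5 text tokens of width 4.
VISION_DIM, TEXT_DIM = 6, 4
rng = random.Random(1)
patches = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(3)]
text = [[rng.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(5)]

sequence = build_multimodal_sequence(patches, text, MLPConnector(VISION_DIM, TEXT_DIM))
print(len(sequence), len(sequence[0]))
```

In this sketch the language model would receive a single sequence in which every element, visual or textual, has the same embedding width. That shared width is what allows one model to attend over both modalities.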
