LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Abstract

Modern Vision-Language Models (VLMs) can solve a wide range of tasks that require visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs such as LLaVA struggle in both respects. Discrete diffusion models (DMs) offer a promising alternative: parallel decoding enables faster inference, and bidirectional context enables controllable generation through text infilling. Although DMs are effective in language-only settings, their potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address the challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multimodal benchmarks such as MMMU, while offering the unique advantages of DMs, including a flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with a 1.92x speedup. On bidirectional tasks, it achieves a +59% improvement on Constrained Poem Completion. These results demonstrate that LaViDa is a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.
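The abstract names complementary masking without defining it. One plausible reading, assumed here and not confirmed by the text above, is that each training sequence yields two masked views whose masks are disjoint and together cover every token, so that every token contributes to the loss in one of the views. The sketch below illustrates only that idea; the `MASK` token, the function name `complementary_masks`, and the `mask_ratio` parameter are hypothetical and not taken from LaViDa.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration only


def complementary_masks(tokens, mask_ratio, rng):
    """Return two masked views of `tokens` with complementary masks.

    Assumption: "complementary" means a position masked in the first
    view is visible in the second, and vice versa.
    """
    n = len(tokens)
    k = round(n * mask_ratio)
    # Positions masked in the first view; all others are masked in the second.
    masked_idx = set(rng.sample(range(n), k))
    view_a = [MASK if i in masked_idx else t for i, t in enumerate(tokens)]
    view_b = [t if i in masked_idx else MASK for i, t in enumerate(tokens)]
    return view_a, view_b


tokens = "a dog runs across the park".split()
view_a, view_b = complementary_masks(tokens, 0.5, random.Random(0))
print(view_a)
print(view_b)
```

Under this assumption, each position is masked in exactly one of the two views, so no token is left without a training signal for a given sample.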

