Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

Abstract

In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and with a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLMs can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step and significantly reduces the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, the number of iterations needed by Dimple is only $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique used in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLMs and enhances their inference efficiency and controllability. Code and models are available at https://github.com/yu-rp/Dimple.
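
The abstract describes confident decoding only at a high level: the number of tokens committed per step varies, so fewer forward passes are needed than there are tokens. The sketch below is one way such a loop could look, not the released implementation. It assumes a threshold rule (commit every masked position whose confidence clears `threshold`, and fall back to the single most confident position when none does). The names `confident_decode`, `toy_predictor`, and `threshold` are hypothetical, and the predictor is a random stand-in for a real model.

```python
import random
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decoded yet

# Hypothetical predictor type: given the partial sequence, return a
# (token, confidence) guess for every position.
Predictor = Callable[[List[Optional[int]]], List[Tuple[int, float]]]


def confident_decode(predict: Predictor, length: int, threshold: float = 0.9):
    """Fill `length` masked positions; return (tokens, forward_passes)."""
    seq: List[Optional[int]] = [MASK] * length
    iterations = 0
    while any(tok is MASK for tok in seq):
        guesses = predict(seq)  # one forward pass over the whole sequence
        iterations += 1
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Assumed rule: commit every masked position above the threshold.
        chosen = [i for i in masked if guesses[i][1] >= threshold]
        if not chosen:
            # Fallback so every pass makes progress: commit the best guess.
            chosen = [max(masked, key=lambda i: guesses[i][1])]
        for i in chosen:
            seq[i] = guesses[i][0]
    return seq, iterations


def toy_predictor(seed: int = 0) -> Predictor:
    """Stand-in for a model: random confidences, token equals position index."""
    rng = random.Random(seed)

    def predict(seq: List[Optional[int]]) -> List[Tuple[int, float]]:
        return [(i, rng.random()) for i in range(len(seq))]

    return predict


if __name__ == "__main__":
    tokens, steps = confident_decode(toy_predictor(), length=30, threshold=0.7)
    print(f"decoded {len(tokens)} tokens in {steps} forward passes")
```

In this sketch a threshold of 0 commits everything in a single pass, while a threshold that is never reached degenerates to one token per pass, which matches the autoregressive iteration count mentioned in the abstract. Intermediate thresholds land between those two extremes.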
