Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

Abstract

In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and with a similar training pipeline as LLaVA-NEXT. Dimple-7B surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that a DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step and significantly reduces the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, the number of iterations Dimple needs is only $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique from autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLMs and enhances their inference efficiency and controllability. Code and models are available at https://github.com/yu-rp/Dimple.
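The abstract describes confident decoding only at a high level: the number of tokens committed per step varies, so the step count can fall below the response length. The sketch below is one plausible reading, not the authors' implementation. It assumes a confidence threshold rule (commit every masked position whose confidence reaches the threshold, otherwise commit the single most confident one), and the names `select_positions`, `confident_decode`, `toy_predict`, and `TARGET` are hypothetical. The toy predictor stands in for a real model and simply becomes confident about a position once its neighbours are decoded.

```python
from typing import Callable, List, Optional, Sequence, Tuple

MASK = None  # marks a position that has not been decoded yet

# Hypothetical model interface: for the current partial sequence, return one
# (token, confidence) guess per position.
Predictor = Callable[[Sequence[Optional[str]]], List[Tuple[str, float]]]


def select_positions(
    confidences: Sequence[float], masked: Sequence[int], threshold: float
) -> List[int]:
    """Choose which masked positions to commit in this step.

    Assumed rule (not stated in the abstract): commit every masked position
    whose confidence reaches the threshold; if none does, commit the single
    most confident one so that every step makes progress.
    """
    chosen = [i for i in masked if confidences[i] >= threshold]
    if not chosen:
        chosen = [max(masked, key=lambda i: confidences[i])]
    return chosen


def confident_decode(
    predict: Predictor, length: int, threshold: float = 0.9
) -> Tuple[List[Optional[str]], int]:
    """Decode `length` tokens and return the sequence and the step count."""
    seq: List[Optional[str]] = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        guesses = predict(seq)  # one forward pass over all positions
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        confidences = [conf for _, conf in guesses]
        for i in select_positions(confidences, masked, threshold):
            seq[i] = guesses[i][0]
        steps += 1
    return seq, steps


TARGET = ["a", "photo", "of", "a", "small", "dog"]


def toy_predict(seq: Sequence[Optional[str]]) -> List[Tuple[str, float]]:
    """Toy stand-in for a model: even positions are always confident, odd
    positions become confident once all of their neighbours are decoded."""
    guesses = []
    for i, tok in enumerate(TARGET):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        anchored = i % 2 == 0 or all(seq[j] is not MASK for j in neighbours)
        guesses.append((tok, 0.95 if anchored else 0.5))
    return guesses


if __name__ == "__main__":
    decoded, steps = confident_decode(toy_predict, len(TARGET), threshold=0.9)
    print(decoded, steps)  # 2 steps for 6 tokens in this toy setup
```

Setting `threshold` above every confidence value makes the loop commit exactly one token per step, which reproduces the one-iteration-per-token behaviour the abstract attributes to autoregressive generation. The speedup in this toy run comes entirely from the made-up confidence pattern and says nothing about the ratio reported for Dimple.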
