
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

May 22, 2025
作者: Runpeng Yu, Xinyin Ma, Xinchao Wang
cs.AI

Abstract

In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and with a training pipeline similar to that of LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLMs can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length; with confident decoding, Dimple requires only about $\frac{\text{response length}}{3}$ iterations. We also re-implement the prefilling technique used in autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its responses using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLMs and enhances their inference efficiency and controllability. Code and models are available at https://github.com/yu-rp/Dimple.
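The core of confident decoding, as described above, is that each diffusion step finalizes however many tokens the model is already confident about, rather than a fixed number. A minimal sketch of one such step is given below; the function name, the threshold value, and the fall-back rule (commit the single most confident masked position when nothing clears the threshold, so the loop always progresses) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def confident_decode_step(probs, decoded, threshold=0.9):
    """One hypothetical confident-decoding step.

    probs:   (seq_len, vocab_size) per-position token probabilities.
    decoded: (seq_len,) boolean mask of already-finalized positions.
    Finalizes every still-masked position whose top-token probability
    exceeds `threshold`; if none qualify, commits the single most
    confident masked position so decoding always makes progress.
    Returns (argmax tokens, updated mask, newly finalized positions).
    """
    conf = probs.max(axis=-1)          # top-token confidence per position
    tokens = probs.argmax(axis=-1)     # greedy token choice per position
    newly = (~decoded) & (conf >= threshold)
    if not newly.any():
        # fallback: commit at least the most confident masked position
        masked_idx = np.flatnonzero(~decoded)
        newly[masked_idx[conf[masked_idx].argmax()]] = True
    return tokens, decoded | newly, newly
```

Because several positions can clear the threshold in a single step, the total number of forward passes can fall well below the response length, which is the source of the roughly 3x reduction in iterations the abstract reports.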
