Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding

Abstract

In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset as LLaVA-NEXT with a similar training pipeline. Dimple-7B surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that a DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step and thereby significantly reduces the number of generation iterations. In autoregressive models, the number of forward iterations during generation equals the response length. With confident decoding, the number of iterations Dimple needs can be as low as $\frac{\text{response length}}{3}$. We also re-implement the prefilling technique from autoregressive models and demonstrate that it does not significantly impact performance on most benchmark evaluations, while offering a speedup of 1.5x to 7x. Additionally, we explore Dimple's capability to precisely control its response using structure priors. These priors enable structured responses in a manner distinct from instruction-based or chain-of-thought prompting, and allow fine-grained control over response format and length, which is difficult to achieve in autoregressive models. Overall, this work validates the feasibility and advantages of DMLLMs and enhances their inference efficiency and controllability. Code and models are available at https://github.com/yu-rp/Dimple.
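The abstract describes confident decoding only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes a simple rule: at each step, commit every still-masked position whose confidence reaches a threshold, and if none does, commit the single most confident one so that decoding always progresses. The names `confident_decode`, `make_toy_predictor`, `predict`, and `threshold` are hypothetical, and the toy predictor stands in for a real model's forward pass.

```python
MASK = None  # placeholder for a position that has not been decoded yet


def confident_decode(predict, length, threshold=0.9):
    """Fill `length` masked positions, committing all confident tokens per step.

    `predict(tokens)` is a hypothetical model call that returns one
    (token, confidence) pair per position. Returns (tokens, steps).
    """
    tokens = [MASK] * length
    steps = 0
    while any(t is MASK for t in tokens):
        preds = predict(tokens)  # one forward pass over the whole response
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # The number of tokens committed varies from step to step.
        chosen = [i for i in masked if preds[i][1] >= threshold]
        if not chosen:
            # Assumed fallback: commit the single most confident position.
            chosen = [max(masked, key=lambda i: preds[i][1])]
        for i in chosen:
            tokens[i] = preds[i][0]
        steps += 1
    return tokens, steps


def make_toy_predictor(target):
    """Toy stand-in for a model: even positions are confident immediately,
    odd positions become confident once a neighbour has been decoded."""
    def predict(tokens):
        preds = []
        for i, tok in enumerate(target):
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(target)]
            has_context = any(tokens[j] is not MASK for j in neighbours)
            conf = 0.95 if (i % 2 == 0 or has_context) else 0.5
            preds.append((tok, conf))
        return preds
    return predict


if __name__ == "__main__":
    target = "a dog runs in the park".split()
    tokens, steps = confident_decode(make_toy_predictor(target), len(target))
    print(tokens, steps)
```

With this toy predictor, the six-token response is completed in two forward passes, whereas token-by-token autoregressive generation would need six. When no position is ever confident, the fallback commits one token per step, so the iteration count degrades to the response length.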
