

Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models

December 16, 2025
Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
cs.AI

Abstract

Masked Discrete Diffusion Models (MDMs) have achieved strong performance across a wide range of multimodal tasks, including image understanding, generation, and editing. However, their inference speed remains suboptimal due to the need to repeatedly process redundant masked tokens at every sampling step. In this work, we propose Sparse-LaViDa, a novel modeling framework that dynamically truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. To preserve generation quality, we introduce specialized register tokens that serve as compact representations for the truncated tokens. Furthermore, to ensure consistency between training and inference, we design a specialized attention mask that faithfully matches the truncated sampling procedure during training. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks including text-to-image generation, image editing, and mathematical reasoning, while maintaining generation quality.
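The abstract describes truncating redundant masked tokens at each denoising step and summarizing the dropped tokens with register tokens. The sketch below illustrates one possible form of such a truncated sampling step in PyTorch. It is a minimal, hypothetical illustration: MASK_ID, REGISTER_ID, the keep-ratio heuristic, and the model interface are assumptions for exposition and are not taken from the Sparse-LaViDa or LaViDa-O code; the specialized training-time attention mask mentioned in the abstract is omitted.

```python
# Hypothetical sketch of a truncated MDM sampling step with a register token.
# All identifiers (MASK_ID, REGISTER_ID, keep_ratio, model interface) are
# illustrative assumptions, not the actual Sparse-LaViDa implementation.
import torch

MASK_ID = 0      # assumed id of the [MASK] token
REGISTER_ID = 1  # assumed id of the register token standing in for dropped masks


def truncate_masked_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5):
    """Keep only a fraction of the masked positions for this step; the dropped
    masked positions are represented by a single appended register token."""
    masked = (tokens == MASK_ID).nonzero(as_tuple=True)[0]
    n_keep = max(1, int(keep_ratio * masked.numel())) if masked.numel() else 0
    kept_masked = masked[:n_keep]  # toy heuristic: keep the earliest masked positions
    visible = (tokens != MASK_ID).nonzero(as_tuple=True)[0]
    keep_idx = torch.cat([visible, kept_masked]).sort().values
    register = torch.tensor([REGISTER_ID], dtype=tokens.dtype)
    truncated = torch.cat([tokens[keep_idx], register])
    return truncated, keep_idx


def sampling_step(model, tokens: torch.Tensor, keep_ratio: float = 0.5):
    """One accelerated denoising step: run the denoiser on the shorter truncated
    sequence and write predictions back only at the kept masked positions."""
    truncated, keep_idx = truncate_masked_tokens(tokens, keep_ratio)
    logits = model(truncated.unsqueeze(0)).squeeze(0)  # (len(truncated), vocab_size)
    preds = logits.argmax(dim=-1)
    out = tokens.clone()
    kept_mask_pos = keep_idx[tokens[keep_idx] == MASK_ID]                    # indices in `tokens`
    trunc_mask_pos = (truncated[:-1] == MASK_ID).nonzero(as_tuple=True)[0]   # indices in `truncated`
    out[kept_mask_pos] = preds[trunc_mask_pos]
    return out
```

As a quick sanity check, one can run `sampling_step` with a stand-in denoiser, e.g. `tokens = torch.tensor([5, MASK_ID, 7, MASK_ID, MASK_ID])` and `model = lambda x: torch.randn(x.shape[0], x.shape[1], 100)`: only about half of the masked positions are processed (and filled) in this step, which is the source of the speedup the paper reports, while the register token gives the model a compact stand-in for the truncated context.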