
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

December 15, 2025
Authors: Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li
cs.AI

Abstract

Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.
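
To make the decoding loop concrete, below is a minimal sketch of the iterative plan-and-infill procedure as described in the abstract. The `planner.select_slots` and `infiller.fill` interfaces, the `mask_id` placeholder, and all parameter names are hypothetical stand-ins rather than the paper's actual API; the sketch only illustrates the alternation between diffusion-style slot selection and parallel autoregressive slot infilling.

```python
import torch

def plan_and_infill_decode(planner, infiller, prompt_ids,
                           num_slots=16, slot_len=4, mask_id=0, max_rounds=None):
    """Hypothetical sketch of iterative plan-and-infill decoding over slots."""
    # The response starts as num_slots fixed-length slots of [MASK] tokens.
    slots = torch.full((num_slots, slot_len), mask_id)
    undecided = set(range(num_slots))
    rounds = 0
    while undecided and (max_rounds is None or rounds < max_rounds):
        # Planning step (diffusion-based): choose a subset of still-masked slots
        # that are weakly dependent on each other given the current partial sequence.
        chosen = planner.select_slots(prompt_ids, slots, candidates=sorted(undecided))
        # Infilling step: decode the chosen slots in parallel, each slot filled
        # autoregressively token by token (KV cache reusable under a causal framework).
        filled = infiller.fill(prompt_ids, slots, slot_indices=chosen)  # (len(chosen), slot_len)
        for i, slot_idx in enumerate(chosen):
            slots[slot_idx] = filled[i]
            undecided.discard(slot_idx)
        rounds += 1
    # Concatenate slots back into a flat token sequence.
    return slots.reshape(-1)
```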