

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

December 15, 2025
Authors: Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li
cs.AI

Abstract

Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs, with 34% performance gains and an average speedup of over 18×, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.
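
To make the decoding loop concrete, below is a minimal, hypothetical sketch of the iterative plan-and-infill procedure described in the abstract: a planning step selects a set of weakly dependent slots, and an infilling step then decodes each selected slot token by token. All names (`plan_step`, `infill_slot`, `SLOT_LEN`, etc.) and the random stand-ins for model calls are illustrative assumptions, not the paper's actual implementation or API.

```python
# Minimal, hypothetical sketch of an iterative "plan-and-infill" decoding loop.
# Names (plan_step, infill_slot, SLOT_LEN, ...) and the random stand-ins for
# model calls are illustrative assumptions, not the paper's actual API.
import random

SLOT_LEN = 4        # each slot is a fixed-length, contiguous sub-sequence
NUM_SLOTS = 8       # number of slots in the output sequence
SLOTS_PER_STEP = 2  # how many weakly dependent slots to fill per iteration
MASK_ID = -1        # placeholder for not-yet-decoded tokens

def plan_step(seq, undecoded):
    """Planning step: score the still-masked slots and pick a set of weakly
    dependent ones. Random scores stand in for the diffusion-based planner."""
    scores = {s: random.random() for s in undecoded}
    ranked = sorted(undecoded, key=lambda s: scores[s], reverse=True)
    return ranked[:SLOTS_PER_STEP]

def infill_slot(seq, slot):
    """Infilling step: decode one slot's tokens left to right, conditioned on
    everything decoded so far. Random token ids stand in for model sampling."""
    start = slot * SLOT_LEN
    for t in range(start, start + SLOT_LEN):
        seq[t] = random.randint(1, 99)
    return seq

def generate():
    seq = [MASK_ID] * (NUM_SLOTS * SLOT_LEN)
    undecoded = list(range(NUM_SLOTS))
    while undecoded:
        chosen = plan_step(seq, undecoded)   # plan: select weakly dependent slots
        for slot in chosen:                  # infill: in practice these slots are
            seq = infill_slot(seq, slot)     # decoded in parallel, not sequentially
        undecoded = [s for s in undecoded if s not in chosen]
    return seq

if __name__ == "__main__":
    print(generate())
```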