Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing
February 2, 2026
Authors: Lingkun Long, Yushi Huang, Shihao Bai, Ruihao Gong, Jun Zhang, Ao Zhou, Jianlei Yang
cs.AI
Abstract
Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability under a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits inference efficiency. Although sparse attention is promising, existing methods still fall short, because they must estimate attention importance for tokens yet to be decoded while the positions of unmasked tokens remain unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence correlates strongly across adjacent steps, we first design a past-confidence-guided indicator to predict unmasked regions. Building on this, we propose a sink-aware pruning strategy that accurately estimates and removes redundant attention computation while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses the identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method delivers more than a 29× lossless speedup at a 32K context length. The code is publicly available at: https://github.com/Longxmas/Focus-dLLM
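To make the two ideas above concrete, the following is a minimal, self-contained PyTorch sketch of (i) predicting soon-to-be-unmasked positions from the previous step's token confidence and (ii) building a sink-aware keep-mask for sparse attention that can be reused across layers. All function names, tensor shapes, and hyperparameters (k, num_sinks) are illustrative assumptions made here for exposition; the actual released implementation is in the repository linked above.

```python
import torch

def predict_unmasked(prev_conf: torch.Tensor, is_masked: torch.Tensor, k: int) -> torch.Tensor:
    """Guess which masked positions are likely to be unmasked next, using the
    previous diffusion step's per-token confidence as a proxy (hinging on the
    observation that confidence correlates strongly across adjacent steps)."""
    scores = prev_conf.masked_fill(~is_masked, float("-inf"))  # only masked positions compete
    pred = torch.zeros_like(is_masked)
    pred[scores.topk(k).indices] = True
    return pred  # [L] bool: predicted soon-to-be-unmasked region

def sink_aware_keep_mask(attn_probs: torch.Tensor, pred_unmasked: torch.Tensor,
                         num_sinks: int = 4) -> torch.Tensor:
    """Keep the key positions that absorb the most attention mass ("sinks")
    plus the predicted unmasked region; everything else is pruned. The same
    mask can be reused across layers if sink locations are cross-layer consistent."""
    col_mass = attn_probs.mean(dim=0)            # [L] attention mass received by each key
    keep = pred_unmasked.clone()
    keep[col_mass.topk(num_sinks).indices] = True
    return keep                                  # [L] bool keep-mask over keys/values

# Toy usage on random data
L = 16
prev_conf = torch.rand(L)                        # confidence from the previous step
is_masked = torch.rand(L) > 0.5                  # which tokens are still masked
attn = torch.softmax(torch.randn(L, L), dim=-1)  # probe attention map
pred = predict_unmasked(prev_conf, is_masked, k=3)
keep = sink_aware_keep_mask(attn, pred, num_sinks=2)
print(keep)  # keys/values outside `keep` would be skipped by the sparse attention kernel
```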