Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models
November 26, 2025
Authors: Julianna Piskorz, Cristina Pinneri, Alvaro Correia, Motasem Alfarra, Risheek Garrepalli, Christos Louizos
cs.AI
Abstract
Masked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective and bidirectional attention mechanism, MDLMs, similarly to ARLMs, exhibit a strong locality bias: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, we show that appending the large number of mask tokens required for generation can significantly degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, reducing the model's ability to process relevant information. To address this, we introduce a mask-agnostic loss function that encourages predictions to remain invariant to the number of appended masks. Fine-tuning with this objective substantially mitigates the distracting effect of masks, improving the robustness of MDLMs. Overall, our findings reveal critical limitations of the current MDLM training paradigm and provide actionable insights for building diffusion-based language models with stronger context comprehension.
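The abstract does not spell out the exact form of the mask-agnostic loss. As a rough illustration only, the sketch below shows one plausible instantiation in PyTorch: a consistency penalty that pushes the model's predictions at the positions of interest to agree regardless of how many mask tokens are appended. All names here (mask_agnostic_loss, answer_positions, mask_id) and the assumption that the model returns per-token logits are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def mask_agnostic_loss(model, input_ids, answer_positions,
                       n_masks_a=16, n_masks_b=256, mask_id=0):
    """Hypothetical consistency loss: the predictive distribution at the
    answer positions should not change when extra [MASK] tokens are
    appended to the input (assumes `model` returns per-token logits)."""
    def with_appended_masks(n):
        masks = torch.full((input_ids.size(0), n), mask_id,
                           dtype=input_ids.dtype, device=input_ids.device)
        return torch.cat([input_ids, masks], dim=1)

    # Same context, two different amounts of appended masks.
    logits_a = model(with_appended_masks(n_masks_a))[:, answer_positions, :]
    logits_b = model(with_appended_masks(n_masks_b))[:, answer_positions, :]

    # Penalise divergence between the two predictive distributions.
    log_p_a = F.log_softmax(logits_a, dim=-1)
    log_p_b = F.log_softmax(logits_b, dim=-1)
    return F.kl_div(log_p_b, log_p_a, log_target=True, reduction="batchmean")
```

Under this reading, fine-tuning would add such a term to the standard denoising objective, so that varying the number of appended masks no longer shifts the model's predictions over the given context.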