Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models
November 26, 2025
Authors: Julianna Piskorz, Cristina Pinneri, Alvaro Correia, Motasem Alfarra, Risheek Garrepalli, Christos Louizos
cs.AI
Abstract
Masked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective and bidirectional attention mechanism, MDLMs exhibit a strong locality bias similar to that of ARLMs: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, we show that appending a large number of mask tokens, as required for generation, can significantly degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, reducing the model's ability to process relevant information. To address this, we introduce a mask-agnostic loss function that encourages predictions to remain invariant to the number of appended masks. Fine-tuning with this objective substantially mitigates the distracting effect of masks, improving the robustness of MDLMs. Overall, our findings reveal critical limitations of the current MDLM training paradigm and provide actionable insights for building diffusion-based language models with stronger context comprehension.
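As a rough illustration of the mask-agnostic idea described in the abstract, the sketch below pairs a standard denoising (cross-entropy) term with a consistency penalty that discourages the predictive distribution over the answer positions from changing as more mask tokens are appended. Everything here is an assumption for illustration, not the paper's exact formulation: the names (`model`, `mask_id`, `mask_agnostic_loss`), the Hugging-Face-style `model(...).logits` interface, the specific mask counts, and the choice of a KL penalty.

```python
# Minimal sketch of a mask-agnostic consistency loss for an MDLM.
# Hypothetical names and interface; the paper's loss may differ in detail.
import torch
import torch.nn.functional as F


def mask_agnostic_loss(model, prompt_ids, answer_ids, mask_id,
                       n_masks_short=16, n_masks_long=256):
    """Combine a denoising term with a penalty that keeps predictions
    invariant to the number of appended mask tokens.

    Assumes the answer is no longer than `n_masks_short` tokens and that
    `model(input_ids).logits` returns (batch, seq_len, vocab) scores.
    """
    batch = prompt_ids.size(0)
    device = prompt_ids.device

    def predict(n_masks):
        # Append `n_masks` mask tokens after the prompt and return the
        # logits at the answer positions (the first masked positions).
        masks = torch.full((batch, n_masks), mask_id, device=device)
        inputs = torch.cat([prompt_ids, masks], dim=1)
        logits = model(inputs).logits
        start = prompt_ids.size(1)
        return logits[:, start:start + answer_ids.size(1)]

    logits_short = predict(n_masks_short)
    logits_long = predict(n_masks_long)

    # Standard denoising term on the short-mask pass.
    ce = F.cross_entropy(
        logits_short.reshape(-1, logits_short.size(-1)),
        answer_ids.reshape(-1),
    )
    # Consistency term: the long-mask predictions should match the
    # short-mask predictions (KL between the two distributions).
    kl = F.kl_div(
        F.log_softmax(logits_long, dim=-1),
        F.log_softmax(logits_short, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return ce + kl
```

The consistency term could just as well be a symmetric KL, an L2 distance between logits, or use a stop-gradient on the short-mask pass; the abstract only specifies that predictions should remain invariant to the number of appended masks, so the exact penalty is a design choice.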