Attention Sinks in Diffusion Language Models

October 17, 2025
Authors: Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, Alessio Devoto
cs.AI

Abstract

Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only a minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.
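
As a rough illustration of the kind of analysis the abstract describes, the sketch below shows one common way to flag candidate attention-sink positions from a model's attention weights: a key token is treated as a candidate sink when it absorbs a disproportionate share of attention mass averaged over heads and query positions. The function name, tensor shapes, and threshold are assumptions for illustration only, not the paper's actual procedure.

import torch

def find_attention_sinks(attn: torch.Tensor, threshold: float = 0.3) -> torch.Tensor:
    """attn: attention weights of shape (num_heads, seq_len, seq_len),
    with each query row summing to 1. Returns indices of candidate sink tokens."""
    # Average attention mass received by each key position, over heads and queries.
    per_key_mass = attn.mean(dim=(0, 1))  # shape: (seq_len,)
    # A position is a candidate sink if it receives far more than the uniform share.
    uniform_share = 1.0 / attn.shape[-1]
    return torch.nonzero(per_key_mass > max(threshold, 5 * uniform_share)).flatten()

# Example with random attention weights (softmax over the key dimension):
scores = torch.randn(8, 16, 16)
attn = scores.softmax(dim=-1)
print(find_attention_sinks(attn))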