Attention Sinks in Diffusion Language Models
October 17, 2025
Authors: Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, Alessio Devoto
cs.AI
Abstract
Masked Diffusion Language Models (DLMs) have recently emerged as a promising
alternative to traditional Autoregressive Models (ARMs). DLMs employ
transformer encoders with bidirectional attention, enabling parallel token
generation while maintaining competitive performance. Although their efficiency
and effectiveness have been extensively studied, the internal mechanisms that
govern DLMs remain largely unexplored. In this work, we conduct an empirical
analysis of DLM attention patterns, focusing on the attention sinking
phenomenon, an effect previously observed in various transformer-based
architectures. Our findings reveal that DLMs also exhibit attention sinks, but
with distinct characteristics. First, unlike in ARMs, the sink positions in
DLMs tend to shift throughout the generation process, displaying a dynamic
behaviour. Second, while ARMs are highly sensitive to the removal of attention
sinks, DLMs remain robust: masking sinks leads to only a minor degradation in
performance. These results provide new insights into the inner workings of
diffusion-based language models and highlight fundamental differences in how
they allocate and utilize attention compared to autoregressive models.
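
To make the sink notion concrete, below is a minimal sketch of how one might flag key positions that absorb a disproportionate share of attention mass in a bidirectional attention map, and then mask them out with row renormalization. The threshold value, the averaging over heads and queries, and the find_attention_sinks / mask_sinks helpers are illustrative assumptions for this page, not the detection or masking procedure used in the paper.

```python
import numpy as np

def find_attention_sinks(attn, threshold=0.3):
    """Flag candidate sink positions in one attention map.

    attn: array of shape (num_heads, seq_len, seq_len), where attn[h, q, k]
          is the weight query position q assigns to key position k.
    A position is flagged as a sink when, averaged over heads and queries,
    it receives more than `threshold` of the attention mass (illustrative
    criterion, not the paper's).
    """
    incoming = attn.mean(axis=(0, 1))      # mean incoming mass per key position
    return np.where(incoming > threshold)[0]

def mask_sinks(attn, sink_positions):
    """Zero out attention to sink positions and renormalize each row."""
    masked = attn.copy()
    masked[:, :, sink_positions] = 0.0
    row_sums = masked.sum(axis=-1, keepdims=True)
    return masked / np.clip(row_sums, 1e-9, None)

# Toy example: 2 heads, 6 tokens, with position 0 made to draw heavy attention.
rng = np.random.default_rng(0)
attn = rng.random((2, 6, 6))
attn[:, :, 0] += 5.0                        # turn position 0 into a sink
attn /= attn.sum(axis=-1, keepdims=True)    # rows become valid distributions

sinks = find_attention_sinks(attn)
print("sink positions:", sinks)
print("row sums after masking:", mask_sinks(attn, sinks).sum(axis=-1))
```

Masking sinks in this way is the kind of intervention the abstract refers to when it reports that DLMs degrade only mildly, whereas ARMs are highly sensitive to the removal of sink attention.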