Deconstructing Attention: Investigating Design Principles for Effective Language Modeling
October 13, 2025
Authors: Huiyin Xue, Nafise Sadat Moosavi, Nikolaos Aletras
cs.AI
Abstract
The success of Transformer language models is widely credited to their
dot-product attention mechanism, which interweaves a set of key design
principles: mixing information across positions (enabling multi-token
interactions), sequence-dependent activations (where attention weights adapt to
each input), a specific mathematical form (dot-product similarities plus
softmax weighting), and coupling of queries and keys to evolving hidden states
(grounding attention in the current layer). However, the necessity of each of
these principles remains largely untested. In this work, we systematically
deconstruct attention by designing controlled variants that selectively relax
these principles, applied both uniformly across all layers and in hybrid
architectures where only some layers retain standard attention. Our empirical
analysis reveals that mechanisms for mixing tokens are indispensable, as their
absence collapses models to near-random behavior, while the exact mathematical
form and sequence dependency can be substantially relaxed, especially when
preserved in just a subset of layers. Surprisingly, even variants that fail in
isolation can achieve robust performance when interleaved with standard
attention, highlighting a cooperative effect. These findings deepen our
understanding of what truly underpins attention's effectiveness and open new
avenues for simplifying language models without sacrificing performance.
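To make the design principles named above concrete, the following is a minimal NumPy sketch (not taken from the paper; all names, shapes, and the specific relaxed variant are illustrative). It contrasts standard scaled dot-product attention, where the mixing weights are computed from the current hidden states, with a hypothetical relaxation that keeps mixing across positions but drops both the sequence dependence and the dot-product form in favor of a fixed learned mixing matrix.

```python
# Minimal sketch (assumed, not the paper's code): standard dot-product attention
# vs. an input-independent mixing variant. Causal masking and multi-head structure
# are omitted for brevity.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(h, Wq, Wk, Wv):
    # Standard attention: queries/keys are coupled to the current hidden states h
    # (shape [seq_len, d_model]), so the weights adapt to each input sequence.
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # dot-product similarities
    weights = softmax(scores, axis=-1)        # softmax weighting
    return weights @ v                        # mixing information across positions

def static_mixing(h, M, Wv):
    # Relaxed variant: token mixing is retained, but the mixing matrix M is a fixed
    # learned parameter, i.e. neither sequence-dependent nor a query-key dot product.
    v = h @ Wv
    weights = softmax(M, axis=-1)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d = 8, 16
h = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
M = rng.normal(size=(seq_len, seq_len))
print(standard_attention(h, Wq, Wk, Wv).shape, static_mixing(h, M, Wv).shape)
```

In the abstract's terms, the first function exercises all four principles, while the second keeps only the position-mixing principle, roughly the kind of controlled relaxation the study interleaves with standard attention layers.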