It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
April 17, 2025
Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
cs.AI
Abstract
Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias, the natural tendency to prioritize certain events or stimuli, we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observe that most existing sequence models leverage either (1) dot-product similarity or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with effective approximations that stabilize their training procedure. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework for designing deep learning architectures based on four choices: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models, Moneta, Yaad, and Memora, that go beyond the power of existing linear RNNs while maintaining a fast, parallelizable training process. Our experiments show that different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance on specific tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, even outperforming Transformers and other modern linear recurrent models.