Don't Pay Attention
June 12, 2025
Authors: Mohammad Hammoud, Devang Acharya
cs.AI
Abstract
The Transformer has become the de facto standard for large language models
and a wide range of downstream tasks across various domains. Despite its
numerous advantages, such as inherent training parallelism, the Transformer still
faces key challenges due to its inability to effectively process sequences
beyond a fixed context window and the quadratic complexity of its attention
mechanism. These challenges have renewed interest in RNN-like architectures,
which offer linear scaling with sequence length and improved handling of
long-range dependencies, albeit with limited parallelism due to their
inherently recurrent nature. In this paper, we propose Avey, a new neural
foundational architecture that breaks away from both attention and recurrence.
Avey comprises a ranker and an autoregressive neural processor, which
collaboratively identify and contextualize only the most relevant tokens for
any given token, regardless of their positions in the sequence. Specifically,
Avey decouples sequence length from context width, thus enabling effective
processing of arbitrarily long sequences. Experimental results show that Avey
compares favorably to the Transformer across a variety of standard short-range
NLP benchmarks, while notably excelling at capturing long-range dependencies.
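
The following is a minimal, illustrative sketch of the idea stated in the abstract: for a given token, rank all earlier tokens regardless of their position, keep only the top-k, and contextualize the token against those k alone, so per-token cost tracks the context width k rather than the sequence length. The scoring rule (cosine similarity) and the mixing step (a softmax-weighted blend plus a linear map) are assumptions made here for illustration; they are not Avey's actual ranker or neural processor.

import numpy as np

def rank_and_contextualize(embeddings, query_idx, k=4, rng=None):
    """Toy sketch of rank-then-contextualize.
    Assumed details (not from the paper): cosine-similarity ranking,
    softmax-weighted blending, and a random linear map as the 'processor'."""
    d = embeddings.shape[1]
    query = embeddings[query_idx]

    # Ranker: score every preceding token, no matter how far back it is.
    prev = embeddings[:query_idx]
    scores = prev @ query / (
        np.linalg.norm(prev, axis=1) * np.linalg.norm(query) + 1e-9
    )
    top = np.argsort(scores)[-k:]  # context width is k, not the sequence length

    # Processor stand-in: blend the selected tokens into the current one.
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    context = weights @ prev[top]
    W = (rng or np.random.default_rng(0)).standard_normal((d, d)) / np.sqrt(d)
    return np.tanh((query + context) @ W)

# Usage: the per-token work depends on k, so an arbitrarily long sequence
# can be processed without growing the contextualization cost.
seq = np.random.default_rng(1).standard_normal((1000, 16))
out = rank_and_contextualize(seq, query_idx=999, k=4)
print(out.shape)  # (16,)
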