Don't Pay Attention

Abstract

The Transformer has become the de facto standard for large language models and a wide range of downstream tasks across various domains. Despite its numerous advantages like inherent training parallelism, the Transformer still faces key challenges due to its inability to effectively process sequences beyond a fixed context window and the quadratic complexity of its attention mechanism. These challenges have renewed interest in RNN-like architectures, which offer linear scaling with sequence length and improved handling of long-range dependencies, albeit with limited parallelism due to their inherently recurrent nature. In this paper, we propose Avey, a new neural foundational architecture that breaks away from both attention and recurrence. Avey comprises a ranker and an autoregressive neural processor, which collaboratively identify and contextualize only the most relevant tokens for any given token, regardless of their positions in the sequence. Specifically, Avey decouples sequence length from context width, thus enabling effective processing of arbitrarily long sequences. Experimental results show that Avey compares favorably to the Transformer across a variety of standard short-range NLP benchmarks, while notably excelling at capturing long-range dependencies.
