Cure the headache of Transformers via Collinear Constrained Attention

Abstract

With the rapid progress of practical applications built on Large Language Models, extrapolation performance has become increasingly important in the research community. In this study, we identify a previously overlooked anomalous behavior in Transformer models that causes chaos around the closest tokens, which carry the most important information. We call this discovery the "headache of Transformers". To address it at its root, we introduce a novel self-attention structure named Collinear Constrained Attention (CoCA). This structure can be seamlessly integrated with existing extrapolation and interpolation methods, as well as other optimization strategies designed for traditional Transformer models. Without any fine-tuning, our model achieves excellent extrapolation performance during inference at 16 to 24 times the training sequence length. We have also improved CoCA's computational and space efficiency to ensure its practicality. We plan to open-source CoCA shortly. In the meantime, our code is available in the appendix for reproducing the experiments.

