Long-Context Pre-training with Lighthouse Attention

Abstract

Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory cost of scaled dot-product attention (SDPA). We propose Lighthouse Attention, a training-only, symmetric, selection-based hierarchical attention algorithm that wraps ordinary SDPA and can be easily removed toward the end of training. The hierarchical selection is gradient-free, so no complicated and potentially inefficient backward-pass kernel is needed. Our contribution is three-fold: (i) a subquadratic hierarchical pre- and post-processing step that adaptively compresses and decompresses the sequence; (ii) a symmetric compression strategy that pools queries, keys, and values at the same time while preserving left-to-right causality, which greatly improves parallelism; and (iii) a two-stage training approach in which we pre-train with Lighthouse Attention for the majority of training and then recover a full-attention model with a short final training phase. Preliminary small-scale LLM pre-training experiments, compared against full-attention training with all other settings matched, show the effectiveness of the method: it achieves a shorter total training time and a lower final loss after the recovery phase. Full code is available at: https://github.com/ighoshsubho/lighthouse-attention
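
To make the two-stage idea and the quadratic bottleneck concrete, the following is a minimal illustrative sketch in plain Python. It is not taken from the linked repository. The names `attention_mode` and `causal_score_entries`, the 10% recovery fraction, and the uniform pooling factor are all assumptions chosen for illustration; the method described above uses adaptive hierarchical selection rather than a fixed pooling factor.

```python
def attention_mode(step: int, total_steps: int, recovery_fraction: float = 0.1) -> str:
    """Pick the attention variant for a training step.

    Hypothetical schedule: the final `recovery_fraction` of steps uses plain
    full attention, and every earlier step uses the training-only wrapper.
    The 0.1 default is an assumption, not a value from the paper.
    """
    if not 0.0 < recovery_fraction < 1.0:
        raise ValueError("recovery_fraction must be strictly between 0 and 1")
    recovery_steps = round(total_steps * recovery_fraction)
    switch_step = total_steps - recovery_steps
    return "lighthouse" if step < switch_step else "full"


def causal_score_entries(seq_len: int, pool: int = 1) -> int:
    """Count query-key score entries under a causal mask.

    Assumes queries and keys are pooled together by the same fixed factor
    `pool` (a simplification of symmetric compression), so the pooled length
    is ceil(seq_len / pool) and the causal mask keeps the lower triangle
    including the diagonal.
    """
    pooled_len = -(-seq_len // pool)  # ceiling division
    return pooled_len * (pooled_len + 1) // 2


if __name__ == "__main__":
    # Compare the score-matrix size at full resolution and with 16x pooling.
    full = causal_score_entries(65536)
    pooled = causal_score_entries(65536, pool=16)
    print(f"full: {full}, pooled: {pooled}, ratio: {full / pooled:.1f}")
    # Show where the hypothetical schedule switches to full attention.
    print(attention_mode(899, 1000), attention_mode(900, 1000))
```

Because queries and keys shrink together in this simplified picture, the score matrix shrinks roughly with the square of the pooling factor, which is the intuition behind pooling both sides rather than only keys and values.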