Long-Context Pre-Training with Lighthouse Attention

Abstract

Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory cost of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only, symmetric, selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed toward the end of training. Our hierarchical selection is also gradient-free, which spares us from writing a complicated and potentially inefficient backward-pass kernel. Our contribution is three-fold: (i) a subquadratic hierarchical pre- and post-processing step that adaptively compresses and decompresses the sequence; (ii) a symmetric compression strategy that pools queries, keys, and values at the same time while preserving left-to-right causality, which greatly improves parallelism; (iii) a two-stage training approach in which we pre-train with Lighthouse Attention for the majority of the run and then recover a full-attention model with a short final training phase. We run preliminary small-scale LLM pre-training experiments comparing our method against full-attention training with all other settings matched. In these experiments, our method achieves a shorter total training time and a lower final loss after the recovery phase. Full code is available at: https://github.com/ighoshsubho/lighthouse-attention
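To make the cost argument and the two-stage schedule concrete, the following is a minimal illustrative sketch and not the released implementation. It assumes a single fixed pooling factor, whereas the method described above selects and compresses adaptively and hierarchically. A fixed factor only shrinks the quadratic constant and is not subquadratic by itself. The function names and the `recovery_fraction` default of 0.1 are hypothetical. The abstract states only that Lighthouse Attention is used for the majority of training and that the recovery phase is short.

```python
def causal_pairs(n: int) -> int:
    """Number of (query, key) pairs kept by a causal mask over n tokens."""
    return n * (n + 1) // 2


def pooled_causal_pairs(n: int, block: int) -> int:
    """Causal pairs left when queries and keys are pooled together.

    Assumption: one fixed pooling factor `block` that divides n. This is a
    toy stand-in for the adaptive hierarchical compression in the abstract.
    """
    if block <= 0 or n % block != 0:
        raise ValueError("n must be a positive multiple of block in this toy model")
    return causal_pairs(n // block)


def attention_mode(step: int, total_steps: int, recovery_fraction: float = 0.1) -> str:
    """Choose the attention variant for a step of a two-stage training run.

    Assumption: the recovery phase is the final `recovery_fraction` of the
    steps. The value 0.1 is a placeholder, not a number from the paper.
    """
    if not 0.0 < recovery_fraction < 1.0:
        raise ValueError("recovery_fraction must lie strictly between 0 and 1")
    recovery_start = total_steps - round(total_steps * recovery_fraction)
    return "lighthouse" if step < recovery_start else "full"


if __name__ == "__main__":
    n, block = 65_536, 16
    print("full causal pairs:  ", causal_pairs(n))
    print("pooled causal pairs:", pooled_causal_pairs(n, block))
    print("mode at step 899/1000:", attention_mode(899, 1000))
    print("mode at step 900/1000:", attention_mode(900, 1000))
```

Under these assumptions, a 65,536-token sequence has 2,147,516,416 causal query-key pairs, and pooling queries and keys together by a factor of 16 leaves 8,390,656. Because the same factor is applied to both sides, the pooled sequence is still an ordinary causal attention problem, which is why the compressed sequence can be passed to a standard SDPA kernel.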