Long Context Pre-Training with Lighthouse Attention
May 7, 2026
Authors: Bowen Peng, Subho Ghosh, Jeffrey Quesnelle
cs.AI
Abstract
Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory cost of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only, symmetrical, selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed toward the end of training. Our hierarchical selection is also gradient-free, which spares us from implementing a complicated and potentially inefficient backward-pass kernel. Our contribution is three-fold: (i) a subquadratic hierarchical pre- and post-processing step that adaptively compresses and decompresses the sequence; (ii) a symmetrical compression strategy that pools queries, keys, and values at the same time while preserving left-to-right causality, which greatly improves parallelism; and (iii) a two-stage training approach in which we pre-train for the majority of the time with Lighthouse Attention and then recover a full-attention model with a short final training phase. We run preliminary small-scale LLM pre-training experiments that compare our method against full-attention training with all other settings matched; our method achieves a faster total training time and a lower final loss after the recovery phase. Full code is available at: https://github.com/ighoshsubho/lighthouse-attention
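To make the two-stage schedule and the quadratic bottleneck concrete, the following is a minimal sketch and not the authors' implementation. The function names, the mode labels, and the `recovery_fraction` parameter (with its default of 0.1) are all assumptions introduced for illustration; the abstract only states that the recovery phase is short. The second helper counts the query-key pairs that full causal SDPA must score, which grows quadratically with sequence length.

```python
def attention_mode(step: int, total_steps: int, recovery_fraction: float = 0.1) -> str:
    """Return which attention variant a two-stage schedule would use at `step`.

    `recovery_fraction` is a hypothetical knob (not a value from the paper):
    the final share of training steps that run with ordinary full attention.
    """
    if not 0.0 <= recovery_fraction <= 1.0:
        raise ValueError("recovery_fraction must be in [0, 1]")
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # First step of the recovery phase; everything before it uses the
    # training-only wrapper, everything from it onward uses plain SDPA.
    recovery_start = total_steps - int(round(total_steps * recovery_fraction))
    return "lighthouse" if step < recovery_start else "full"


def causal_score_entries(seq_len: int) -> int:
    """Number of (query, key) pairs scored by full causal SDPA: N * (N + 1) / 2."""
    if seq_len < 0:
        raise ValueError("seq_len must be non-negative")
    return seq_len * (seq_len + 1) // 2


if __name__ == "__main__":
    # Illustrative run: 10 steps, with the last 20% reserved for recovery.
    print([attention_mode(s, 10, 0.2) for s in range(10)])
    # Doubling the sequence length roughly quadruples the scored pairs.
    print(causal_score_entries(1024), causal_score_entries(2048))
```

Under these assumptions, switching modes is a single comparison per step, which reflects the abstract's point that the wrapper can be removed without changing the underlying SDPA call. The actual recovery length and switching logic should be taken from the released code.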