Long-Context Pre-Training with Lighthouse Attention

Abstract

Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory cost of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only, symmetric, selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed towards the end of training. Our hierarchical selection is also gradient-free, which spares us from dealing with a complicated and potentially inefficient backward-pass kernel. Our contribution is threefold: (i) a subquadratic hierarchical pre- and post-processing step that adaptively compresses and decompresses the sequence; (ii) a symmetric compression strategy that pools queries, keys, and values at the same time while preserving left-to-right causality, which greatly improves parallelism; (iii) a two-stage training approach in which we pre-train with Lighthouse Attention for the majority of the time and recover a full-attention model at the end with a short training phase. We run preliminary small-scale LLM pre-training experiments that show the effectiveness of our method compared to full-attention training with all other settings matched: we achieve a faster total training time and a lower final loss after the recovery phase. Full code is available at: https://github.com/ighoshsubho/lighthouse-attention
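To make the two-stage idea concrete, the following is a rough illustrative sketch rather than the released implementation. It counts the query-key scores computed by the inner causal SDPA call when most steps run on a compressed sequence and only the final steps run on the full sequence. The names `causal_score_count`, `attention_mode`, `total_score_count`, `compressed_len`, and `recovery_steps` are hypothetical, the compressed length is an assumed input (the abstract does not state it), and the cost of the subquadratic pre- and post-processing step is ignored.

```python
def causal_score_count(seq_len: int) -> int:
    """Number of query-key scores in causal attention over seq_len positions.

    Position i attends to positions 0..i, giving 1 + 2 + ... + seq_len scores.
    """
    return seq_len * (seq_len + 1) // 2


def attention_mode(step: int, total_steps: int, recovery_steps: int) -> str:
    """Return which attention variant a given training step would use.

    Assumption: the last `recovery_steps` steps form the recovery phase
    with full attention; every earlier step uses the compressed variant.
    """
    recovery_start = total_steps - recovery_steps
    return "full" if step >= recovery_start else "lighthouse"


def total_score_count(seq_len: int, compressed_len: int,
                      total_steps: int, recovery_steps: int) -> int:
    """Sum the inner-SDPA score counts over a whole two-stage run.

    Only the inner attention call is counted; the hierarchical
    compression and decompression work is deliberately left out.
    """
    total = 0
    for step in range(total_steps):
        if attention_mode(step, total_steps, recovery_steps) == "full":
            total += causal_score_count(seq_len)
        else:
            total += causal_score_count(compressed_len)
    return total


if __name__ == "__main__":
    # Toy numbers chosen only for illustration.
    two_stage = total_score_count(seq_len=8, compressed_len=4,
                                  total_steps=10, recovery_steps=2)
    full_only = 10 * causal_score_count(8)
    print(two_stage, full_only)
```

With these toy numbers the two-stage run computes 152 scores against 360 for full attention throughout. The gap depends entirely on the assumed compressed length and on the length of the recovery phase, neither of which is specified in the abstract.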