Adaptive Computation Pruning for the Forgetting Transformer

Abstract

The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on the local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. This is achieved using a dynamically set pruning threshold that ensures that the pruned attention weights remain negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 10% to 35% improvement in training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. We also perform several analyses to provide deeper insights into our method, such as examining the pruning patterns and analyzing the distribution of FLOP savings across different attention heads. Our code is available at https://github.com/zhixuan-lin/arctic-fox.
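The pruning idea described in the abstract can be illustrated with a small, single-head sketch. It is not the authors' implementation (that lives in the linked repository); it only assumes that the forget gate contributes an additive decay bias D[i][j], the sum of log forget-gate values between key j and query i, to the attention logits. The function names, the per-entry loop, and the specific threshold rule (built from a norm bound U on the scaled dot products and a tolerance eps) are hypothetical choices made for illustration.

```python
import math
import random


def fox_attention_weights(q, k, log_f, threshold=None):
    """Causal attention weights for one head with a forget-gate decay bias.

    q, k: lists of T vectors (each a list of d floats).
    log_f: list of T log forget-gate values (each <= 0).
    threshold: if given, entries whose decay bias falls below it are skipped.
    Returns (weights, number_of_logits_computed).
    """
    T, d = len(q), len(q[0])
    # Prefix sums c[t] = sum of log_f[0..t], so D[i][j] = c[i] - c[j].
    c, total = [], 0.0
    for lf in log_f:
        total += lf
        c.append(total)

    weights, computed = [], 0
    for i in range(T):
        logits = {}
        # Walk backwards from the most recent key. The decay bias only
        # decreases as j moves further into the past, so once it drops
        # below the threshold every earlier key can be skipped as well.
        for j in range(i, -1, -1):
            decay = c[i] - c[j]
            if threshold is not None and decay < threshold:
                break
            dot = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            logits[j] = dot + decay
            computed += 1
        m = max(logits.values())
        z = sum(math.exp(v - m) for v in logits.values())
        row = [0.0] * T
        for j, v in logits.items():
            row[j] = math.exp(v - m) / z
        weights.append(row)
    return weights, computed


def dynamic_threshold(q, k, eps=1e-3):
    """Hypothetical threshold rule (an assumption, not taken from the paper).

    U bounds |q_i . k_j| / sqrt(d). A pruned logit is below U + delta, while
    the diagonal logit is at least -U, so each pruned weight is below
    exp(2U + delta). Choosing delta = -2U + log(eps / T) keeps the total
    pruned weight in any row below eps.
    """
    T, d = len(q), len(q[0])
    max_q = max(math.sqrt(sum(x * x for x in v)) for v in q)
    max_k = max(math.sqrt(sum(x * x for x in v)) for v in k)
    U = max_q * max_k / math.sqrt(d)
    return -2.0 * U + math.log(eps / T)


if __name__ == "__main__":
    random.seed(0)
    T, d = 64, 4
    q = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
    k = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
    log_f = [math.log(random.uniform(0.2, 0.6)) for _ in range(T)]
    delta = dynamic_threshold(q, k)
    _, n_full = fox_attention_weights(q, k, log_f)
    _, n_pruned = fox_attention_weights(q, k, log_f, threshold=delta)
    print("logits computed without pruning:", n_full)
    print("logits computed with pruning:   ", n_pruned)
```

Because the threshold is recomputed from the current queries and keys, the amount of pruning adapts to how quickly a given head forgets: heads with strong decay skip most of the distant past, while heads with forget gates near 1 prune little or nothing. A practical kernel would presumably skip whole blocks of the attention matrix rather than individual entries; this sketch works per entry only to keep the logic visible.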
