Efficient Pretraining Length Scaling

Abstract

Recent advances in large language models have demonstrated the effectiveness of length scaling during post-training, yet its potential in pre-training remains underexplored. We present the Parallel Hidden Decoding Transformer (PHD-Transformer), a novel framework that enables efficient length scaling during pre-training while maintaining inference efficiency. PHD-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens. By retaining only the KV cache of original tokens for long-range dependencies while immediately discarding hidden decoding tokens after use, our approach maintains the same KV cache size as the vanilla transformer while enabling effective length scaling. To further enhance performance, we introduce two optimized variants: PHD-SWA employs sliding window attention to preserve local dependencies, while PHD-CSWA implements chunk-wise sliding window attention to eliminate linear growth in pre-filling time. Extensive experiments demonstrate consistent improvements across multiple benchmarks.
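The cache policy described in the abstract can be illustrated with a toy bookkeeping sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function name `simulate_cache`, the `repeats` parameter (one original token plus `repeats - 1` hidden decoding tokens), and the `keep_hidden` flag are hypothetical, and the cache holds labels rather than key/value tensors.

```python
def simulate_cache(num_tokens, repeats, keep_hidden=False):
    """Count KV cache entries under a simplified retention policy.

    Assumption (not from the paper): each original token is accompanied by
    `repeats - 1` hidden decoding tokens. With keep_hidden=False the hidden
    entries are dropped right after use, as the abstract describes; with
    keep_hidden=True every entry is kept, which models naive length scaling.

    Returns (final_cache_size, peak_entries_visible_in_one_step).
    """
    cache = []  # entries that persist across decoding steps
    peak = 0
    for t in range(num_tokens):
        # The original token's KV entry is retained for long-range dependencies.
        cache.append(("orig", t))
        # Hidden decoding tokens exist only while this step is processed.
        hidden = [("hidden", t, r) for r in range(1, repeats)]
        peak = max(peak, len(cache) + len(hidden))
        if keep_hidden:
            cache.extend(hidden)
        # Otherwise `hidden` is simply discarded here.
    return len(cache), peak


if __name__ == "__main__":
    phd_size, phd_peak = simulate_cache(num_tokens=5, repeats=3)
    naive_size, naive_peak = simulate_cache(num_tokens=5, repeats=3, keep_hidden=True)
    print("discard hidden:", phd_size, phd_peak)
    print("keep everything:", naive_size, naive_peak)
```

In this toy model, discarding hidden entries leaves the persistent cache at one entry per original token, the same count a vanilla transformer would hold, while keeping every entry grows the cache by the factor `repeats`. The PHD-SWA and PHD-CSWA variants are not modeled here.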
