DiJiang: Efficient Large Language Models through Compact Kernelization

Abstract

In an effort to reduce the computational load of Transformers, research on linear attention has gained significant momentum. However, improvement strategies for attention mechanisms typically require extensive retraining, which is impractical for large language models with vast numbers of parameters. In this paper, we present DiJiang, a novel Frequency Domain Kernelization approach that enables the transformation of a pre-trained vanilla Transformer into a linear-complexity model at little training cost. By employing a weighted Quasi-Monte Carlo method for sampling, the proposed approach theoretically offers superior approximation efficiency. To further reduce the computational complexity of training, our kernelization is based on Discrete Cosine Transform (DCT) operations. Extensive experiments demonstrate that the proposed method achieves performance comparable to the original Transformer, with significantly reduced training costs and much faster inference speeds. Our DiJiang-7B achieves performance comparable to LLaMA2-7B on various benchmarks while requiring only about 1/50 of the training cost. Code is available at https://github.com/YuchuanTian/DiJiang.
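To illustrate why a kernelized attention can run in linear time, here is a minimal sketch that assumes NumPy. It is not the authors' implementation: the feature map (the exponential of DCT coefficients) is a simplified stand-in, the weighting and Quasi-Monte Carlo sampling are omitted, and all function names are hypothetical. It only shows that summing over the keys first gives the same output as building the full n x n score matrix.

```python
import numpy as np

def dct_matrix(d):
    """Orthonormal DCT-II matrix of shape (d, d)."""
    k = np.arange(d)[:, None]   # frequency index (rows)
    n = np.arange(d)[None, :]   # position index (columns)
    C = np.cos(np.pi * (2 * n + 1) * k / (2 * d)) * np.sqrt(2.0 / d)
    C[0, :] /= np.sqrt(2.0)     # rescale the DC row for orthonormality
    return C

def feature_map(X, C):
    """Hypothetical positive feature map: exp of the DCT coefficients."""
    return np.exp(X @ C.T)

def attention_quadratic(Q, K, V, C):
    """Builds the full n x n score matrix, so cost grows as n^2."""
    A = feature_map(Q, C) @ feature_map(K, C).T
    return (A @ V) / A.sum(axis=1, keepdims=True)

def attention_linear(Q, K, V, C):
    """Same output, but keys and values are summed first, so cost grows as n."""
    phi_q, phi_k = feature_map(Q, C), feature_map(K, C)
    kv = phi_k.T @ V            # (d, d_v) summary of all keys and values
    z = phi_k.sum(axis=0)       # (d,) normalizer summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4                     # toy sequence length and head dimension
Q, K, V = (rng.normal(size=(n, d)) * 0.5 for _ in range(3))
C = dct_matrix(d)
out_quadratic = attention_quadratic(Q, K, V, C)
out_linear = attention_linear(Q, K, V, C)
```

The two outputs agree up to floating-point error because matrix multiplication is associative. The sketch is non-causal; a causal variant would replace the two sums over keys with running prefix sums.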
