DiJiang: Efficient Large Language Models through Compact Kernelization

March 29, 2024
作者: Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, Yunhe Wang
cs.AI

Abstract

In an effort to reduce the computational load of Transformers, research on linear attention has gained significant momentum. However, improvement strategies for attention mechanisms typically necessitate extensive retraining, which is impractical for large language models with a vast number of parameters. In this paper, we present DiJiang, a novel Frequency Domain Kernelization approach that enables the transformation of a pre-trained vanilla Transformer into a linear-complexity model with little training cost. By employing a weighted Quasi-Monte Carlo method for sampling, the proposed approach theoretically offers superior approximation efficiency. To further reduce the training computational complexity, our kernelization is based on Discrete Cosine Transform (DCT) operations. Extensive experiments demonstrate that the proposed method achieves performance comparable to the original Transformer, but with significantly reduced training costs and much faster inference speeds. Our DiJiang-7B achieves performance comparable to LLaMA2-7B on various benchmarks while requiring only about 1/50 of the training cost. Code is available at https://github.com/YuchuanTian/DiJiang.
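For a concrete picture of the mechanism the abstract refers to, the sketch below shows kernelized (linear) attention with a DCT-derived feature map: queries and keys are passed through the feature map, after which attention reduces to two matrix products whose cost grows linearly with sequence length. The particular feature map used here (an exponential of orthonormal type-II DCT coefficients) and the function names are illustrative assumptions for this sketch, not the exact DiJiang formulation or its weighted Quasi-Monte Carlo sampling.

```python
# Minimal sketch of DCT-based kernelized (linear) attention.
# NOTE: the feature map below is an illustrative assumption, not the exact
# DiJiang method; it only shows how a DCT-derived feature map turns the
# O(n^2) softmax attention into an O(n) computation.
import numpy as np
from scipy.fft import dct


def dct_feature_map(x: np.ndarray) -> np.ndarray:
    """Map each row of x (shape [n, d]) to a non-negative feature vector
    via a type-II DCT along the head dimension."""
    coeffs = dct(x, type=2, axis=-1, norm="ortho")
    # Exponential keeps the features positive so the normalizer stays valid.
    return np.exp(coeffs - coeffs.max(axis=-1, keepdims=True))


def linear_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Kernelized attention: phi(Q) @ (phi(K)^T V), normalized row-wise.
    Cost is O(n * d^2) rather than O(n^2 * d)."""
    phi_q, phi_k = dct_feature_map(q), dct_feature_map(k)    # [n, d] each
    kv = phi_k.T @ v                                          # [d, d_v], computed once
    normalizer = phi_q @ phi_k.sum(axis=0, keepdims=True).T   # [n, 1]
    return (phi_q @ kv) / (normalizer + 1e-6)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 128, 32
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    out = linear_attention(q, k, v)
    print(out.shape)  # (128, 32)
```

The key point is that phi(K)^T V is a small d x d_v matrix computed once, so the n x n attention matrix is never materialized; this is what makes both training and inference scale linearly in sequence length.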
