Kolmogorov-Arnold Transformer
September 16, 2024
Authors: Xingyi Yang, Xinchao Wang
cs.AI
Abstract
Transformers stand as the cornerstone of modern deep learning.
Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix
the information between channels. In this paper, we introduce the
Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP
layers with Kolmogorov-Arnold Network (KAN) layers to enhance the
expressiveness and performance of the model. Integrating KANs into
transformers, however, is no easy feat, especially when scaled up.
Specifically, we identify three key challenges: (C1) Base function. The
standard B-spline function used in KANs is not optimized for parallel computing
on modern hardware, resulting in slower inference speeds. (C2) Parameter and
Computation Inefficiency. KAN requires a unique function for each input-output
pair, making the computation extremely large. (C3) Weight initialization. The
initialization of weights in KANs is particularly challenging due to their
learnable activation functions, which are critical for achieving convergence in
deep neural networks. To overcome the aforementioned challenges, we propose
three key solutions: (S1) Rational basis. We replace B-spline functions with
rational functions to improve compatibility with modern GPUs. By implementing
this in CUDA, we achieve faster computations. (S2) Group KAN. We share the
activation weights through a group of neurons, to reduce the computational load
without sacrificing performance. (S3) Variance-preserving initialization. We
carefully initialize the activation weights to make sure that the activation
variance is maintained across layers. With these designs, KAT scales
effectively and readily outperforms traditional MLP-based transformers.
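To make the three solutions concrete, the sketch below illustrates one way a grouped rational layer could replace an MLP block in PyTorch. It is a minimal illustration under stated assumptions, not the paper's implementation: the safe rational form P(x) / (1 + |Q(x)|), the polynomial orders, the group count of 8, the activation-then-linear ordering, the empirical variance-matching step, and the names GroupRationalActivation / GRKANLayer are all choices made for this example.

```python
# Hypothetical GR-KAN-style layer: grouped rational activation + linear mix.
import torch
import torch.nn as nn


class GroupRationalActivation(nn.Module):
    """Learnable rational activation phi(x) = P(x) / (1 + |Q(x)|), shared per channel group."""

    def __init__(self, num_groups: int, p_order: int = 5, q_order: int = 4):
        super().__init__()
        # (S2) One coefficient set per group rather than one per input-output pair.
        self.num_groups = num_groups
        self.p_coef = nn.Parameter(torch.zeros(num_groups, p_order + 1))
        self.q_coef = nn.Parameter(torch.zeros(num_groups, q_order))
        with torch.no_grad():
            self.p_coef[:, 1] = 1.0  # start close to the identity function

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., channels); channels are split into equal-sized groups.
        *lead, c = x.shape
        xg = x.reshape(*lead, self.num_groups, c // self.num_groups)
        # (S1) Rational basis: low-order polynomials use only elementwise ops,
        # which map to GPU kernels more directly than B-spline recursions.
        p_pows = torch.stack([xg ** i for i in range(self.p_coef.shape[-1])], dim=-1)
        q_pows = torch.stack([xg ** (i + 1) for i in range(self.q_coef.shape[-1])], dim=-1)
        num = (p_pows * self.p_coef[:, None, :]).sum(-1)
        den = 1.0 + (q_pows * self.q_coef[:, None, :]).sum(-1).abs()
        return (num / den).reshape(*lead, c)


class GRKANLayer(nn.Module):
    """Grouped rational activation followed by a linear mix, standing in for an MLP block."""

    def __init__(self, dim_in: int, dim_out: int, num_groups: int = 8):
        super().__init__()
        assert dim_in % num_groups == 0, "channels must divide evenly into groups"
        self.act = GroupRationalActivation(num_groups)
        self.linear = nn.Linear(dim_in, dim_out)
        self._variance_preserving_init(dim_in)

    @torch.no_grad()
    def _variance_preserving_init(self, dim_in: int) -> None:
        # (S3) Assumed recipe: estimate E[phi(z)^2] for z ~ N(0, 1) and scale the
        # linear weights so the output variance roughly matches the input variance.
        z = torch.randn(4096, dim_in)
        second_moment = self.act(z).pow(2).mean().clamp_min(1e-6)
        std = (1.0 / (dim_in * second_moment)).sqrt().item()
        self.linear.weight.normal_(0.0, std)
        self.linear.bias.zero_()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(self.act(x))


# Usage: a drop-in channel mixer for a transformer block (shapes are illustrative).
tokens = torch.randn(2, 196, 768)            # (batch, tokens, channels)
mixer = GRKANLayer(768, 768, num_groups=8)
print(mixer(tokens).shape)                   # torch.Size([2, 196, 768])
```

The activation-then-linear factorization is what lets a shared-function KAN layer reuse ordinary dense kernels; how the rational evaluation is fused into a dedicated CUDA kernel, as the abstract describes, is not shown here.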