Kolmogorov-Arnold Transformer

Abstract

Transformers are the cornerstone of modern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix information between channels. In this paper, we introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and computation inefficiency. KAN requires a unique function for each input-output pair, making the parameter count and computation extremely large. (C3) Weight initialization. The initialization of weights in KANs is particularly challenging due to their learnable activation functions, which are critical for achieving convergence in deep neural networks. To overcome these challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights across a group of neurons to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights so that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers.
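To make (S1) and (S2) more concrete, the following is a minimal pure-Python sketch of a group-wise rational activation. It is an illustration under stated assumptions, not the paper's CUDA implementation. The abstract does not specify the exact rational form, so the "safe" denominator 1 + |Q(x)| used here is an assumption. The function names (`rational`, `group_rational`, `activation_param_counts`), the contiguous channel grouping, and all coefficient values are hypothetical. The parameter-count helper covers activation coefficients only; any linear mixing weights would be counted separately.

```python
from typing import List, Sequence, Tuple


def rational(x: float, a: Sequence[float], b: Sequence[float]) -> float:
    """Evaluate P(x) / (1 + |Q(x)|) using Horner's rule.

    a holds the numerator coefficients a0..am.
    b holds the denominator coefficients b1..bn (no constant term).
    The 1 + |Q(x)| denominator is an assumed "safe" form: it is always
    at least 1, so the function has no poles.
    """
    p = 0.0
    for coeff in reversed(a):
        p = p * x + coeff
    q = 0.0
    for coeff in reversed(b):
        q = q * x + coeff
    q *= x  # shift powers so that Q(x) = b1*x + ... + bn*x^n
    return p / (1.0 + abs(q))


def group_rational(
    x: Sequence[float],
    a_groups: Sequence[Sequence[float]],
    b_groups: Sequence[Sequence[float]],
) -> List[float]:
    """Apply one shared rational function to each contiguous channel group."""
    n_groups = len(a_groups)
    if len(b_groups) != n_groups:
        raise ValueError("a_groups and b_groups must have the same length")
    if len(x) % n_groups != 0:
        raise ValueError("channel count must be divisible by group count")
    size = len(x) // n_groups  # channels per group
    return [
        rational(v, a_groups[i // size], b_groups[i // size])
        for i, v in enumerate(x)
    ]


def activation_param_counts(
    d_in: int, d_out: int, n_coeff: int, n_groups: int
) -> Tuple[int, int]:
    """Count activation coefficients: one function per edge vs. per group."""
    per_edge = d_in * d_out * n_coeff  # a distinct function for every input-output pair
    grouped = n_groups * n_coeff       # one function shared within each group
    return per_edge, grouped


# Hypothetical example: 4 channels split into 2 groups.
channels = [1.0, 2.0, -1.0, 0.5]
a_groups = [[0.0, 1.0], [1.0, 0.0, 1.0]]  # group 0: P(x) = x, group 1: P(x) = 1 + x^2
b_groups = [[0.0], [1.0]]                 # group 0: Q(x) = 0, group 1: Q(x) = x
activated = group_rational(channels, a_groups, b_groups)
counts = activation_param_counts(d_in=4, d_out=8, n_coeff=6, n_groups=2)
```

In this toy setting, group 0 reduces to the identity and group 1 computes (1 + x^2) / (1 + |x|). With the illustrative sizes above, the per-edge scheme needs 192 activation coefficients while the grouped scheme needs 12, which is the kind of saving (S2) aims for. Because a rational function only needs multiplications, additions, and a single division per element, it maps naturally onto element-wise GPU kernels, which is the motivation given in (S1).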