Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?

Abstract

Kolmogorov-Arnold networks (KANs) are a remarkable innovation built from learnable activation functions, with the potential to capture more complex relationships from data. Although KANs are useful for finding symbolic representations and for continual learning of one-dimensional functions, their effectiveness in diverse machine learning (ML) tasks, such as vision, remains questionable. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the compute and memory costs of training it motivated us to propose a more modular version, and we designed a particular learnable attention called Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on the CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures' performance and generalization capacity by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available at https://subhajitmaity.me/KArAt
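
To make the idea of a learnable attention with a Fourier basis more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not describe the construction, so everything below is an illustrative assumption, including the function names (`fourier_kan_layer`, `learnable_attention_head`), the truncated Fourier series used for each learnable one-dimensional function, and the low-rank projection `W` that stands in for the "more modular version". The sketch only shows how a fixed softmax over the attention scores could, in principle, be swapped for a learnable map.

```python
import numpy as np

def fourier_kan_layer(x, a, b):
    """KAN-style layer with a Fourier basis (illustrative assumption).

    x: (..., N) inputs
    a, b: (r, N, G) learnable cosine / sine coefficients
    Output j is sum_i phi_ji(x_i), where
    phi_ji(t) = sum_g a[j, i, g] * cos(g * t) + b[j, i, g] * sin(g * t), g = 1..G.
    """
    G = a.shape[-1]
    freqs = np.arange(1, G + 1)            # frequencies 1..G
    angles = x[..., None] * freqs          # (..., N, G)
    cos_part = np.einsum("...ig,jig->...j", np.cos(angles), a)
    sin_part = np.einsum("...ig,jig->...j", np.sin(angles), b)
    return cos_part + sin_part             # (..., r)

def learnable_attention_head(Q, K, V, a, b, W):
    """Single attention head where softmax is replaced by a learnable map.

    Q, K, V: (N, d) token projections
    a, b: (r, N, G) Fourier coefficients; W: (N, r) hypothetical up-projection
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (N, N) scaled dot products
    hidden = fourier_kan_layer(scores, a, b)   # (N, r) learnable activations per row
    attn = hidden @ W.T                        # (N, N) back to token dimension
    return attn @ V                            # (N, d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d, r, G = 4, 3, 2, 3
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    a = rng.standard_normal((r, N, G)) * 0.1
    b = rng.standard_normal((r, N, G)) * 0.1
    W = rng.standard_normal((N, r)) * 0.1
    print(learnable_attention_head(Q, K, V, a, b, W).shape)
```

Unlike softmax, the rows of `attn` in this sketch are not constrained to be non-negative or to sum to one. Projecting each row of scores to `r` dimensions before mapping back keeps the number of learnable functions at `r * N` rather than `N * N`, which is one plausible reading of the cost concern raised in the abstract.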
