

Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?

March 13, 2025
Authors: Subhajit Maity, Killian Hitsman, Xin Li, Aritra Dutta
cs.AI

Abstract

Kolmogorov-Arnold networks (KANs) are a remarkable innovation built on learnable activation functions, with the potential to capture more complex relationships from data. Although KANs are useful for finding symbolic representations and for continual learning of one-dimensional functions, their effectiveness in diverse machine learning (ML) tasks, such as vision, remains questionable. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep network architectures, including advanced architectures such as vision Transformers (ViTs). In this paper, we are the first to design a general learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate on any choice of basis. However, the computing and memory costs of training it motivated us to propose a more modular version, and we designed a particular learnable attention, Fourier-KArAt. Fourier-KArAt and its variants either outperform their ViT counterparts or show comparable performance on the CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures' performance and generalization capacity by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and spectral behavior, and contrast them with vanilla ViTs. The goal of this paper is not to produce parameter- and compute-efficient attention, but to encourage the community to explore KANs in conjunction with more advanced architectures that require a careful understanding of learnable activations. Our open-source code and implementation details are available at https://subhajitmaity.me/KArAt
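The abstract describes replacing the fixed softmax in ViT self-attention with a learnable activation parameterized in a chosen basis, with Fourier-KArAt using a Fourier basis. The PyTorch sketch below is only an illustration of that idea under stated assumptions: the module name, the shared Fourier coefficients across heads, and the simple L1 row normalization are hypothetical choices, not the authors' implementation (see the linked repository for the actual one).

```python
# Minimal, hypothetical sketch of Fourier-basis "learnable attention":
# the softmax over query-key scores is replaced by a learnable Fourier-series
# activation. All design details here are illustrative assumptions.
import torch
import torch.nn as nn


class FourierLearnableAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, grid_size: int = 3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable Fourier coefficients (shared across heads for simplicity):
        # phi(x) = sum_k a_k * cos(k x) + b_k * sin(k x), k = 1..grid_size
        self.fourier_coeffs = nn.Parameter(torch.randn(2, grid_size) * 0.1)

    def learnable_activation(self, scores: torch.Tensor) -> torch.Tensor:
        # Apply the Fourier-series activation elementwise to the score matrix.
        k = torch.arange(1, self.fourier_coeffs.shape[1] + 1,
                         device=scores.device, dtype=scores.dtype)
        x = scores.unsqueeze(-1) * k  # (..., N, N, grid_size)
        out = (self.fourier_coeffs[0] * torch.cos(x)
               + self.fourier_coeffs[1] * torch.sin(x)).sum(dim=-1)
        # Simple L1 row normalization so rows still behave like attention
        # weights (an assumed choice, not the paper's exact scheme).
        return out / (out.abs().sum(dim=-1, keepdim=True) + 1e-6)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        scores = (q @ k.transpose(-2, -1)) * self.scale
        attn = self.learnable_activation(scores)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    tokens = torch.randn(2, 16, 64)  # (batch, tokens, embedding dim)
    print(FourierLearnableAttention(dim=64)(tokens).shape)  # torch.Size([2, 16, 64])
```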
