Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?
March 13, 2025
Authors: Subhajit Maity, Killian Hitsman, Xin Li, Aritra Dutta
cs.AI
Abstract
Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of
learnable activation functions with the potential to capture more complex
relationships from data. Although KANs are useful in finding symbolic
representations and continual learning of one-dimensional functions, their
effectiveness in diverse machine learning (ML) tasks, such as vision, remains
questionable. Presently, KANs are deployed by replacing multilayer perceptrons
(MLPs) in deep network architectures, including advanced architectures such as
vision Transformers (ViTs). In this paper, we are the first to design a general
learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate
on any choice of basis. However, the computing and memory costs of training
them motivated us to propose a more modular version, and we designed particular
learnable attention, called Fourier-KArAt. Fourier-KArAt and its variants
either outperform their ViT counterparts or show comparable performance on
CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures'
performance and generalization capacity by analyzing their loss landscapes,
weight distributions, optimizer paths, attention visualizations, and spectral
behavior, and contrast them with vanilla ViTs. The goal of this paper is not to
produce parameter- and compute-efficient attention, but to encourage the
community to explore KANs in conjunction with more advanced architectures that
require a careful understanding of learnable activations. Our open-source code
and implementation details are available at: https://subhajitmaity.me/KArAt
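
As a rough illustration of the idea only (not the authors' actual formulation, which is given in the paper and code), the sketch below replaces the fixed softmax in a ViT attention head with a learnable Fourier-series activation applied elementwise to the attention scores. The class name, number of frequencies, coefficient sharing across heads, and the row-wise normalization are all assumptions made for this sketch.

```python
# Minimal sketch, assuming a PyTorch ViT-style attention head where the softmax
# is swapped for a learnable Fourier-basis activation phi(s) = sum_k a_k cos(ks) + b_k sin(ks).
import torch
import torch.nn as nn

class FourierLearnableAttention(nn.Module):
    def __init__(self, dim, num_heads=4, num_frequencies=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable sine/cosine coefficients, shared across heads (an assumption).
        self.a = nn.Parameter(torch.randn(num_frequencies) * 0.1)  # cosine coefficients
        self.b = nn.Parameter(torch.randn(num_frequencies) * 0.1)  # sine coefficients
        self.register_buffer("freqs", torch.arange(1, num_frequencies + 1).float())

    def learnable_activation(self, s):
        # Apply phi elementwise to the score matrix s of shape (B, H, N, N).
        s = s.unsqueeze(-1)                                   # (B, H, N, N, 1)
        phi = self.a * torch.cos(self.freqs * s) + self.b * torch.sin(self.freqs * s)
        return phi.sum(dim=-1)                                # (B, H, N, N)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                  # each (B, H, N, head_dim)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = self.learnable_activation(scores)
        # Row-wise normalization so each query's weights sum to (roughly) 1 (an assumption).
        attn = attn / (attn.sum(dim=-1, keepdim=True).abs() + 1e-6)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

In a full model, a block like this would slot into a standard ViT encoder layer in place of its softmax attention; the learnable coefficients are then trained end-to-end along with the rest of the network.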