Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?
March 13, 2025
Authors: Subhajit Maity, Killian Hitsman, Xin Li, Aritra Dutta
cs.AI
Abstract
Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of
learnable activation functions with the potential to capture more complex
relationships from data. Although KANs are useful in finding symbolic
representations and continual learning of one-dimensional functions, their
effectiveness in diverse machine learning (ML) tasks, such as vision, remains
questionable. Presently, KANs are deployed by replacing multilayer perceptrons
(MLPs) in deep network architectures, including advanced architectures such as
vision Transformers (ViTs). In this paper, we are the first to design a general
learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate
on any choice of basis. However, the computing and memory costs of training
them motivated us to propose a more modular version, and we designed particular
learnable attention, called Fourier-KArAt. Fourier-KArAt and its variants
either outperform their ViT counterparts or show comparable performance on
CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures'
performance and generalization capacity by analyzing their loss landscapes,
weight distributions, optimizer paths, attention visualizations, and spectral
behavior, and contrast them with vanilla ViTs. The goal of this paper is not to
produce parameter- and compute-efficient attention, but to encourage the
community to explore KANs in conjunction with more advanced architectures that
require a careful understanding of learnable activations. Our open-source code
and implementation details are available at: https://subhajitmaity.me/KArAt
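
As a rough illustration of the idea only (not the authors' actual formulation, which is given in the paper and code), the sketch below replaces the fixed softmax in a ViT attention head with a learnable Fourier-series activation applied elementwise to the attention scores. The class name, number of frequencies, coefficient sharing across heads, and the row-wise normalization are all assumptions made for this sketch.

```python
# Minimal sketch, assuming a PyTorch ViT-style attention head where the softmax
# is swapped for a learnable Fourier-basis activation phi(s) = sum_k a_k cos(ks) + b_k sin(ks).
import torch
import torch.nn as nn

class FourierLearnableAttention(nn.Module):
    def __init__(self, dim, num_heads=4, num_frequencies=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable sine/cosine coefficients, shared across heads (an assumption).
        self.a = nn.Parameter(torch.randn(num_frequencies) * 0.1)  # cosine coefficients
        self.b = nn.Parameter(torch.randn(num_frequencies) * 0.1)  # sine coefficients
        self.register_buffer("freqs", torch.arange(1, num_frequencies + 1).float())

    def learnable_activation(self, s):
        # Apply phi elementwise to the score matrix s of shape (B, H, N, N).
        s = s.unsqueeze(-1)                                   # (B, H, N, N, 1)
        phi = self.a * torch.cos(self.freqs * s) + self.b * torch.sin(self.freqs * s)
        return phi.sum(dim=-1)                                # (B, H, N, N)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                  # each (B, H, N, head_dim)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = self.learnable_activation(scores)
        # Row-wise normalization so each query's weights sum to (roughly) 1 (an assumption).
        attn = attn / (attn.sum(dim=-1, keepdim=True).abs() + 1e-6)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

In a full model, a block like this would slot into a standard ViT encoder layer in place of its softmax attention; the learnable coefficients are then trained end-to-end along with the rest of the network.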