Chiaroscuro Attention: Spending Compute in the Dark
June 6, 2026
Author: Prateek Kumar Sikdar
cs.AI
Abstract
Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators (DCT spectral mixing, RBF kernel mixing, or full self-attention) based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves a validation perplexity (Val PPL) of 36.54 on WikiText-103, a 45% reduction relative to a full-attention baseline (PPL 66.62), with 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings, both the wins and the losses, together define when and why spectral routing earns its keep.
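The abstract does not spell out how spectral entropy is computed or how it maps to an operator, so the following is only a minimal sketch of one plausible reading. It assumes the entropy is the Shannon entropy of a token vector's normalised DCT-II power spectrum, and that low entropy goes to DCT mixing, mid-range entropy to RBF mixing, and high entropy to full attention. The function names, the thresholds `low` and `high`, and the direction of the mapping are all hypothetical and not taken from the paper, which may well use a learned router instead.

```python
import math


def dct2(x):
    """Unnormalised DCT-II of a 1-D sequence (O(n^2), for illustration only)."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def spectral_entropy(x, eps=1e-12):
    """Entropy of the normalised DCT power spectrum, scaled to [0, 1].

    Assumption: this is one common definition of spectral entropy; the
    paper's exact formulation is not given in the abstract.
    """
    n = len(x)
    if n < 2:
        return 0.0  # a single coefficient carries no spectral spread
    power = [c * c for c in dct2(x)]
    total = sum(power) + eps  # eps guards against an all-zero input
    probs = [p / total for p in power]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)  # divide by the maximum entropy log(n)


def route(x, low=0.33, high=0.66):
    """Pick an operator name from hypothetical fixed entropy thresholds."""
    h = spectral_entropy(x)
    if h < low:
        return "dct"        # energy concentrated in few frequencies
    if h < high:
        return "rbf"        # intermediate spectral spread
    return "attention"      # energy spread across many frequencies
```

Under these assumptions, a constant vector such as `[1.0] * 8` has all of its energy in the zero-frequency coefficient and is routed to `"dct"`, while an impulse such as `[1.0] + [0.0] * 7` spreads its energy across the spectrum and is routed to `"attention"`. The routing collapse reported above would correspond to the middle branch almost never being taken.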