Chiaroscuro Attention: Spending Compute in the Dark

Abstract

Standard transformers apply self-attention uniformly at every layer and to every token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators (DCT spectral mixing, RBF kernel mixing, or full self-attention) based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, indicating that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves a validation perplexity (Val PPL) of 36.54 on WikiText-103, a 45% reduction relative to a full-attention baseline (PPL 66.62), while using 62.5% fewer attention FLOPs. We extend the evaluation to WikiText-2, IMDB sentiment classification, and the synthetic ListOps task, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text, where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. Together, these findings, both the wins and the losses, define when and why spectral routing earns its keep.
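To make the routing signal concrete, the following is a minimal sketch of per-token spectral entropy and a two-way router. It is an illustration under stated assumptions, not the CHIAR-Former implementation: the abstract does not define the entropy, so the sketch uses the normalised Shannon entropy of each token embedding's FFT power spectrum (a DCT could be substituted). The function names, the threshold of 0.5, and the choice to send high-entropy tokens to attention are all hypothetical.

```python
import numpy as np


def spectral_entropy(x, eps=1e-12):
    """Normalised Shannon entropy of each row's power spectrum, in [0, 1].

    x has shape (num_tokens, d_model). ASSUMPTION: the spectrum is taken
    over the embedding dimension with a real FFT; the paper may differ.
    """
    power = np.abs(np.fft.rfft(x, axis=-1)) ** 2
    # Turn the power spectrum into a probability distribution per token.
    p = power / (power.sum(axis=-1, keepdims=True) + eps)
    h = -(p * np.log(p + eps)).sum(axis=-1)
    # Divide by the maximum possible entropy so the result is scale-free.
    return h / np.log(p.shape[-1])


def route(x, threshold=0.5):
    """Return 0 (spectral mixing) or 1 (full attention) for each token.

    ASSUMPTION: both the threshold and the direction of the rule are
    hypothetical; they only illustrate entropy-based routing.
    """
    return (spectral_entropy(x) > threshold).astype(int)


if __name__ == "__main__":
    n = np.arange(64)
    tone = np.cos(2 * np.pi * 4 * n / 64)  # energy in one frequency bin
    noise = np.random.default_rng(0).standard_normal(64)  # broadband
    tokens = np.stack([tone, noise])
    print(spectral_entropy(tokens), route(tokens))
```

In this sketch, a token whose energy sits in a single frequency bin scores near zero and is sent to the cheap spectral operator, while a broadband token scores high and is sent to attention. A learned router or a three-way split that includes the RBF operator would replace the fixed threshold.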