Chiaroscuro Attention: Spending Compute in the Dark

June 6, 2026
Author: Prateek Kumar Sikdar
cs.AI

Abstract

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators (DCT spectral mixing, RBF kernel mixing, or full self-attention) based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves a validation perplexity (Val PPL) of 36.54 on WikiText-103, 45% lower than the full-attention baseline (PPL 66.62), with 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings, both the wins and the losses, together define when and why spectral routing earns its keep.
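
To make the routing signal concrete, the following is a minimal sketch of one way a per-token spectral entropy could be computed and used to choose between DCT mixing and attention. It is not the paper's implementation: the abstract does not give the entropy definition, the direction of the routing rule, or the router architecture. The DCT-II power-spectrum entropy, the fixed `threshold`, the low-entropy-to-DCT rule, and the names `dct2`, `spectral_entropy`, and `route` are all assumptions made for illustration, and the RBF branch is omitted.

```python
import math


def dct2(x):
    """Unnormalised DCT-II of a sequence of floats (direct O(n^2) form)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def spectral_entropy(x, eps=1e-12):
    """Shannon entropy of the normalised DCT power spectrum, scaled to [0, 1].

    Assumed definition for illustration; requires len(x) >= 2.
    """
    power = [c * c for c in dct2(x)]
    total = sum(power) + eps  # eps guards against an all-zero input
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(x))  # divide by the maximum possible entropy


def route(x, threshold=0.5):
    """Hypothetical hard router: low entropy -> 'dct', otherwise 'attention'."""
    return "dct" if spectral_entropy(x) < threshold else "attention"


if __name__ == "__main__":
    flat_token = [1.0] * 8            # energy concentrated in one DCT coefficient
    spiky_token = [1.0] + [0.0] * 7   # energy spread across many coefficients
    print(route(flat_token), route(spiky_token))
```

Under these assumptions, a token embedding whose energy sits in a single DCT coefficient has near-zero entropy and is sent to the cheap spectral operator, while one whose energy is spread across the spectrum is sent to attention. A learned router, as the routing-collapse result implies the paper uses, would replace the fixed threshold.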