Chiaroscuro Attention: Spending Compute in the Dark

Abstract

Standard transformers apply self-attention uniformly at every layer and to every token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators (DCT spectral mixing, RBF kernel mixing, or full self-attention) based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves a validation perplexity (Val PPL) of 36.54 on WikiText-103, a 45% reduction relative to a full-attention baseline (PPL 66.62), while using 62.5% fewer attention FLOPs. We extend the evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps tasks, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text, where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings, both the wins and the losses, together define when and why spectral routing earns its keep.
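
To make the routing signal concrete, the following is a minimal sketch of one way per-token spectral entropy could be computed and used to choose between DCT mixing and attention. It is an illustration under stated assumptions, not the CHIAR-Former implementation: the abstract does not specify the entropy definition, so the sketch assumes the normalised Shannon entropy of a token vector's DCT power spectrum, taken over the feature dimension. The function names (dct2, spectral_entropy, route), the hard threshold of 0.5, and the two-way choice (mirroring the DCT+Attention-only variant) are all hypothetical.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def spectral_entropy(x, eps=1e-12):
    """Normalised Shannon entropy of the DCT power spectrum, in [0, 1].

    Assumed definition (not taken from the paper). Requires len(x) >= 2.
    """
    power = [c * c for c in dct2(x)]
    total = sum(power) + eps  # eps guards against an all-zero vector
    probs = [p / total for p in power]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(x))  # divide by the maximum entropy log(n)

def route(token_vectors, threshold=0.5):
    """Send low-entropy tokens to 'dct' and high-entropy tokens to 'attention'.

    The hard threshold is a hypothetical stand-in for a learned router.
    """
    return [
        "attention" if spectral_entropy(v) > threshold else "dct"
        for v in token_vectors
    ]

# A constant vector concentrates all energy in one DCT coefficient (low entropy);
# a unit impulse spreads energy across all coefficients (high entropy).
flat = [1.0] * 8
spike = [1.0] + [0.0] * 7
decisions = route([flat, spike])
```

Under these assumptions, the flat vector is routed to the cheap DCT operator and the spike to attention. A trained router would presumably replace the fixed threshold with a learned decision, and might compute the spectrum along a different axis.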