Delta Attention Residuals

Abstract

Attention Residuals replace standard additive residual connections with learned softmax attention over the outputs of previous layers, enabling selective cross-layer routing. However, standard Attention Residuals still attend over the cumulative hidden states of previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and close to uniform (max weight ≈ 0.2), limiting the model's ability to select informative states from previous layers. This raises a key but underexplored design question: which layer-wise representations should be routed in Attention Residuals? To answer it, we propose Delta Attention Residuals, which attend over deltas, i.e., the change introduced by each sublayer (v_i = h_{i+1} - h_i), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ≈ 0.6), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M–7.6B), Delta Attention Residuals consistently outperform both standard residual connections and Attention Residuals, with validation perplexity gains of 1.7–8.2%. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
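
The difference between routing over cumulative states and routing over deltas can be illustrated with a minimal sketch. It is not the released implementation: the dot-product scoring against a single query vector, the random toy residual stream, the omission of the input embedding h_0 from the candidate set, and the helper names `softmax`, `deltas_from_states`, and `route` are all assumptions made for illustration. The sketch only shows the mechanics of forming v_i = h_{i+1} - h_i and mixing candidates with softmax weights; it does not attempt to reproduce the reported attention contrast.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deltas_from_states(states):
    """Return v_i = h_{i+1} - h_i for consecutive hidden states h_0..h_L."""
    return [[b - a for a, b in zip(h, h_next)]
            for h, h_next in zip(states, states[1:])]


def route(candidates, query):
    """Mix candidate vectors with softmax attention weights.

    Scoring by a plain dot product with a query vector is an assumption
    of this sketch, not a detail taken from the abstract.
    """
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    weights = softmax(scores)
    dim = len(candidates[0])
    mixed = [sum(w * cand[d] for w, cand in zip(weights, candidates))
             for d in range(dim)]
    return mixed, weights


# Toy residual stream: each hypothetical sublayer adds a random update,
# so every state h_{i+1} is the running sum of all earlier updates.
rng = random.Random(0)
dim, num_sublayers = 8, 6
states = [[rng.gauss(0.0, 1.0) for _ in range(dim)]]
for _ in range(num_sublayers):
    update = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    states.append([h + u for h, u in zip(states[-1], update)])

query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
deltas = deltas_from_states(states)

# Attention Residuals (as sketched here): candidates are cumulative states.
mixed_cum, w_cum = route(states[1:], query)
# Delta Attention Residuals (as sketched here): candidates are the deltas.
mixed_delta, w_delta = route(deltas, query)

print("max weight over cumulative states:", max(w_cum))
print("max weight over deltas:", max(w_delta))
```

Because the deltas telescope, h_0 plus the sum of all v_i recovers the final state h_L, so routing over deltas exposes the same information as the cumulative states but as separate per-sublayer contributions.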