Cure the headache of Transformers via Collinear Constrained Attention
September 15, 2023
Authors: Shiyi Zhu, Jing Ye, Wei Jiang, Qi Zhang, Yifan Wu, Jianguo Li
cs.AI
Abstract
As practical applications based on Large Language Models continue to progress rapidly, the importance of extrapolation performance has grown exponentially in the research domain. In our study, we identified a previously overlooked anomalous behavior in Transformer models that leads to chaos around the closest tokens, which carry the most important information. We have coined this discovery the "headache of Transformers". To address it at its core, we introduced a novel self-attention structure named Collinear Constrained Attention (CoCA). This structure can be seamlessly integrated with existing extrapolation and interpolation methods, as well as other optimization strategies designed for traditional Transformer models. We achieved excellent extrapolation performance at inference time, even for sequences 16 to 24 times the training length, without any fine-tuning of our model. We have also enhanced CoCA's computational and spatial efficiency to ensure its practicality. We plan to open-source CoCA shortly. In the meantime, we have made our code available in the appendix so the experiments can be reproduced.
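The abstract does not spell out how the collinear constraint is imposed. As a rough illustration only, the following minimal sketch assumes the constraint amounts to building each key as a non-negative rescaling of the query within every rotary-embedding (RoPE) rotation plane, so the initial angle between any query–key pair is zero. The names `rope_rotate`, `coca_scores`, and the gate projection `W_t` are hypothetical, and the naive O(n²·d) loop does not reflect the efficiency enhancements the authors mention.

```python
# Hypothetical sketch of attention scores under a collinear query-key constraint.
# Assumption (not stated in the abstract): k_{i,j} = q_i * t_j with t_j >= 0 shared
# across both coordinates of each RoPE rotation plane, then RoPE is applied as usual.
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Apply rotary position embedding (rotate-half convention); x: (n, d), pos: (n,)
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per rotation plane
    angles = pos[:, None] * freqs[None, :]           # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def coca_scores(x, W_q, W_t, base=10000.0):
    # x: (n, d) token representations; W_q: (d, d); W_t: (d, d // 2) gate projection
    n, d = x.shape
    q = x @ W_q                                      # queries, (n, d)
    t_half = np.abs(x @ W_t)                         # non-negative gate, (n, d // 2)
    t = np.concatenate([t_half, t_half], axis=-1)    # same scale for both coords of a plane
    pos = np.arange(n, dtype=float)
    q_rot = rope_rotate(q, pos, base)                # rotate queries by their own positions
    scores = np.empty((n, n))
    for j in range(n):
        # collinear key for each pair (i, j): k_{i,j} = q_i * t_j, rotated by key position j
        k_ij = q * t[j]                              # (n, d)
        k_rot = rope_rotate(k_ij, np.full(n, j, dtype=float), base)
        scores[:, j] = np.einsum("nd,nd->n", q_rot, k_rot)
    return scores / np.sqrt(d)

# Tiny usage example with random weights (illustration only)
rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.normal(size=(n, d))
S = coca_scores(x, rng.normal(size=(d, d)), rng.normal(size=(d, d // 2)))
print(S.shape)  # (8, 8)
```

Under this assumption, the rotated dot product in each plane reduces to a non-negative magnitude times cos((i − j)·θ) with no initial angle offset, so the score varies smoothly with relative distance for the closest tokens, which is the kind of behavior the abstract describes as curing the "headache of Transformers".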