

Cure the headache of Transformers via Collinear Constrained Attention

September 15, 2023
Authors: Shiyi Zhu, Jing Ye, Wei Jiang, Qi Zhang, Yifan Wu, Jianguo Li
cs.AI

Abstract

As practical applications built on Large Language Models continue to progress rapidly, the importance of extrapolation performance has grown exponentially in the research domain. In our study, we identified a previously overlooked anomalous behavior in Transformer models that leads to chaotic behavior around the closest tokens, which carry the most important information. We have coined this discovery the "headache of Transformers". To address it at its core, we introduce a novel self-attention structure named Collinear Constrained Attention (CoCA). This structure can be seamlessly integrated with existing extrapolation and interpolation methods, as well as other optimization strategies designed for traditional Transformer models. We achieve excellent extrapolation performance, even at 16 to 24 times the sequence length during inference, without any fine-tuning of our model. We have also enhanced CoCA's computational and spatial efficiency to ensure its practicality. We plan to open-source CoCA shortly; in the meantime, our code is available in the appendix for reproducing the experiments.
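
The abstract does not spell out the CoCA mechanism itself. As a reading aid only, the following is a minimal, hedged sketch of what a collinear constraint between queries and keys could look like under rotary position embeddings: keys are built as non-negative, per-rotary-plane scalings of the query, so the initial query-key angle is zero before rotation. The function names, the ReLU choice for non-negativity, and the naive memory layout below are illustrative assumptions, not the authors' implementation (their code is provided in the paper's appendix).

```python
import torch

def rope_cache(seq_len, dim, base=10000.0):
    # Precompute standard RoPE cos/sin tables of shape (seq_len, dim).
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq_len, dim/2)
    angles = torch.cat((angles, angles), dim=-1)                   # (seq_len, dim)
    return angles.cos(), angles.sin()

def rotate_half(x):
    # Standard RoPE helper: swap the two halves of the last dim, negating one.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, cos, sin):
    # Rotate each (i, i + dim/2) coordinate pair by its position-dependent angle.
    return x * cos + rotate_half(x) * sin

def collinear_constrained_attention(q, t, v, cos, sin):
    # q, t, v: (batch, heads, seq, dim); `t` plays the role of the key projection.
    # Hypothetical collinear constraint: in every rotary plane the key is a
    # non-negative multiple of the query, so their pre-rotation angle is zero.
    half = q.shape[-1] // 2
    scale = torch.relu(t[..., :half])          # non-negative per-plane magnitudes
    scale = torch.cat((scale, scale), dim=-1)  # one shared magnitude per rotary plane
    # Naive layout: one key per (query, position) pair -> (b, h, q_len, k_len, dim).
    k = q.unsqueeze(3) * scale.unsqueeze(2)

    q_rot = apply_rope(q, cos, sin)  # rotate queries by their positions
    k_rot = apply_rope(k, cos, sin)  # rotate keys along the key-position axis

    scores = torch.einsum("bhqd,bhqkd->bhqk", q_rot, k_rot) / q.shape[-1] ** 0.5
    q_len, k_len = scores.shape[-2:]
    causal = torch.triu(torch.ones(q_len, k_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return scores.softmax(dim=-1) @ v

# Tiny usage example with random tensors.
b, h, n, d = 1, 2, 8, 16
cos, sin = rope_cache(n, d)
q, t, v = (torch.randn(b, h, n, d) for _ in range(3))
out = collinear_constrained_attention(q, t, v, cos, sin)
print(out.shape)  # torch.Size([1, 2, 8, 16])
```

Note that the per-(query, key) tensor above is quadratic in memory; it is kept only for clarity, and the computational and spatial efficiency improvements the abstract mentions are beyond the scope of this sketch.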