Your Transformer is Secretly Linear
May 19, 2024
作者: Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov
cs.AI
Abstract
This paper reveals a novel linear characteristic exclusive to transformer
decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We
analyze embedding transformations between sequential layers, uncovering a
near-perfect linear relationship (Procrustes similarity score of 0.99).
However, linearity decreases when the residual component is removed due to a
consistently low output norm of the transformer layer. Our experiments show
that removing or linearly approximating some of the most linear blocks of
transformers does not significantly affect the loss or model performance.
Moreover, in our pretraining experiments on smaller models, we introduce a
cosine-similarity-based regularization aimed at reducing layer linearity. This
regularization improves performance metrics on benchmarks like Tiny Stories and
SuperGLUE and also successfully decreases the linearity of the models. This
study challenges the existing understanding of transformer architectures,
suggesting that their operation may be more linear than previously assumed.
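
The layer-to-layer linearity measurement described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' exact procedure: it centers and norm-scales the hidden states of two consecutive layers, fits the best linear map between them by least squares, and reports one minus the normalized residual as a Procrustes-style similarity. The function name and the specific normalization choices are assumptions made for the example.

```python
import numpy as np

def linearity_score(X: np.ndarray, Y: np.ndarray) -> float:
    """Procrustes-style linearity between hidden states of two consecutive layers.

    X, Y: arrays of shape (n_tokens, hidden_dim) holding the hidden states
    of layer k and layer k+1 for the same tokens.

    Returns a value in [0, 1]; 1.0 means Y is an exactly linear function of X.
    """
    # Center and scale each matrix so the score is invariant to shift and scale.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc)
    Yc = Yc / np.linalg.norm(Yc)

    # Best linear map A minimizing ||Xc @ A - Yc||_F, found by least squares.
    A, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    residual = np.linalg.norm(Xc @ A - Yc) ** 2

    # Since ||Yc||_F = 1, the residual lies in [0, 1]; subtracting it from 1
    # gives a similarity score, where values near 1 indicate near-linearity.
    return 1.0 - residual
```

Applied to hidden states taken before and after a decoder block, with the residual stream included, scores near the reported 0.99 would reflect the near-perfect linearity the abstract describes; per the abstract, removing the residual component lowers this score.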
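The cosine-similarity-based regularization used in the pretraining experiments can also be sketched. The abstract does not give its exact form, so the loss term below, which penalizes high cosine similarity between hidden states of consecutive layers, is only one plausible reading; the function name, the `reg_coef` weight, and the sign of the penalty are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def layer_cosine_regularizer(hidden_states, reg_coef: float = 0.1) -> torch.Tensor:
    """One plausible cosine-similarity regularizer over consecutive layers.

    hidden_states: list of tensors, one per layer, each of shape
    (batch, seq_len, hidden_dim), e.g. the hidden_states returned by a
    Hugging Face model called with output_hidden_states=True.

    Penalizes high cosine similarity between each pair of consecutive
    layers, nudging layers away from trivially linear updates.
    """
    penalty = hidden_states[0].new_zeros(())
    for prev, nxt in zip(hidden_states[:-1], hidden_states[1:]):
        # Mean cosine similarity over all token positions in the batch.
        penalty = penalty + F.cosine_similarity(prev, nxt, dim=-1).mean()
    return reg_coef * penalty / (len(hidden_states) - 1)

# During pretraining, such a term would simply be added to the language
# modeling loss:  total_loss = lm_loss + layer_cosine_regularizer(hs)
```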