Your Transformer is Secretly Linear

Abstract

This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers and uncover a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, owing to the consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear transformer blocks does not significantly affect the loss or model performance. Moreover, in our pretraining experiments on smaller models, we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization improves performance metrics on benchmarks such as Tiny Stories and SuperGLUE and also successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
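
To make the layer-to-layer linearity measurement more concrete, here is a minimal sketch of a Procrustes-style linearity score. The abstract does not spell out the formula, so this assumes one plausible definition: center both sets of embeddings, scale each to unit Frobenius norm, fit the best linear map by least squares, and report one minus the squared residual. The function name `linearity_score` and the array shapes are hypothetical and chosen for illustration only.

```python
import numpy as np


def linearity_score(x, y):
    """Score in [0, 1] for how well a linear map explains y from x.

    x, y: arrays of shape (n_tokens, dim) holding embeddings taken from
    two consecutive layers. Assumed definition, not taken from the paper.
    """
    # Center each embedding set and scale it to unit Frobenius norm.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)

    # Best linear map A minimizing ||x @ A - y||_F (ordinary least squares).
    a, *_ = np.linalg.lstsq(x, y, rcond=None)

    # Because y has unit norm, the squared residual lies in [0, 1].
    residual = np.linalg.norm(x @ a - y) ** 2
    return 1.0 - residual


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(500, 16))
    w = rng.normal(size=(16, 16))
    # Residual-style update: the input plus a small linear term.
    print(linearity_score(x, x + 0.1 * x @ w))
    # Strongly saturating nonlinearity for comparison.
    print(linearity_score(x, np.tanh(3 * x @ w)))
```

A score close to 1 means a single matrix maps one layer's embeddings onto the next almost exactly. The number of tokens should exceed the embedding dimension; otherwise the least-squares fit is trivially perfect and the score carries no information.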
