Cure the headache of Transformers via Collinear Constrained Attention

Abstract

As practical applications based on Large Language Models continue to develop rapidly, extrapolation performance has become increasingly important in research. In this study, we identify a previously overlooked anomalous behavior in Transformer models: it causes chaotic behavior around the closest tokens, which carry the most important information. We call this discovery the "headache of Transformers". To address it at its core, we introduce a novel self-attention structure named Collinear Constrained Attention (CoCA). This structure can be seamlessly integrated with existing extrapolation and interpolation methods, as well as other optimization strategies designed for traditional Transformer models. We achieve excellent extrapolation performance even when the inference sequence length is extended by 16 to 24 times, without any fine-tuning of our model. We have also improved CoCA's computational and space efficiency to ensure its practicality. We plan to open-source CoCA shortly. In the meantime, our code is available in the appendix for reproducing the experiments.
