Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

Abstract

The quadratic computational complexity of the self-attention mechanism in popular transformer architectures poses significant challenges for training and inference, particularly in terms of efficiency and memory requirements. To address these challenges, this paper introduces a novel fast computation method for gradient calculation in multi-layer transformer models. Our approach computes the gradients of the entire multi-layer transformer model in almost linear time n^{1+o(1)}, where n is the input sequence length. This significantly reduces the computational bottleneck associated with the traditional quadratic time complexity. Our theory holds for any loss function and maintains a bounded approximation error across the entire model. Furthermore, our analysis continues to hold when the multi-layer transformer model contains many practical sub-modules, such as residual connections, causal masks, and multi-head attention. By improving the efficiency of gradient computation in large language models, we hope that our theoretical results will facilitate more effective training and deployment of long-context language models.
