Blockwise Parallel Transformer for Long Context Large Models

Abstract

Transformers have become the cornerstone of state-of-the-art natural language processing models and show exceptional performance across a wide range of AI applications. However, the memory demands of the self-attention mechanism and the large feedforward network limit their ability to handle long sequences, which creates challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise Parallel Transformer (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training on sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.
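
The abstract names two ingredients, blockwise computation of self-attention and fusion with the feedforward network, without giving details. The following is a minimal sketch of one way these could fit together, not the authors' implementation. It assumes a single attention head, no masking, no residual connections or normalization, and a running-max (online softmax) accumulation over key/value blocks; the function names, the `block` parameter, and the two-layer ReLU feedforward are all hypothetical choices made for illustration.

```python
import math


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def feedforward(x, W1, W2):
    # Illustrative two-layer ReLU network applied to one token vector.
    hidden = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, hidden)


def full_attention(Q, K, V):
    # Reference version: each query scores against every key at once.
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(s)
        w = [math.exp(v - m) for v in s]
        z = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / z
                    for j in range(len(V[0]))])
    return out


def blockwise_attention_ffn(Q, K, V, W1, W2, block=2):
    # Assumed scheme: for each query block, stream over key/value blocks
    # while keeping only a running max, a running normalizer, and a running
    # weighted sum per query. Once the query block's attention output is
    # complete, apply the feedforward network to that block immediately.
    scale = 1.0 / math.sqrt(len(Q[0]))
    d_v = len(V[0])
    out = []
    for qs in range(0, len(Q), block):
        q_blk = Q[qs:qs + block]
        m = [-math.inf] * len(q_blk)          # running max of scores
        z = [0.0] * len(q_blk)                # running softmax normalizer
        acc = [[0.0] * d_v for _ in q_blk]    # running weighted value sum
        for ks in range(0, len(K), block):
            k_blk = K[ks:ks + block]
            v_blk = V[ks:ks + block]
            for i, q in enumerate(q_blk):
                s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
                m_new = max(m[i], max(s))
                corr = math.exp(m[i] - m_new)  # rescales earlier partial sums
                w = [math.exp(v - m_new) for v in s]
                z[i] = z[i] * corr + sum(w)
                acc[i] = [a * corr + sum(wi * v[j] for wi, v in zip(w, v_blk))
                          for j, a in enumerate(acc[i])]
                m[i] = m_new
        for i in range(len(q_blk)):
            attn = [a / z[i] for a in acc[i]]
            out.append(feedforward(attn, W1, W2))
    return out
```

Under these assumptions the blockwise routine produces the same values as running full attention and then the feedforward network, while only ever holding scores for one query block against one key/value block at a time.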
