Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

Abstract

Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism forces sequential execution, which creates a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme that unlocks parallelism across segments in RMTs while preserving exact recurrence. This approach eliminates the sequential constraint, enabling efficient GPU inference even for a single long-context input without complex batching and pipelining techniques. Because the technique is purely a run-time reordering of computation, existing RMT models can adopt it without retraining. Applied to a LLaMA-1B ARMT model, Diagonal Batching yields a 3.3x speedup over standard full-attention LLaMA-1B and a 1.8x speedup over the sequential RMT implementation on 131,072-token sequences. By removing the sequential bottleneck, Diagonal Batching reduces inference cost and latency, thereby strengthening RMTs as a practical solution for real-world, long-context applications.
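To make the scheduling idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a dependency structure the abstract does not spell out: the computation for (segment s, layer l) needs only the output of (s, l-1), the same segment one layer below, and the memory from (s-1, l), the previous segment at the same layer. Under that assumption, all cells with the same value of s + l are mutually independent and could be grouped into one batched call. The names `diagonal_schedule` and `respects_dependencies` are hypothetical.

```python
def diagonal_schedule(num_segments, num_layers):
    """Group (segment, layer) cells into wavefronts with constant segment + layer."""
    groups = []
    for d in range(num_segments + num_layers - 1):
        # Every cell on anti-diagonal d; these are assumed independent of each other.
        group = [(s, d - s) for s in range(num_segments) if 0 <= d - s < num_layers]
        groups.append(group)
    return groups


def respects_dependencies(groups):
    """Check that each cell runs strictly after the cells it is assumed to need."""
    step_of = {cell: i for i, group in enumerate(groups) for cell in group}
    for (s, l), i in step_of.items():
        # Assumed dependency 1: same segment, previous layer.
        if l > 0 and step_of[(s, l - 1)] >= i:
            return False
        # Assumed dependency 2: previous segment, same layer (memory update).
        if s > 0 and step_of[(s - 1, l)] >= i:
            return False
    return True


groups = diagonal_schedule(num_segments=4, num_layers=3)
for step, group in enumerate(groups):
    print(step, group)
```

With 4 segments and 3 layers, the 12 cells are covered in 4 + 3 - 1 = 6 grouped steps instead of 12 sequential ones, and the order of every recurrent update is unchanged, which is consistent with the abstract's claim that the method is a pure reordering. How each group maps onto an actual batched GPU kernel is outside the scope of this sketch.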

