Efficient Long-context Language Model Training by Core Attention Disaggregation

Abstract

We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data and pipeline parallel stragglers, and achieves near-perfect compute and memory balance.
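The balancing idea can be illustrated with a small sketch. Because core attention is stateless, a token-level shard can be described by its document and query range alone, and its cost can be estimated from the number of query-key pairs it touches. The code below is only an illustration under stated assumptions: it assumes causal attention, uses a greedy longest-processing-time heuristic as a stand-in for whatever scheduler DistCA actually uses (the abstract does not specify one), and all function names, document lengths, and shard sizes are hypothetical.

```python
import heapq


def causal_cost(start: int, end: int) -> int:
    """Query-key pairs for queries in [start, end) of one document under a
    causal mask: query i (0-indexed) attends to keys 0..i, i.e. i + 1 keys."""
    return (end * (end + 1) - start * (start + 1)) // 2


def make_tasks(doc_lengths, shard_tokens):
    """Split each document into token-level shards of at most shard_tokens
    queries. Each task is (cost, doc_id, start, end)."""
    tasks = []
    for doc_id, n in enumerate(doc_lengths):
        for start in range(0, n, shard_tokens):
            end = min(start + shard_tokens, n)
            tasks.append((causal_cost(start, end), doc_id, start, end))
    return tasks


def balance(tasks, num_servers):
    """Greedy assignment (assumed heuristic, not DistCA's scheduler):
    take tasks from most to least expensive and give each one to the
    currently least-loaded attention server."""
    heap = [(0, server) for server in range(num_servers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_servers)]
    for task in sorted(tasks, reverse=True):
        load, server = heapq.heappop(heap)
        assignment[server].append(task)
        heapq.heappush(heap, (load + task[0], server))
    loads = [sum(t[0] for t in server_tasks) for server_tasks in assignment]
    return assignment, loads


# Hypothetical batch: one long document and several short ones.
doc_lengths = [65536, 8192, 8192, 4096, 2048]
tasks = make_tasks(doc_lengths, shard_tokens=1024)
assignment, loads = balance(tasks, num_servers=4)
print("max/min load ratio:", max(loads) / min(loads))
```

Placing whole documents on devices would leave the device holding the 65536-token document with almost all of the attention work, since cost grows quadratically with length. Splitting at the token level gives the scheduler many smaller tasks to spread out. The sketch ignores communication cost and kernel efficiency, both of which the abstract names as concerns that DistCA addresses.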
