Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Abstract

Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens (k) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-k routing mechanism. Since k is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the k tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPs and training wall-clock time, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
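The capped top-k routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (top_k_indices, routed_layer, toy_block), the use of plain Python lists, and the residual-style update for selected tokens are all assumptions made for illustration. The point it shows is that the block always receives exactly k tokens, so its input size is known in advance, while which tokens are chosen depends on the router scores.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k highest scores, in sequence order."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and the sequence length")
    # Rank positions by descending score; ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


def routed_layer(
    tokens: Sequence[Vector],
    scores: Sequence[float],
    k: int,
    block: Callable[[List[Vector]], List[Vector]],
) -> List[Vector]:
    """Apply `block` to the k top-scoring tokens; pass the rest through."""
    chosen = top_k_indices(scores, k)
    # The block always sees exactly k tokens, so its input size is fixed.
    updates = block([list(tokens[i]) for i in chosen])
    out = [list(t) for t in tokens]
    for i, update in zip(chosen, updates):
        # Assumption: selected tokens receive a residual update x + block(x).
        out[i] = [x + u for x, u in zip(tokens[i], update)]
    return out


def toy_block(selected: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for self-attention + MLP: doubles each value."""
    return [[2.0 * x for x in vec] for vec in selected]


tokens = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.1, 0.9, 0.3, 0.8]  # hypothetical router outputs, one per token
result = routed_layer(tokens, scores, k=2, block=toy_block)
print(result)  # [[1.0], [6.0], [3.0], [12.0]]
```

With k=2, only the tokens at positions 1 and 3 are processed by the block; the other two are returned unchanged. Changing the scores changes which tokens are processed, but never how many.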
