Exponentially Faster Language Modelling

Abstract

Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present FastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. FastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.
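The 12-of-4095 figure can be illustrated with a small sketch. It assumes, as an interpretation not spelled out in the abstract, that the neurons of an FFF layer are arranged as a balanced binary tree of depth 11 (2^12 - 1 = 4095 nodes), and that each inference follows a single root-to-leaf path chosen by the sign of each visited neuron's pre-activation. The function name `fff_forward`, the ReLU activation, the heap-order storage, and the toy dimension of 8 are all hypothetical choices for illustration; this is not the authors' released implementation.

```python
import random


def fff_forward(x, w_in, w_out, depth):
    """Evaluate a tree-structured layer along a single root-to-leaf path.

    Neurons are stored in heap order (node i has children 2i+1 and 2i+2),
    so only depth + 1 of the 2**(depth + 1) - 1 neurons are ever touched.
    """
    out = [0.0] * len(w_out[0])
    visited = []
    node = 0
    for _ in range(depth + 1):
        visited.append(node)
        # Pre-activation of the current neuron.
        pre = sum(xi * wi for xi, wi in zip(x, w_in[node]))
        # ReLU is an assumption made for this sketch.
        act = max(pre, 0.0)
        # Accumulate this neuron's contribution to the layer output.
        for j, wo in enumerate(w_out[node]):
            out[j] += act * wo
        # Branch on the sign of the pre-activation: right if positive, else left.
        node = 2 * node + (2 if pre > 0 else 1)
    return out, visited


depth = 11
n_neurons = 2 ** (depth + 1) - 1  # 4095 neurons in the full tree
dim = 8  # toy hidden size, chosen only to keep the example small

rng = random.Random(0)
w_in = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_neurons)]
w_out = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_neurons)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

y, visited = fff_forward(x, w_in, w_out, depth)
fraction = len(visited) / n_neurons
print(f"neurons used: {len(visited)} of {n_neurons} ({fraction:.2%})")
```

Under these assumptions, each inference touches 12 of the 4095 neurons, which is roughly the 0.3% quoted above. The number of neurons used grows linearly with tree depth while the total neuron count grows exponentially.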
