Memory Augmented Language Models through Mixture of Word Experts

Abstract

Scaling up the number of parameters in language models has proven to be an effective way to improve performance. For dense models, increasing model size proportionally increases the model's computational footprint. In this work, we seek to aggressively decouple learning capacity from FLOPs through Mixture-of-Experts (MoE) style models with large, knowledge-rich, vocabulary-based routing functions and experts. Our proposed approach, dubbed Mixture of Word Experts (MoWE), can be seen as a memory-augmented model in which a large set of word-specific experts plays the role of a sparse memory. We demonstrate that MoWE performs significantly better than the T5 family of models with a similar number of FLOPs on a variety of NLP tasks. Additionally, MoWE outperforms regular MoE models on knowledge-intensive tasks and performs similarly to more complex memory-augmented approaches, which often require invoking custom mechanisms to search the sparse memory.
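The idea of vocabulary-based routing can be illustrated with a minimal sketch. This is not the authors' implementation: the modulo routing table, the layer sizes, and every function name here (`make_expert`, `word_expert_layer`, and so on) are hypothetical choices made only for illustration. The sketch assumes that each token id is mapped to exactly one expert by a fixed lookup table, so the compute per token stays constant while the total parameter count grows with the number of experts.

```python
import random


def make_expert(d_model, d_hidden, rng):
    """Create one tiny feed-forward expert as two weight matrices."""
    w_in = [[rng.uniform(-0.1, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out


def matvec(x, w):
    """Compute x @ w for a vector x and a row-major matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def expert_forward(x, expert):
    """Apply a two-layer feed-forward block with a ReLU in between."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(x, w_in)]
    return matvec(hidden, w_out)


def word_expert_layer(token_ids, hidden_states, routing_table, experts):
    """Send each token to the single expert selected by its token id."""
    outputs = []
    for tok, x in zip(token_ids, hidden_states):
        expert_id = routing_table[tok]  # fixed lookup, no learned router
        outputs.append(expert_forward(x, experts[expert_id]))
    return outputs


rng = random.Random(0)
VOCAB_SIZE, NUM_EXPERTS, D_MODEL, D_HIDDEN = 50, 10, 4, 8

# Hypothetical routing rule (an assumption): token id modulo number of experts.
routing_table = [tok % NUM_EXPERTS for tok in range(VOCAB_SIZE)]
experts = [make_expert(D_MODEL, D_HIDDEN, rng) for _ in range(NUM_EXPERTS)]

token_ids = [3, 13, 7]
hidden_states = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in token_ids]
outputs = word_expert_layer(token_ids, hidden_states, routing_table, experts)
```

In this toy setup, adding more experts enlarges the parameter pool without changing the work done per token, since each token still passes through only one expert. A real system would presumably build the routing table from knowledge of the vocabulary instead of a modulo rule.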
