Memory Augmented Language Models through Mixture of Word Experts
November 15, 2023
Authors: Cicero Nogueira dos Santos, James Lee-Thorp, Isaac Noble, Chung-Ching Chang, David Uthus
cs.AI
Abstract
Scaling up the number of parameters of language models has proven to be an
effective approach to improve performance. For dense models, increasing model
size proportionally increases the model's computation footprint. In this work,
we seek to aggressively decouple learning capacity and FLOPs through
Mixture-of-Experts (MoE) style models with large, knowledge-rich, vocabulary-based
routing functions and experts. Our proposed approach, dubbed Mixture of
Word Experts (MoWE), can be seen as a memory augmented model, where a large set
of word-specific experts play the role of a sparse memory. We demonstrate that
MoWE performs significantly better than the T5 family of models with a similar
number of FLOPs on a variety of NLP tasks. Additionally, MoWE outperforms
regular MoE models on knowledge-intensive tasks and performs similarly to
more complex memory augmented approaches that often require invoking custom
mechanisms to search the sparse memory.
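To make the routing idea concrete, below is a minimal sketch (in PyTorch, with illustrative names such as WordRoutedFFN, word_to_expert, and routing_ids; not the authors' implementation) of an MoE-style feed-forward layer whose router is a fixed word-to-expert lookup rather than a learned gating network, so each expert behaves like a sparse memory slot tied to specific words. The hash-based word-to-expert assignment here is a placeholder assumption; the paper pairs word-specific experts with a large, knowledge-rich routing vocabulary.

```python
# Minimal sketch: an MoE-style feed-forward layer where the routing decision is
# fixed by the token's id in a routing vocabulary, so each expert is tied to a
# subset of words and acts like a sparse, word-addressed memory.
# All names and the hash assignment are illustrative assumptions.
import torch
import torch.nn as nn


class WordRoutedFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, vocab_size: int, num_experts: int):
        super().__init__()
        # Each expert is an ordinary two-layer feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # Fixed (non-learned) word-to-expert assignment; here a simple modular
        # hash of the token id stands in for the paper's routing vocabulary.
        self.register_buffer("word_to_expert", torch.arange(vocab_size) % num_experts)

    def forward(self, hidden: torch.Tensor, routing_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); routing_ids: (batch, seq) token ids.
        flat_h = hidden.reshape(-1, hidden.size(-1))
        flat_ids = routing_ids.reshape(-1)
        expert_ids = self.word_to_expert[flat_ids]
        out = torch.zeros_like(flat_h)
        # Dispatch each token's hidden state to the expert owned by its word id.
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():
                out[mask] = expert(flat_h[mask])
        return out.reshape_as(hidden)


# Usage: only the experts selected by the token ids are evaluated, so learning
# capacity (number of experts) grows without a proportional increase in FLOPs.
layer = WordRoutedFFN(d_model=64, d_ff=128, vocab_size=32000, num_experts=8)
h = torch.randn(2, 5, 64)
ids = torch.randint(0, 32000, (2, 5))
y = layer(h, ids)
print(y.shape)  # torch.Size([2, 5, 64])
```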