Memory Augmented Language Models through Mixture of Word Experts
November 15, 2023
Authors: Cicero Nogueira dos Santos, James Lee-Thorp, Isaac Noble, Chung-Ching Chang, David Uthus
cs.AI
Abstract
Scaling up the number of parameters of language models has proven to be an
effective approach to improving performance. For dense models, increasing model
size proportionally increases the model's computation footprint. In this work,
we seek to aggressively decouple learning capacity and FLOPs through
Mixture-of-Experts (MoE) style models whose routing functions and experts are
based on a large, knowledge-rich vocabulary. Our proposed approach, dubbed Mixture of
Word Experts (MoWE), can be seen as a memory-augmented model, where a large set
of word-specific experts plays the role of a sparse memory. We demonstrate that
MoWE performs significantly better than the T5 family of models with a similar
number of FLOPs on a variety of NLP tasks. Additionally, MoWE outperforms
regular MoE models on knowledge-intensive tasks and performs similarly to
more complex memory-augmented approaches that often require invoking custom
mechanisms to search the sparse memory.
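
To make the idea of vocabulary-based routing concrete, the following is a minimal sketch and not the authors' implementation. It assumes a fixed lookup table (`routing_vocab`, a hypothetical name) that maps each word to one small feed-forward expert, plus a shared fallback expert for words outside that table. The class name, layer sizes, and initialization are all illustrative assumptions; the abstract does not specify them.

```python
import random


def make_expert(d_model, d_hidden, rng):
    """Create one small feed-forward expert as two weight matrices."""
    w_in = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w_in, w_out


def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def expert_forward(x, expert):
    """Apply one expert: linear, ReLU, linear."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(x, w_in)]
    return matvec(hidden, w_out)


class WordExpertLayer:
    """Hypothetical layer: each token is sent to the expert tied to its word."""

    def __init__(self, routing_vocab, d_model=4, d_hidden=8, seed=0):
        rng = random.Random(seed)
        # Assumption: routing is a fixed word -> expert-index lookup.
        self.routing_vocab = routing_vocab
        n_word_experts = max(routing_vocab.values()) + 1
        # Assumption: one extra shared expert handles out-of-vocabulary words.
        self.fallback = n_word_experts
        self.experts = [
            make_expert(d_model, d_hidden, rng) for _ in range(n_word_experts + 1)
        ]

    def route(self, word):
        return self.routing_vocab.get(word, self.fallback)

    def forward(self, words, hidden_states):
        # Only one expert runs per token, so per-token compute does not
        # grow with the total number of experts.
        return [
            expert_forward(x, self.experts[self.route(word)])
            for word, x in zip(words, hidden_states)
        ]


layer = WordExpertLayer({"paris": 0, "einstein": 1})
words = ["paris", "is", "paris"]
states = [[1.0, 0.0, 0.0, 0.0] for _ in words]
out = layer.forward(words, states)
```

In this sketch, adding more entries to `routing_vocab` increases the parameter count (more experts) while each token still passes through a single expert, which is one way to picture the decoupling of learning capacity from FLOPs that the abstract describes.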