Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
March 12, 2024
Authors: Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li
cs.AI
Abstract
We investigate efficient methods for training Large Language Models (LLMs) to
possess capabilities in multiple specialized domains, such as coding, math
reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts
from a seed model, which is branched to train experts in embarrassingly
parallel fashion with high throughput and reduced communication cost. After
individual experts are asynchronously trained, BTX brings together their
feedforward parameters as experts in Mixture-of-Experts (MoE) layers and
averages the remaining parameters, followed by an MoE-finetuning stage to learn
token-level routing. BTX generalizes two special cases, the Branch-Train-Merge
method, which does not have the MoE finetuning stage to learn routing, and
sparse upcycling, which omits the stage of training experts asynchronously.
Compared to alternative approaches, BTX achieves the best accuracy-efficiency
tradeoff.
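
The merge step described in the abstract (feedforward parameters kept as separate experts, all remaining parameters averaged) can be illustrated with a minimal sketch. This is not the authors' implementation: the flat dict-of-lists checkpoint layout, the names `btx_merge`, `top_k_route`, and `is_feedforward`, the parameter names `attn.w` and `ffn.w`, and the choice of top-k routing with a softmax over the selected logits are all assumptions made for illustration. The abstract only states that routing is learned at the token level during MoE finetuning.

```python
import math


def btx_merge(experts, is_feedforward):
    """Merge expert checkpoints along the lines the abstract describes.

    experts: list of dicts mapping parameter name -> list of floats
             (a hypothetical flat layout, assumed for this sketch).
    is_feedforward: predicate on the parameter name (also an assumption).
    """
    merged = {}
    for name in experts[0]:
        tensors = [e[name] for e in experts]
        if is_feedforward(name):
            # Feedforward weights stay separate: one copy per expert.
            merged[name] = [list(t) for t in tensors]
        else:
            # All remaining weights are averaged element-wise.
            merged[name] = [sum(vals) / len(vals) for vals in zip(*tensors)]
    return merged


def top_k_route(router_logits, k=2):
    """Select the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs; weights are a softmax over the
    selected logits only. Top-k routing is an assumption of this sketch.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Two toy "experts" branched from the same seed model (made-up numbers).
experts = [
    {"attn.w": [1.0, 2.0], "ffn.w": [0.1, 0.2]},
    {"attn.w": [3.0, 4.0], "ffn.w": [0.3, 0.4]},
]
merged = btx_merge(experts, lambda name: name.startswith("ffn"))
route = top_k_route([0.0, 0.0, -5.0], k=2)
```

In this toy setup the attention weights collapse into a single averaged tensor while the feedforward weights remain as a list with one entry per expert. In the method itself, the router that produces the logits would be trained during the MoE-finetuning stage; here the logits are simply passed in by hand.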