
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

March 12, 2024
作者: Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li
cs.AI

Abstract

We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Expert (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
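To make the merge step concrete, below is a minimal numpy sketch of the BTX idea: each branched expert contributes its feedforward weights to a Mixture-of-Experts layer, the remaining (e.g. attention) parameters are averaged, and a router performs token-level top-k gating. All dimensions, weight initializations, and the `moe_ffn` helper are hypothetical toy choices for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
d_model, d_ff, n_experts, top_k = 8, 16, 3, 2

# Each "expert" stands in for the feedforward weights of one branched
# copy of the seed model after asynchronous domain training.
expert_ffns = [
    {"w_in": rng.normal(size=(d_model, d_ff)),
     "w_out": rng.normal(size=(d_ff, d_model))}
    for _ in range(n_experts)
]

# Non-feedforward parameters (here a stand-in attention matrix) from
# each branch are averaged back into a single shared set, as in BTX.
attn_weights = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
merged_attn = np.mean(attn_weights, axis=0)

# Router weights: in BTX these are learned during the MoE-finetuning
# stage; here they are randomly initialized for the sketch.
router = rng.normal(size=(d_model, n_experts))

def moe_ffn(x):
    """Token-level top-k routing over the merged feedforward experts."""
    logits = x @ router                         # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        gate = np.exp(sel - sel.max())
        gate /= gate.sum()                      # softmax over selected experts
        for g, e in zip(gate, top[t]):
            h = np.maximum(x[t] @ expert_ffns[e]["w_in"], 0.0)  # ReLU FFN
            out[t] += g * (h @ expert_ffns[e]["w_out"])
    return out

tokens = rng.normal(size=(4, d_model))
y = moe_ffn(tokens)
print(y.shape)  # (4, 8)
```

Setting `top_k = n_experts` with a uniform gate would recover plain parameter sharing, while skipping the router-learning stage corresponds to the Branch-Train-Merge special case the abstract mentions.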

