Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
July 2, 2024
Authors: Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu
cs.AI
Abstract
Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large
Language Models (LLMs) with constrained resources. Although there have been
various PEFT methods for dense-architecture LLMs, PEFT for sparse-architecture
LLMs is still underexplored. In this work, we study the PEFT method for LLMs
with the Mixture-of-Experts (MoE) architecture, and the contributions of this work
are mainly threefold: (1) We investigate the dispersion degree of the activated
experts in customized tasks and find that the routing distribution for a
specific task tends to be highly concentrated, while the distribution of
activated experts varies significantly across different tasks. (2) We propose
Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant
to downstream tasks while freezing the other experts and modules; experimental
results demonstrate that our method not only improves the tuning efficiency,
but also matches or even surpasses the performance of full-parameter
fine-tuning. (3) We further analyze the impact of the MoE architecture on
expert-specialized fine-tuning. We find that MoE models with finer-grained
experts are more advantageous in selecting the combination of experts that are
most relevant to downstream tasks, thereby enhancing both the training
efficiency and effectiveness.
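
The abstract describes the ESFT procedure only at a high level. Below is a minimal, hypothetical PyTorch-style sketch of that general idea: score each expert's relevance on a small sample of downstream-task data, then freeze all parameters except the most relevant experts in each MoE layer. The model interface (moe_layers, gate, experts, embed) and the choice of average routing weight as the relevance score are illustrative assumptions for this sketch, not the paper's exact criteria or any specific library's API.

```python
# Hypothetical sketch of expert-specialized fine-tuning (ESFT)-style freezing.
# Assumes a model exposing `model.embed`, `model.moe_layers`, and, per layer,
# a router `layer.gate` plus a list `layer.experts` of nn.Module instances.
import torch


@torch.no_grad()
def expert_relevance(model, dataloader, num_batches=8):
    """Average routing weight each expert receives on sampled task data."""
    scores = [torch.zeros(len(layer.experts)) for layer in model.moe_layers]
    processed = 0
    for batch in dataloader:
        if processed >= num_batches:
            break
        hidden = model.embed(batch["input_ids"])  # hypothetical embedding helper
        for i, layer in enumerate(model.moe_layers):
            gate_probs = layer.gate(hidden)       # assumed shape: [..., num_experts]
            flat = gate_probs.reshape(-1, gate_probs.shape[-1])
            scores[i] += flat.mean(dim=0).cpu()   # accumulate per-expert routing mass
            hidden = layer(hidden)
        processed += 1
    return [s / max(processed, 1) for s in scores]


def apply_esft_freezing(model, scores, top_k=2):
    """Unfreeze only the top_k most relevant experts per MoE layer; freeze the rest."""
    for p in model.parameters():
        p.requires_grad = False                   # freeze all experts and shared modules
    for layer, layer_scores in zip(model.moe_layers, scores):
        for idx in layer_scores.topk(top_k).indices.tolist():
            for p in layer.experts[idx].parameters():
                p.requires_grad = True            # tune only task-relevant experts
```

In this framing, top_k controls how many experts per layer are tuned; the abstract's observation that finer-grained experts allow a more precise selection corresponds to having more, smaller experts to choose from at a fixed parameter budget.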