
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

July 2, 2024
Authors: Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu
cs.AI

Abstract

Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although there have been various PEFT methods for dense-architecture LLMs, PEFT for sparse-architecture LLMs is still underexplored. In this work, we study PEFT methods for LLMs with the Mixture-of-Experts (MoE) architecture, and our contributions are mainly threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks, and find that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves tuning efficiency but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning. We find that MoE models with finer-grained experts are more advantageous in selecting the combination of experts most relevant to downstream tasks, thereby enhancing both training efficiency and effectiveness.
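
To make the ESFT idea concrete, below is a minimal PyTorch sketch of the two steps the abstract describes: estimate how relevant each expert is to a downstream task from the routing distribution, then freeze everything except the selected experts before fine-tuning. The toy MoE layer, the mean-gate-probability relevance score, and the 90% cumulative-coverage threshold are illustrative assumptions for this sketch, not the paper's exact criteria or code.

```python
# A minimal, self-contained sketch of the "select relevant experts, freeze the
# rest" idea on a toy top-k gated MoE layer. Module structure, the mean-gate-
# probability relevance score, and the 90% coverage threshold are illustrative
# assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """A tiny top-k gated MoE feed-forward layer (stand-in for a real MoE LLM block)."""

    def __init__(self, d_model: int = 32, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 64), nn.GELU(), nn.Linear(64, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def routing_probs(self, x: torch.Tensor) -> torch.Tensor:
        # Per-token softmax gate scores, shape (n_tokens, n_experts).
        return torch.softmax(self.gate(x), dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = self.routing_probs(x)
        top_p, top_i = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Dense loop over experts for clarity; real implementations dispatch tokens.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e
                if mask.any():
                    out[mask] += top_p[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


def select_relevant_experts(layer: ToyMoELayer, task_tokens: torch.Tensor,
                            coverage: float = 0.9) -> list[int]:
    """Rank experts by mean gate probability on task data and keep the smallest
    set whose cumulative score reaches `coverage` (an assumed selection rule)."""
    with torch.no_grad():
        mean_probs = layer.routing_probs(task_tokens).mean(dim=0)  # (n_experts,)
    order = torch.argsort(mean_probs, descending=True)
    chosen, cum = [], 0.0
    for e in order.tolist():
        chosen.append(e)
        cum += mean_probs[e].item()
        if cum >= coverage:
            break
    return chosen


def freeze_all_but_experts(layer: ToyMoELayer, expert_ids: list[int]) -> None:
    """Freeze every parameter (router included), then unfreeze only the chosen experts."""
    for p in layer.parameters():
        p.requires_grad = False
    for e in expert_ids:
        for p in layer.experts[e].parameters():
            p.requires_grad = True


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = ToyMoELayer()
    task_tokens = torch.randn(256, 32)  # hidden states from downstream-task samples
    chosen = select_relevant_experts(layer, task_tokens)
    freeze_all_but_experts(layer, chosen)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"experts kept: {chosen} | trainable params: {trainable}/{total}")
```

In a full MoE LLM this selection would presumably be run per layer on a sample of downstream-task tokens, with the shared modules (attention, embeddings, routers) kept frozen, which is where the efficiency gains over full-parameter fine-tuning come from.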

