Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
July 2, 2024
Authors: Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu
cs.AI
Abstract
Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large
Language Models (LLMs) with constrained resources. Although there have been
various PEFT methods for dense-architecture LLMs, PEFT for sparse-architecture
LLMs is still underexplored. In this work, we study the PEFT method for LLMs
with the Mixture-of-Experts (MoE) architecture, and the contributions of this work
are mainly threefold: (1) We investigate the dispersion degree of the activated
experts in customized tasks and find that the routing distribution for a
specific task tends to be highly concentrated, while the distribution of
activated experts varies significantly across different tasks. (2) We propose
Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant
to downstream tasks while freezing the other experts and modules; experimental
results demonstrate that our method not only improves the tuning efficiency,
but also matches or even surpasses the performance of full-parameter
fine-tuning. (3) We further analyze the impact of the MoE architecture on
expert-specialized fine-tuning. We find that MoE models with finer-grained
experts are more advantageous in selecting the combination of experts that are
most relevant to downstream tasks, thereby enhancing both the training
efficiency and effectiveness.
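
The abstract describes the ESFT procedure only at a high level. Below is a minimal, hypothetical PyTorch-style sketch of that general idea: score each expert's relevance on a small sample of downstream-task data, then freeze all parameters except the most relevant experts in each MoE layer. The model interface (moe_layers, gate, experts, embed) and the choice of average routing weight as the relevance score are illustrative assumptions for this sketch, not the paper's exact criteria or any specific library's API.

```python
# Hypothetical sketch of expert-specialized fine-tuning (ESFT)-style freezing.
# Assumes a model exposing `model.embed`, `model.moe_layers`, and, per layer,
# a router `layer.gate` plus a list `layer.experts` of nn.Module instances.
import torch


@torch.no_grad()
def expert_relevance(model, dataloader, num_batches=8):
    """Average routing weight each expert receives on sampled task data."""
    scores = [torch.zeros(len(layer.experts)) for layer in model.moe_layers]
    processed = 0
    for batch in dataloader:
        if processed >= num_batches:
            break
        hidden = model.embed(batch["input_ids"])  # hypothetical embedding helper
        for i, layer in enumerate(model.moe_layers):
            gate_probs = layer.gate(hidden)       # assumed shape: [..., num_experts]
            flat = gate_probs.reshape(-1, gate_probs.shape[-1])
            scores[i] += flat.mean(dim=0).cpu()   # accumulate per-expert routing mass
            hidden = layer(hidden)
        processed += 1
    return [s / max(processed, 1) for s in scores]


def apply_esft_freezing(model, scores, top_k=2):
    """Unfreeze only the top_k most relevant experts per MoE layer; freeze the rest."""
    for p in model.parameters():
        p.requires_grad = False                   # freeze all experts and shared modules
    for layer, layer_scores in zip(model.moe_layers, scores):
        for idx in layer_scores.topk(top_k).indices.tolist():
            for p in layer.experts[idx].parameters():
                p.requires_grad = True            # tune only task-relevant experts
```

In this framing, top_k controls how many experts per layer are tuned; the abstract's observation that finer-grained experts allow a more precise selection corresponds to having more, smaller experts to choose from at a fixed parameter budget.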