Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

Abstract

Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although various PEFT methods exist for dense-architecture LLMs, PEFT for sparse-architecture LLMs remains underexplored. In this work, we study PEFT for LLMs with the Mixture-of-Experts (MoE) architecture. Our contributions are threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks and find that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across tasks. (2) We propose Expert-Specialized Fine-Tuning (ESFT), which tunes the experts most relevant to a downstream task while freezing the other experts and modules. Experimental results demonstrate that our method not only improves tuning efficiency but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning. We find that MoE models with finer-grained experts are better suited to selecting the combination of experts most relevant to a downstream task, which improves both training efficiency and effectiveness.
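The expert-selection idea in contribution (2) can be illustrated with a minimal sketch. The abstract does not state the actual relevance criterion, so the one used here is an assumption: experts are ranked by how often the router picks them on task data, and the smallest set that covers a chosen share of routing decisions is kept trainable. The function names, the input format, and the `coverage` threshold are all hypothetical.

```python
from collections import Counter


def select_task_experts(routed_expert_ids, coverage=0.9):
    """Pick the most frequently routed experts for a task.

    routed_expert_ids: iterable of per-token lists of expert indices chosen
    by the router on task samples (hypothetical input format).
    coverage: share of routing decisions the selected experts must account
    for (assumed criterion, not taken from the paper).
    """
    counts = Counter()
    for token_experts in routed_expert_ids:
        counts.update(token_experts)
    total = sum(counts.values())
    if total == 0:
        return []
    selected, covered = [], 0
    # Visit experts from most to least frequently routed.
    for expert_id, n in counts.most_common():
        selected.append(expert_id)
        covered += n
        if covered >= coverage * total:
            break
    return sorted(selected)


def trainable_mask(selected, num_experts):
    """True for experts to fine-tune, False for experts to keep frozen."""
    chosen = set(selected)
    return [i in chosen for i in range(num_experts)]


# Toy routing trace: 4 tokens, top-2 routing over 4 experts.
routing = [[0, 1], [0, 1], [0, 2], [0, 1]]
experts = select_task_experts(routing, coverage=0.8)
mask = trainable_mask(experts, num_experts=4)
```

In a real training setup, a mask like this would decide which expert parameters receive gradient updates, with all other experts and modules left frozen. The more concentrated the routing distribution is for a task, the fewer experts such a rule selects.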
