Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

July 2, 2024
Authors: Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu
cs.AI

Abstract

Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although various PEFT methods exist for dense-architecture LLMs, PEFT for sparse-architecture LLMs remains underexplored. In this work, we study PEFT for LLMs with the Mixture-of-Experts (MoE) architecture, and the contents of this work are mainly threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks and find that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves tuning efficiency but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning. We find that MoE models with finer-grained experts are more advantageous in selecting the combination of experts that are most relevant to downstream tasks, thereby enhancing both training efficiency and effectiveness.
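The abstract's description of ESFT suggests a simple two-step recipe: measure which experts a downstream task actually routes to, then train only those experts while freezing everything else. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' released implementation: it assumes gate probabilities (`routing_weights`) have been collected by running a sample of the task through the frozen model, that experts are ranked by their average routing share with a cumulative-coverage threshold, and that expert parameters follow an `experts.<id>.` naming convention. All function names and the threshold value are assumptions.

```python
# Hypothetical sketch of ESFT-style expert selection and freezing.
# Not the paper's implementation; names and conventions are assumptions.
import torch


def select_relevant_experts(routing_weights: torch.Tensor, threshold: float = 0.9) -> list[int]:
    """Rank experts by average routing weight over sampled task tokens and
    keep the smallest set whose cumulative share reaches `threshold`.

    routing_weights: [num_tokens, num_experts] gate probabilities collected
    by running a sample of the downstream task through the frozen model.
    """
    avg = routing_weights.mean(dim=0)          # per-expert mean gate weight
    share = avg / avg.sum()                    # normalize to a distribution
    order = torch.argsort(share, descending=True)
    chosen, cumulative = [], 0.0
    for idx in order.tolist():
        chosen.append(idx)
        cumulative += share[idx].item()
        if cumulative >= threshold:
            break
    return chosen


def freeze_all_but_experts(model: torch.nn.Module, expert_ids: list[int]) -> None:
    """Freeze every parameter except those of the selected experts, assuming
    expert parameters are named like '...experts.<id>....' (an assumption)."""
    keep = tuple(f"experts.{i}." for i in expert_ids)
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in keep)


if __name__ == "__main__":
    # Toy example: 6 experts, routing concentrated on experts 2 and 5,
    # mimicking the paper's observation that task routing is concentrated.
    torch.manual_seed(0)
    bias = torch.tensor([0.0, 0.0, 3.0, 0.0, 0.0, 2.5])
    gates = torch.softmax(torch.randn(1024, 6) + bias, dim=-1)
    print(select_relevant_experts(gates, threshold=0.9))  # e.g. [2, 5]
```

With the non-selected experts and all shared modules frozen, only a small fraction of parameters receives gradients, which is the source of the efficiency gains the abstract reports.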

