Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

Abstract

Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although various PEFT methods exist for dense-architecture LLMs, PEFT for sparse-architecture LLMs remains underexplored. In this work, we study PEFT for LLMs with the Mixture-of-Experts (MoE) architecture, and our contributions are threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks and find that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across tasks. (2) We propose Expert-Specialized Fine-Tuning (ESFT), which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results show that our method not only improves tuning efficiency but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning and find that MoE models with finer-grained experts are better able to select the combination of experts most relevant to downstream tasks, which enhances both training efficiency and effectiveness.
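The abstract does not say how expert relevance is measured, so the following is only a minimal sketch of the select-then-freeze idea. It assumes that relevance is the share of task tokens routed to each expert and that a coverage threshold decides how many experts to tune. The function names, the `threshold` parameter, and the routing-trace format are all hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter


def select_relevant_experts(routing_trace, threshold=0.8):
    """Return the smallest set of experts that covers `threshold` of all
    routing decisions observed on task data (assumed relevance criterion).

    routing_trace: iterable of per-token lists of activated expert ids.
    """
    counts = Counter()
    for expert_ids in routing_trace:
        counts.update(expert_ids)

    total = sum(counts.values())
    if total == 0:
        return []

    selected, covered = [], 0
    # Visit experts from most to least frequently activated.
    for expert_id, count in counts.most_common():
        selected.append(expert_id)
        covered += count
        if covered >= threshold * total:
            break
    return sorted(selected)


def trainable_mask(selected, num_experts):
    """True for experts to be tuned, False for experts to be kept frozen."""
    chosen = set(selected)
    return [i in chosen for i in range(num_experts)]


# Toy routing trace: four tokens, each routed to two of four experts.
trace = [[0, 1], [0, 1], [0, 2], [0, 1]]
experts = select_relevant_experts(trace, threshold=0.8)
mask = trainable_mask(experts, num_experts=4)
```

In a real training loop, a mask like this would decide which expert parameters receive gradient updates, while all remaining experts and modules stay frozen. A highly concentrated routing distribution, as the abstract reports for individual tasks, means a small selected set can already cover most routing decisions.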
