SiRA: Sparse Mixture of Low Rank Adaptation

Abstract

Parameter-efficient tuning has become a prominent approach for adapting large language models to downstream tasks. Most previous work adds dense trainable parameters, where all parameters are used to adapt to a given task. Using LoRA as an example, we found empirically that this is less effective: introducing more trainable parameters does not help. Motivated by this, we investigate the importance of leveraging "sparse" computation and propose SiRA: sparse mixture of low rank adaptation. SiRA leverages the Sparse Mixture of Experts (SMoE) to boost the performance of LoRA. Specifically, it enforces top-k expert routing with a capacity limit that restricts the maximum number of tokens each expert can process. We also propose a novel and simple expert dropout on top of the gating network to reduce over-fitting. Through extensive experiments, we verify that SiRA performs better than LoRA and other mixture-of-experts approaches across different single-task and multitask settings.
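As a rough illustration of the routing idea in the abstract, the following pure-Python sketch combines top-k expert selection, a per-expert capacity limit, and dropout applied to the gate values. It is not the authors' implementation. The function name `route_tokens`, its parameters, the form of dropout (zeroing individual gate probabilities), the first-come-first-served capacity policy, and the final renormalization are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(gate_logits, top_k, capacity, dropout_p=0.0, rng=None):
    """Route each token to at most `top_k` experts under a capacity limit.

    gate_logits: list of per-token lists, one gating logit per expert.
    capacity:    maximum number of tokens any single expert may accept.
    dropout_p:   probability of zeroing each gate value (assumed form of
                 "expert dropout"; the source does not define it here).
    Returns one {expert_index: weight} dict per token. The dict is empty
    when every preferred expert was dropped or already full.
    """
    rng = rng or random.Random(0)
    num_experts = len(gate_logits[0])
    load = [0] * num_experts  # tokens accepted so far by each expert
    assignments = []
    for logits in gate_logits:
        gates = softmax(logits)
        if dropout_p > 0.0:
            # Assumption: drop each gate independently with prob. dropout_p.
            gates = [0.0 if rng.random() < dropout_p else g for g in gates]
        # Rank experts by gate value and consider only the k best.
        ranked = sorted(range(num_experts), key=lambda e: gates[e], reverse=True)
        chosen = {}
        for e in ranked[:top_k]:
            # Assumption: tokens are served in order; a full expert skips them.
            if gates[e] > 0.0 and load[e] < capacity:
                chosen[e] = gates[e]
                load[e] += 1
        # Assumption: renormalize surviving gate weights to sum to 1.
        total = sum(chosen.values())
        assignments.append({e: w / total for e, w in chosen.items()} if total > 0 else {})
    return assignments
```

For instance, calling `route_tokens([[2.0, 1.0, 0.0]] * 3, top_k=2, capacity=2)` routes the first two tokens to experts 0 and 1 and leaves the third token unrouted, because both of its preferred experts are already at capacity. In a LoRA setting, each selected expert would presumably be a separate low-rank adapter whose output is scaled by the returned weight.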
