ReMix: Reinforcement Routing for Mixtures of LoRAs in LLM Finetuning

Abstract

Low-rank adapters (LoRAs) are a parameter-efficient finetuning technique that injects trainable low-rank matrices into pretrained models to adapt them to new tasks. Mixture-of-LoRAs models expand neural networks efficiently by routing each layer input to a small subset of that layer's specialized LoRAs. Existing Mixture-of-LoRAs routers assign a learned routing weight to each LoRA to enable end-to-end training of the router. Despite their empirical promise, we observe that in practice the routing weights are typically extremely imbalanced across LoRAs, with only one or two LoRAs often dominating the routing weights. This limits the number of effective LoRAs and thus severely hinders the expressive power of existing Mixture-of-LoRAs models. In this work, we attribute this weakness to the nature of learnable routing weights and rethink the fundamental design of the router. To address this critical issue, we propose a new router design that we call Reinforcement Routing for Mixture-of-LoRAs (ReMix). Our key idea is to use non-learnable routing weights to ensure that all active LoRAs are equally effective, with no LoRA dominating the routing weights. However, because the routing weights are non-learnable, our routers cannot be trained directly via gradient descent. Hence, we further propose an unbiased gradient estimator for the router by employing the REINFORCE leave-one-out (RLOO) technique, where we regard the supervision loss as the reward and the router as the policy in reinforcement learning. Our gradient estimator also makes it possible to scale up training compute to boost the predictive performance of ReMix. Extensive experiments demonstrate that ReMix significantly outperforms state-of-the-art parameter-efficient finetuning methods under a comparable number of activated parameters.
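To make the router-training idea concrete, the following is a minimal sketch of a REINFORCE leave-one-out gradient estimate for a router that selects k of n LoRAs. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how subsets are sampled, so sequential sampling without replacement from a softmax over router logits is assumed here, and the names `sample_subset`, `rloo_router_gradient`, and `loss_fn` are hypothetical. The selected LoRAs are assumed to be combined with fixed, equal weights inside `loss_fn`, so the only trainable routing quantity is the logits.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax; entries equal to -inf get probability 0.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def sample_subset(logits, k, rng):
    """Draw k distinct LoRA indices one at a time, without replacement.

    Assumed sampling scheme (not taken from the paper). Returns the chosen
    indices and the gradient of the log-probability of this ordered draw
    with respect to the logits.
    """
    n = logits.shape[0]
    available = np.ones(n, dtype=bool)
    grad_logp = np.zeros(n)
    chosen = []
    for _ in range(k):
        p = softmax(np.where(available, logits, -np.inf))
        a = rng.choice(n, p=p)
        # d/dlogits of log p[a] is onehot(a) - p.
        grad_logp[a] += 1.0
        grad_logp -= p
        available[a] = False
        chosen.append(a)
    return np.array(chosen), grad_logp


def rloo_router_gradient(logits, k, loss_fn, num_samples, rng):
    """Estimate the gradient of the expected loss w.r.t. the router logits.

    loss_fn(chosen) is a hypothetical callable returning the supervision
    loss when the LoRAs in `chosen` are active with equal weights.
    Requires num_samples >= 2 so that a leave-one-out baseline exists.
    """
    rewards = np.empty(num_samples)
    grads = np.empty((num_samples, logits.shape[0]))
    for i in range(num_samples):
        chosen, grads[i] = sample_subset(logits, k, rng)
        rewards[i] = -loss_fn(chosen)  # reward is the negative loss
    # Baseline for sample i: mean reward of all the other samples.
    baselines = (rewards.sum() - rewards) / (num_samples - 1)
    advantages = rewards - baselines
    # Negate the reward gradient to get a loss gradient for gradient descent.
    return -(advantages[:, None] * grads).mean(axis=0)
```

As a usage example, calling `rloo_router_gradient(np.zeros(4), 2, loss_fn, 8, np.random.default_rng(0))` returns a length-4 vector that can be subtracted, scaled by a learning rate, from the logits. Drawing more samples per input lowers the variance of the estimate at the cost of more forward passes, which is one way to read the abstract's remark about scaling up training compute.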
