
ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning

March 10, 2026
作者: Ruizhong Qiu, Hanqing Zeng, Yinglong Xia, Yiwen Meng, Ren Chen, Jiarui Feng, Dongqi Fu, Qifan Wang, Jiayi Liu, Jun Xiao, Xiangjun Fan, Benyu Zhang, Hong Li, Zhining Liu, Hyunsik Yoo, Zhichen Zeng, Tianxin Wei, Hanghang Tong
cs.AI

Abstract

Low-rank adapters (LoRAs) are a parameter-efficient finetuning technique that injects trainable low-rank matrices into pretrained models to adapt them to new tasks. Mixture-of-LoRAs models expand neural networks efficiently by routing each layer's input to a small subset of specialized LoRAs in that layer. Existing Mixture-of-LoRAs routers assign a learned routing weight to each LoRA to enable end-to-end training of the router. Despite their empirical promise, we observe that in practice the routing weights are typically extremely imbalanced across LoRAs, with only one or two LoRAs often dominating. This essentially limits the number of effective LoRAs and thus severely hinders the expressive power of existing Mixture-of-LoRAs models. In this work, we attribute this weakness to the nature of learnable routing weights and rethink the fundamental design of the router. To address this critical issue, we propose a new router design that we call Reinforcement Routing for Mixture-of-LoRAs (ReMix). Our key idea is to use non-learnable routing weights so that all active LoRAs are equally effective, with no single LoRA dominating the routing weights. However, because the routing weights are non-learnable, our router cannot be trained directly via gradient descent. Hence, we further propose an unbiased gradient estimator for the router based on the REINFORCE leave-one-out (RLOO) technique, where we regard the supervision loss as the reward and the router as the policy in reinforcement learning. Our gradient estimator also makes it possible to scale up training compute to boost the predictive performance of ReMix. Extensive experiments demonstrate that our proposed ReMix significantly outperforms state-of-the-art parameter-efficient finetuning methods under a comparable number of activated parameters.
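The RLOO idea in the abstract can be sketched in a few lines: sample k routing decisions from the router policy, score each with its reward (e.g., the negative supervision loss), and use the mean reward of the other k-1 samples as each sample's baseline, which reduces variance while keeping the policy-gradient estimate unbiased. The sketch below is a hypothetical minimal illustration, not the paper's code: the single categorical router, the function names, and the numpy implementation are our assumptions.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def rloo_grad(logits, reward_fn, k=8, rng=None):
    """REINFORCE leave-one-out (RLOO) gradient estimate for a toy
    categorical router policy over LoRAs (illustrative sketch only).

    For each of the k sampled routes, the baseline is the mean reward
    of the other k-1 samples (leave-one-out), so the estimator stays
    unbiased while its variance is reduced."""
    if rng is None:
        rng = np.random.default_rng(0)
    probs = softmax(logits)
    # Sample k routing decisions from the router policy.
    actions = rng.choice(len(logits), size=k, p=probs)
    rewards = np.array([reward_fn(a) for a in actions])
    total = rewards.sum()
    grad = np.zeros_like(logits)
    for a, r in zip(actions, rewards):
        baseline = (total - r) / (k - 1)  # leave-one-out mean reward
        # Gradient of log softmax w.r.t. logits: one_hot(a) - probs.
        score = -probs.copy()
        score[a] += 1.0
        grad += (r - baseline) * score
    return grad / k
```

In a Mixture-of-LoRAs layer, the reward would come from the supervision loss of the forward pass under the sampled route; here a toy reward that favors one LoRA suffices to see the estimator push that LoRA's logit up.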
March 13, 2026