ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning
March 10, 2026
Authors: Ruizhong Qiu, Hanqing Zeng, Yinglong Xia, Yiwen Meng, Ren Chen, Jiarui Feng, Dongqi Fu, Qifan Wang, Jiayi Liu, Jun Xiao, Xiangjun Fan, Benyu Zhang, Hong Li, Zhining Liu, Hyunsik Yoo, Zhichen Zeng, Tianxin Wei, Hanghang Tong
cs.AI
Abstract
Low-rank adapters (LoRAs) are a parameter-efficient finetuning technique that injects trainable low-rank matrices into pretrained models to adapt them to new tasks. Mixture-of-LoRAs models expand neural networks efficiently by routing each layer input to a small subset of the layer's specialized LoRAs. Existing Mixture-of-LoRAs routers assign a learned routing weight to each LoRA to enable end-to-end training of the router. Despite their empirical promise, we observe that in practice the routing weights are typically extremely imbalanced across LoRAs, with only one or two LoRAs often dominating. This essentially limits the number of effective LoRAs and thus severely hinders the expressive power of existing Mixture-of-LoRAs models. In this work, we attribute this weakness to the nature of learnable routing weights and rethink the fundamental design of the router. To address this critical issue, we propose a new router design that we call Reinforcement Routing for Mixture-of-LoRAs (ReMix). Our key idea is to use non-learnable routing weights to ensure that all active LoRAs are equally effective, with no single LoRA dominating the routing weights. However, because the routing weights are non-learnable, our router cannot be trained directly via gradient descent. Hence, we further propose an unbiased gradient estimator for the router based on the REINFORCE leave-one-out (RLOO) technique, regarding the supervision loss as the reward and the router as the policy in reinforcement learning. Our gradient estimator also makes it possible to scale up training compute to boost the predictive performance of ReMix. Extensive experiments demonstrate that our proposed ReMix significantly outperforms state-of-the-art parameter-efficient finetuning methods under a comparable number of activated parameters.
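To make the RLOO idea concrete, the following is a minimal sketch (not the paper's implementation; all function names and the toy setup are illustrative) of a REINFORCE leave-one-out gradient estimate for a discrete softmax router: sample k routing decisions, use the negative supervision loss as the reward, and baseline each sample with the mean reward of the other k-1 samples.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def rloo_router_gradient(logits, loss_fn, k=4, rng=random):
    """Hypothetical RLOO estimator for a router treated as a policy.

    logits  -- router scores over the available LoRAs
    loss_fn -- maps a sampled LoRA index to its supervision loss;
               reward = -loss, as in the abstract's framing
    k       -- number of sampled routing decisions per estimate
    Returns an unbiased estimate of the gradient of the expected
    reward with respect to the router logits.
    """
    probs = softmax(logits)
    n = len(logits)
    samples = [rng.choices(range(n), weights=probs)[0] for _ in range(k)]
    rewards = [-loss_fn(a) for a in samples]
    total = sum(rewards)
    grad = [0.0] * n
    for a, r in zip(samples, rewards):
        # leave-one-out baseline: mean reward of the other k-1 samples
        baseline = (total - r) / (k - 1)
        adv = r - baseline
        # score function of a softmax policy:
        # d log p(a) / d logit_j = 1[j == a] - p_j
        for j in range(n):
            grad[j] += adv * ((1.0 if j == a else 0.0) - probs[j]) / k
    return grad
```

Because the baseline is computed only from the *other* samples, it is independent of each sample's own reward, which keeps the estimator unbiased while reducing variance; more samples (larger k) trade extra training compute for a lower-variance estimate, matching the abstract's compute-scaling claim.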