SiRA: Sparse Mixture of Low Rank Adaptation

Abstract

Parameter-efficient tuning has become a prominent approach for adapting large language models to downstream tasks. Most previous work adds dense trainable parameters, all of which are used to adapt to a given task. Using LoRA as an example, we found empirically that this is less effective: introducing more trainable parameters does not help. Motivated by this, we investigate the importance of leveraging "sparse" computation and propose SiRA: sparse mixture of low rank adaptation. SiRA leverages the Sparse Mixture of Experts (SMoE) to boost the performance of LoRA. Specifically, it enforces top-k expert routing with a capacity limit that restricts the maximum number of tokens each expert can process. We propose a novel and simple expert dropout on top of the gating network to reduce over-fitting. Through extensive experiments, we verify that SiRA performs better than LoRA and other mixture-of-experts approaches across different single-task and multitask settings.
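The routing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`route_top_k`, `sira_delta`), the first-come-first-served capacity rule, and the choice to apply expert dropout by zeroing random gate outputs are all assumptions made for illustration. Each expert is assumed to be a LoRA pair `(A, B)` whose update for a token is `x @ A @ B`.

```python
import numpy as np

def route_top_k(gate_logits, k, capacity, drop_prob=0.0, rng=None):
    """Return (num_tokens, num_experts) combine weights.

    Hypothetical sketch: top-k routing with a per-expert capacity limit
    and an assumed form of expert dropout on the gate outputs.
    """
    logits = np.asarray(gate_logits, dtype=float)
    num_tokens, num_experts = logits.shape

    # Softmax over experts for each token (shifted for numerical stability).
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    if drop_prob > 0.0:
        rng = rng if rng is not None else np.random.default_rng()
        # Assumption: expert dropout zeroes random gate entries during training.
        probs = probs * (rng.random(probs.shape) >= drop_prob)

    # Indices of the k highest-scoring experts for each token.
    top_idx = np.argsort(-probs, axis=1)[:, :k]

    weights = np.zeros_like(probs)
    load = np.zeros(num_experts, dtype=int)
    for t in range(num_tokens):
        for e in top_idx[t]:
            # Assumption: tokens are admitted in order until the expert is full;
            # overflow tokens simply receive no update from that expert.
            if probs[t, e] > 0.0 and load[e] < capacity:
                weights[t, e] = probs[t, e]
                load[e] += 1
    return weights

def sira_delta(x, gate_w, experts, k, capacity, drop_prob=0.0, rng=None):
    """Weighted sum of LoRA expert updates for the tokens routed to each expert."""
    weights = route_top_k(x @ gate_w, k, capacity, drop_prob, rng)
    delta = np.zeros((x.shape[0], experts[0][1].shape[1]))
    for e, (a, b) in enumerate(experts):
        routed = weights[:, e] > 0.0
        # Only routed tokens pass through this expert's low-rank matrices.
        delta[routed] += weights[routed, e][:, None] * (x[routed] @ a @ b)
    return delta
```

In this sketch the capacity limit bounds how many tokens any single expert processes, so a token whose chosen experts are all full contributes a zero update and falls back to the frozen base model's output.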
