SiRA: Sparse Mixture of Low Rank Adaptation

Abstract

Parameter-efficient tuning has been a prominent approach for adapting large language models to downstream tasks. Most previous work adds dense trainable parameters, where all parameters are used to adapt to a given task. Using LoRA as an example, we found empirically that this is less effective: introducing more trainable parameters does not help. Motivated by this, we investigate the importance of leveraging "sparse" computation and propose SiRA: sparse mixture of low rank adaptation. SiRA leverages the Sparse Mixture of Experts (SMoE) to boost the performance of LoRA. Specifically, it enforces top-k expert routing with a capacity limit that restricts the maximum number of tokens each expert can process. We propose a novel and simple expert dropout on top of the gating network to reduce over-fitting. Through extensive experiments, we verify that SiRA performs better than LoRA and other mixture-of-experts approaches across different single-task and multitask settings.
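The abstract names the routing ingredients (top-k selection, a per-expert capacity limit, and dropout applied at the gate) but does not give their exact form. The following is only a minimal sketch of how such a router could look, under stated assumptions: the function name `route_tokens`, the `dropout_rate` parameter, the first-come-first-served capacity policy, and the choice to implement dropout by masking gate logits are all hypothetical illustrations, not the authors' implementation. The low-rank (LoRA) experts themselves are omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats (-inf entries get 0)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(gate_logits, k, capacity, dropout_rate=0.0, rng=None):
    """Assign each token to at most k experts under a per-expert capacity.

    gate_logits: list with one entry per token; each entry is a list holding
                 one gate logit per expert.
    Returns one list per token of (expert_index, gate_weight) pairs. A token
    whose chosen experts are all full gets an empty list (it is dropped).
    """
    rng = rng or random.Random(0)
    num_experts = len(gate_logits[0])
    load = [0] * num_experts  # tokens accepted by each expert so far
    assignments = []
    for logits in gate_logits:
        # Assumed form of expert dropout: randomly mask gate logits so the
        # masked experts receive zero routing probability for this token.
        kept = [rng.random() >= dropout_rate for _ in range(num_experts)]
        if not any(kept):
            kept = [True] * num_experts  # never mask every expert
        masked = [x if keep else -math.inf for x, keep in zip(logits, kept)]
        probs = softmax(masked)

        # Top-k experts by gate probability, skipping masked experts.
        order = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)
        top_k = [e for e in order[:k] if probs[e] > 0.0]

        # Capacity limit (assumed policy: first come, first served in token
        # order); an expert that is already full rejects the token.
        chosen = []
        for e in top_k:
            if load[e] < capacity:
                load[e] += 1
                chosen.append((e, probs[e]))
        assignments.append(chosen)
    return assignments
```

For instance, with three identical tokens whose logits are `[2.0, 1.0, 0.0]`, `k=1`, and `capacity=2`, the first two tokens go to expert 0 and the third is dropped because expert 0 is already full. In a full adapter, each accepted `(expert_index, gate_weight)` pair would presumably scale the output of the corresponding low-rank expert before it is added to the frozen layer's output.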
