SiRA: Sparse Mixture of Low Rank Adaptation

Abstract

Parameter Efficient Tuning has been a prominent approach for adapting Large Language Models to downstream tasks. Most previous work adds dense trainable parameters, where all parameters are used to adapt to a given task. Using LoRA (Low Rank Adaptation) as an example, we found empirically that this is less effective: introducing more trainable parameters does not help. Motivated by this, we investigate the importance of leveraging "sparse" computation and propose SiRA: Sparse Mixture of Low Rank Adaptation. SiRA leverages the Sparse Mixture of Experts (SMoE) to boost the performance of LoRA. Specifically, it enforces top-k expert routing with a capacity limit that restricts the maximum number of tokens each expert can process. We also propose a novel and simple expert dropout on top of the gating network to reduce over-fitting. Through extensive experiments, we verify that SiRA performs better than LoRA and other mixture-of-experts approaches across different single-task and multitask settings.
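The routing idea in the abstract (top-k expert selection, a per-expert capacity limit, and expert dropout on the gate) can be illustrated with a small sketch. This is not the paper's implementation: the function names `softmax` and `route_tokens`, the first-come-first-served handling of capacity, and the choice to realize expert dropout by masking gate logits before the softmax are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax that tolerates -inf (masked) entries."""
    finite = [x for x in logits if x != -math.inf]
    if not finite:
        return [0.0] * len(logits)
    m = max(finite)
    exps = [math.exp(x - m) if x != -math.inf else 0.0 for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(gate_logits, k, capacity, dropout_p=0.0, rng=None):
    """Assign each token to at most k experts, with at most `capacity` tokens per expert.

    gate_logits: one list of per-expert scores for each token.
    Returns, for each token, a list of (expert_index, gate_weight) pairs.
    """
    rng = rng or random.Random(0)
    num_experts = len(gate_logits[0])
    load = [0] * num_experts  # tokens accepted by each expert so far
    assignments = []
    for logits in gate_logits:
        # Assumed form of expert dropout: mask each gate logit with probability dropout_p.
        masked = [-math.inf if rng.random() < dropout_p else x for x in logits]
        probs = softmax(masked)
        # Rank experts by gate probability and consider only the k best.
        ranked = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)
        chosen = []
        for e in ranked[:k]:
            # Skip experts that were dropped or have reached their capacity (assumed policy).
            if probs[e] > 0.0 and load[e] < capacity:
                load[e] += 1
                chosen.append((e, probs[e]))
        assignments.append(chosen)
    return assignments


# Three tokens, three experts: expert 0 is every token's favorite.
example_logits = [[2.0, 1.0, 0.1], [1.5, 0.2, 0.3], [3.0, 0.5, 0.4]]
for token_id, route in enumerate(route_tokens(example_logits, k=2, capacity=2)):
    print(token_id, [expert for expert, _ in route])
```

In this toy input, expert 0 reaches its capacity after the first two tokens, so the third token is served only by its second choice. In a full model, each selected expert would presumably be a LoRA module whose output is scaled by its gate weight and summed with the others, but that part is left out here.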
