REAM: Merging Improves Pruning of Experts in LLMs

Abstract

Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reducing memory requirements include weight pruning and quantization. Motivated by Router-weighted Expert Activation Pruning (REAP), which prunes experts, we propose a novel method, Router-weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple-choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade-off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math, and coding data, we examine the Pareto frontier of this trade-off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.
