REAM: Merging Improves Pruning of Experts in LLMs

Abstract

Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reducing memory requirements include weight pruning and quantization. Motivated by Router-weighted Expert Activation Pruning (REAP), which prunes experts, we propose a novel method, Router-weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple-choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade-off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math, and coding data, we examine the Pareto frontier of this trade-off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.

