C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing

Abstract

Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways. Our study reveals that the naive expert selection learned during pretraining leaves a surprising 10-20% accuracy gap for improvement. Motivated by this observation, we develop a novel class of test-time optimization methods that re-weight, or "re-mix", the experts in different layers jointly for each test sample. Since the test sample's ground truth is unknown, we propose to optimize a surrogate objective defined by the sample's "successful neighbors" from a reference set of samples. We introduce three surrogates and algorithms based on mode-finding, kernel regression, and the average loss of similar reference samples/tasks. To reduce the cost of optimizing whole pathways, we apply our algorithms only to the core experts' mixing weights in critical layers, which achieves similar performance while saving significant computation. This leads to "Critical-Layer, Core-Expert, Collaborative Pathway Optimization (C3PO)". We apply C3PO to two recent MoE LLMs and evaluate it on six widely used benchmarks. It consistently improves the base model's accuracy by 7-15% and outperforms widely used test-time learning baselines, e.g., in-context learning and prompt/prefix tuning, by a large margin. Moreover, C3PO enables MoE LLMs with 1-3B active parameters to outperform LLMs of 7-9B parameters, further improving MoE's efficiency advantage. Our thorough ablation study also offers new insights into achieving test-time improvement on MoE.
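The neighbor-based re-mixing idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation, only a kernel-regression-style estimate under stated assumptions. The function name `remix_expert_weights`, the `(embedding, expert_weights, was_correct)` reference format, the Gaussian kernel, and the parameters `k`, `bandwidth`, and `alpha` are all hypothetical choices made for illustration. A single weight vector stands in for the core experts' mixing weights in one critical layer.

```python
import math


def remix_expert_weights(test_emb, test_weights, references,
                         k=3, bandwidth=1.0, alpha=0.5):
    """Blend a test sample's expert mixing weights with those of its
    nearest *successful* reference samples (illustrative sketch only).

    references: list of (embedding, expert_weights, was_correct) tuples.
    All names and defaults here are assumptions, not the paper's API.
    """
    # Keep only references the model answered correctly ("successful neighbors").
    successful = [(emb, w) for emb, w, ok in references if ok]
    if not successful:
        return list(test_weights)  # nothing to learn from; keep original routing

    # Find the k nearest successful references by Euclidean distance.
    scored = sorted(
        ((math.dist(test_emb, emb), w) for emb, w in successful),
        key=lambda pair: pair[0],
    )[:k]

    # Gaussian kernel, shifted by the smallest distance for numerical stability
    # (the shift cancels out after normalization).
    d_min = scored[0][0]
    kernel = [math.exp(-(d ** 2 - d_min ** 2) / (2 * bandwidth ** 2))
              for d, _ in scored]
    total = sum(kernel)

    # Kernel-weighted average of the neighbors' expert weights.
    n_experts = len(test_weights)
    neighbor_mix = [
        sum(kv * w[j] for kv, (_, w) in zip(kernel, scored)) / total
        for j in range(n_experts)
    ]

    # Interpolate between the original routing and the neighbor estimate,
    # then renormalize so the mixing weights sum to one.
    mixed = [(1 - alpha) * t + alpha * m
             for t, m in zip(test_weights, neighbor_mix)]
    norm = sum(mixed)
    return [m / norm for m in mixed]
```

For example, with a reference set of `[([0.0, 0.0], [1.0, 0.0], True), ([10.0, 10.0], [0.0, 1.0], True)]`, a test embedding `[0.0, 0.0]`, original weights `[0.5, 0.5]`, and `k=1`, the sketch shifts weight toward the first expert, which the nearest successful neighbor favored. The interpolation factor `alpha` controls how far the routing moves away from what the pretrained router produced.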
