Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models

Abstract

The complementary potential of large language models (LLMs) rests on the assumption that off-the-shelf LLMs have heterogeneous expertise across a wide range of domains and tasks, so that an ensemble of LLMs can achieve consistently better performance. Existing ensemble methods for LLMs mainly rank the outputs of the candidate models with a reward model, which incurs significant computational overhead because every candidate must generate a response to every query before ranking. To address this issue, we revisit the complementary potential of LLMs and elaborate on it by mining latent expertise with off-the-shelf reward models. We propose Zooter, a reward-guided routing method that distills rewards on training queries into a routing function, which then dispatches each query to the LLM with the relevant expertise. We also integrate a tag-based label enhancement to mitigate the noise that arises from uncertainty when rewards are used as silver supervision. Zooter is computationally efficient at inference: compared with reward model ranking methods, it adds only the minor overhead of running a routing function. We evaluate Zooter on a comprehensive benchmark collection of 26 subsets spanning different domains and tasks. Zooter outperforms the best single model on average and ranks first on 44% of tasks, even surpassing multiple reward model ranking methods.
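
The core mechanism described above (distilling per-query rewards into a routing function, then smoothing those rewards with tags) can be illustrated with a small sketch. Everything concrete here is an assumption for illustration, not Zooter's actual implementation: the softmax temperature `tau`, the tag-mixing weight `beta`, the linear softmax router over hand-made feature vectors, and the toy rewards for two hypothetical candidate LLMs. The router is trained with cross-entropy against the soft reward targets, which equals the KL divergence to those targets up to a constant.

```python
import math
from collections import defaultdict


def softmax(xs, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def reward_targets(rewards, tau=1.0):
    """Turn per-query rewards (one score per candidate LLM) into soft routing targets."""
    return [softmax(r, tau) for r in rewards]


def tag_enhance(targets, tags, beta=0.5):
    """Hypothetical tag-based smoothing: mix each query's target with the mean
    target of all queries sharing its tag to damp noisy individual rewards."""
    k = len(targets[0])
    totals = defaultdict(lambda: [0.0] * k)
    counts = defaultdict(int)
    for t, tag in zip(targets, tags):
        counts[tag] += 1
        for j in range(k):
            totals[tag][j] += t[j]
    enhanced = []
    for t, tag in zip(targets, tags):
        mean = [v / counts[tag] for v in totals[tag]]
        # A convex mix of two distributions is still a distribution.
        enhanced.append([(1 - beta) * a + beta * b for a, b in zip(t, mean)])
    return enhanced


def logits_of(W, x):
    """Linear router: one weight row per candidate LLM."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def train_router(features, targets, epochs=200, lr=0.5):
    """SGD on cross-entropy between router softmax p and soft targets q."""
    d, k = len(features[0]), len(targets[0])
    W = [[0.0] * d for _ in range(k)]
    for _ in range(epochs):
        for x, q in zip(features, targets):
            p = softmax(logits_of(W, x))
            for j in range(k):
                g = p[j] - q[j]  # gradient of the loss w.r.t. logit j
                for i in range(d):
                    W[j][i] -= lr * g * x[i]
    return W


def route(W, x):
    """Send the query to the candidate LLM with the highest router logit."""
    logits = logits_of(W, x)
    return max(range(len(logits)), key=logits.__getitem__)


# Toy data: features are [is_code, is_math, bias]; rewards are for two hypothetical LLMs.
features = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
tags = ["code", "code", "math", "math"]
rewards = [[2.0, -1.0], [1.5, 0.0], [-0.5, 1.8], [0.2, 2.5]]

targets = tag_enhance(reward_targets(rewards), tags)
W = train_router(features, targets)
print(route(W, [1.0, 0.0, 1.0]))  # → 0 (code query goes to LLM 0)
print(route(W, [0.0, 1.0, 1.0]))  # → 1 (math query goes to LLM 1)
```

At inference only `route` runs, a single cheap forward pass per query, whereas reward model ranking needs every candidate LLM to generate a response and the reward model to score each one.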