Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

Abstract

Ensembling outputs from diverse sources is a straightforward yet effective approach to boosting performance. Mixture-of-Agents (MoA) is a popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper asks, in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA, an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA, which mixes different LLMs, in a large number of scenarios: Self-MoA achieves a 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an average improvement of 3.8% across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models on AlpacaEval 2.0 directly achieves new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between the diversity and quality of outputs under various MoA settings. We confirm that MoA performance is rather sensitive to quality, and that mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios in which mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA that can aggregate a large number of LLM outputs on the fly over multiple rounds and is as effective as aggregating all outputs at once.
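The two aggregation schemes named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `proposer` and `aggregator` callables, the prompt wording, the sample counts, and the sliding-window scheme used for the sequential variant are all assumptions made for illustration. The abstract states only that Self-MoA aggregates repeated outputs from a single top-performing model, and that the sequential version aggregates many outputs over multiple rounds.

```python
from typing import Callable, List

# Hypothetical interface: takes a prompt string, returns one sampled completion.
Generate = Callable[[str], str]


def build_aggregation_prompt(query: str, candidates: List[str]) -> str:
    """Format a query and candidate responses into one aggregation prompt.

    The wording is an assumption; the abstract does not specify a prompt.
    """
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    return (
        "Synthesize the candidate responses below into a single, "
        "high-quality answer to the query.\n"
        f"Query: {query}\nCandidates:\n{numbered}\nFinal answer:"
    )


def self_moa(query: str, proposer: Generate, aggregator: Generate,
             num_samples: int = 6) -> str:
    """Sample the same model repeatedly, then aggregate all samples at once."""
    candidates = [proposer(query) for _ in range(num_samples)]
    return aggregator(build_aggregation_prompt(query, candidates))


def self_moa_seq(query: str, proposer: Generate, aggregator: Generate,
                 num_samples: int = 12, window: int = 4) -> str:
    """Aggregate samples over multiple rounds (assumed sliding-window scheme).

    Each round after the first sees the previous synthesized answer plus
    up to `window - 1` fresh samples, so no single prompt holds all samples.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    candidates = [proposer(query) for _ in range(num_samples)]
    current = aggregator(build_aggregation_prompt(query, candidates[:window]))
    pos = window
    while pos < len(candidates):
        chunk = candidates[pos:pos + window - 1]
        current = aggregator(build_aggregation_prompt(query, [current] + chunk))
        pos += window - 1
    return current
```

In practice, `proposer` would wrap a sampling call (temperature above zero) to the single strongest model so that repeated calls yield different outputs. Standard MoA would instead draw each candidate from a different model. The sequential variant bounds the number of candidates per aggregation prompt, which matters when the total number of samples would not fit in the aggregator's context window.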
