
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

February 2, 2025
Authors: Wenzhe Li, Yong Lin, Mengzhou Xia, Chi Jin
cs.AI

Abstract

Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA -- an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of 3.8% improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves the new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that the MoA performance is rather sensitive to the quality, and mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA, that is capable of aggregating a large number of LLM outputs on-the-fly over multiple rounds, and is as effective as aggregating all outputs at once.
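To make the method concrete, here is a minimal Python sketch of Self-MoA and its sequential variant. It assumes a hypothetical `generate(model, prompt, temperature)` helper that returns one completion from whatever LLM API is in use; the prompts, sample counts, and the way the sequential variant carries earlier syntheses forward are illustrative assumptions, not the authors' implementation.

```python
def self_moa(task_prompt, model, generate, num_samples=6, temperature=0.7):
    """Self-MoA: aggregate several samples drawn from a single top-performing model."""
    # In-model diversity: repeatedly sample the SAME model at a nonzero temperature.
    proposals = [generate(model, task_prompt, temperature) for _ in range(num_samples)]

    # Ask the same model to synthesize the candidates into one final answer.
    aggregation_prompt = (
        "Synthesize the candidate responses below into a single, higher-quality response.\n\n"
        f"Task: {task_prompt}\n\n"
        + "\n\n".join(f"Candidate {i + 1}:\n{p}" for i, p in enumerate(proposals))
    )
    return generate(model, aggregation_prompt, 0.0)


def self_moa_seq(task_prompt, model, generate, num_rounds=3, samples_per_round=3):
    """Sequential Self-MoA (sketch): aggregate many outputs on-the-fly over multiple
    rounds by carrying the previous round's synthesis forward as an extra candidate
    (an illustrative assumption about the round-to-round interface)."""
    synthesis = None
    for _ in range(num_rounds):
        proposals = [generate(model, task_prompt, 0.7) for _ in range(samples_per_round)]
        if synthesis is not None:
            proposals.append(synthesis)  # reuse the running synthesis as one candidate
        aggregation_prompt = (
            "Synthesize the candidate responses below into a single, higher-quality response.\n\n"
            f"Task: {task_prompt}\n\n"
            + "\n\n".join(f"Candidate {i + 1}:\n{p}" for i, p in enumerate(proposals))
        )
        synthesis = generate(model, aggregation_prompt, 0.0)
    return synthesis
```

Standard MoA would instead fill `proposals` with outputs from several different models; the paper's finding is that, when one clearly strongest model is available, spending the same inference budget on repeated samples from that model tends to give a better quality-diversity trade-off.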
