Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

Abstract

Ensembling outputs from diverse sources is a straightforward yet effective approach to boosting performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA, an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA, which mixes different LLMs, in a large number of scenarios: Self-MoA achieves a 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark and an average improvement of 3.8% across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between the diversity and the quality of outputs under various MoA settings. We confirm that MoA performance is rather sensitive to quality, and that mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA that can aggregate a large number of LLM outputs on the fly over multiple rounds and is as effective as aggregating all outputs at once.
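As a rough illustration of the contrast described above, and not the authors' implementation, the three aggregation patterns can be sketched as follows. Everything here is an assumption made for the example: the `Proposer` and `Aggregator` callables stand in for LLM calls, `n_samples` and `window` are hypothetical parameters, and the sequential variant assumes that each round feeds the previous synthesized output back in alongside fresh samples, which the abstract does not specify.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interfaces (not from the paper):
# a proposer maps a prompt to one sampled response,
# an aggregator maps (prompt, candidate responses) to one synthesized response.
Proposer = Callable[[str], str]
Aggregator = Callable[[str, Sequence[str]], str]


def mixed_moa(prompt: str, proposers: Sequence[Proposer], aggregate: Aggregator) -> str:
    """Standard MoA: one response from each of several different models."""
    candidates = [propose(prompt) for propose in proposers]
    return aggregate(prompt, candidates)


def self_moa(prompt: str, best: Proposer, aggregate: Aggregator, n_samples: int) -> str:
    """Self-MoA: several samples from the single top-performing model."""
    candidates = [best(prompt) for _ in range(n_samples)]
    return aggregate(prompt, candidates)


def self_moa_seq(
    prompt: str,
    best: Proposer,
    aggregate: Aggregator,
    n_samples: int,
    window: int,
) -> str:
    """Sequential variant (assumed design): aggregate at most `window`
    responses per round, carrying the previous synthesis into the next round."""
    if n_samples < 1 or window < 2:
        raise ValueError("need n_samples >= 1 and window >= 2")
    current: Optional[str] = None
    remaining = n_samples
    while remaining > 0:
        # After the first round, one slot is reserved for the carried-over output.
        slots = window if current is None else window - 1
        take = min(slots, remaining)
        batch: List[str] = [best(prompt) for _ in range(take)]
        remaining -= take
        if current is not None:
            batch = [current] + batch
        current = aggregate(prompt, batch)
    return current
```

In this sketch the only difference between `mixed_moa` and `self_moa` is where the candidates come from; the aggregation step is identical. The sequential variant bounds how many responses the aggregator sees at once, which matters when the total number of samples would not fit in one context window.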
