LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

Abstract

We present LLM-Blender, an ensembling framework designed to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). Our framework consists of two modules, PairRanker and GenFuser, which address the observation that the optimal LLM can vary significantly from one example to another. PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs. It jointly encodes the input text and a pair of candidates, using cross-attention encoders to determine the superior one. Our results demonstrate that PairRanker exhibits the highest correlation with ChatGPT-based ranking. GenFuser then merges the top-ranked candidates, generating an improved output by capitalizing on their strengths and mitigating their weaknesses. To facilitate large-scale evaluation, we introduce a benchmark dataset, MixInstruct, which is a mixture of multiple instruction datasets featuring oracle pairwise comparisons. Our LLM-Blender significantly outperforms individual LLMs and baseline methods across various metrics, establishing a substantial performance gap.
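The rank-then-fuse flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_compare` is a hypothetical word-overlap heuristic standing in for the trained cross-attention PairRanker, and counting pairwise wins is just one simple way to turn pairwise decisions into a ranking. The abstract does not specify the aggregation rule, and the fusion step (GenFuser) is not modeled at all.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical comparator type: (source, candidate_a, candidate_b) -> True if a is preferred.
Comparator = Callable[[str, str, str], bool]


def toy_compare(source: str, cand_a: str, cand_b: str) -> bool:
    """Stand-in for a learned pairwise scorer (NOT the real PairRanker).

    Prefers the candidate that shares more words with the source text.
    Ties go to cand_a.
    """
    src = set(source.lower().split())
    overlap_a = len(src & set(cand_a.lower().split()))
    overlap_b = len(src & set(cand_b.lower().split()))
    return overlap_a >= overlap_b


def rank_by_wins(source: str, candidates: List[str],
                 compare: Comparator = toy_compare) -> List[int]:
    """Compare every pair of candidates once and rank indices by win count."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if compare(source, candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # sorted() is stable, so equal win counts keep their original order.
    return sorted(range(len(candidates)), key=lambda k: -wins[k])


def select_top_k(source: str, candidates: List[str], k: int = 3,
                 compare: Comparator = toy_compare) -> List[str]:
    """Return the k best-ranked candidates, i.e. the inputs a fusion model would see."""
    order = rank_by_wins(source, candidates, compare)
    return [candidates[i] for i in order[:k]]
```

In a real system, the selected candidates would be passed together with the input to a generative model that writes the fused output; swapping `toy_compare` for a learned scorer leaves the ranking loop unchanged.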
