LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

Abstract

We present LLM-Blender, an ensembling framework designed to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). Our framework consists of two modules, PairRanker and GenFuser, and is motivated by the observation that the optimal LLM can vary significantly from one example to another. PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs. It jointly encodes the input text and a pair of candidates, using cross-attention encoders to determine the superior one. Our results demonstrate that PairRanker exhibits the highest correlation with ChatGPT-based ranking. GenFuser then merges the top-ranked candidates, generating an improved output by capitalizing on their strengths and mitigating their weaknesses. To facilitate large-scale evaluation, we introduce a benchmark dataset, MixInstruct, which is a mixture of multiple instruction datasets featuring oracle pairwise comparisons. LLM-Blender significantly outperforms individual LLMs and baseline methods across various metrics, establishing a substantial performance gap.
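The rank-then-fuse flow described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: `prefer_first` is a hypothetical stand-in for the trained cross-attention comparator, win counting is just one simple way to turn pairwise decisions into a ranking, and `build_fusion_input` only shows how top-ranked candidates might be concatenated with the input before being passed to a generative fusion model.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical comparator signature: returns True when the first candidate
# is judged better than the second for the given input text.
Comparator = Callable[[str, str, str], bool]


def rank_by_pairwise_wins(
    source: str, candidates: List[str], prefer_first: Comparator
) -> List[int]:
    """Return candidate indices ordered from most to fewest pairwise wins."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        if prefer_first(source, candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # sorted() is stable, so ties keep their original candidate order.
    return sorted(range(len(candidates)), key=lambda k: wins[k], reverse=True)


def build_fusion_input(
    source: str, candidates: List[str], order: List[int], top_k: int = 3
) -> str:
    """Concatenate the input with the top_k ranked candidates (assumed format)."""
    top = [candidates[k] for k in order[:top_k]]
    lines = [f"Candidate {n}: {c}" for n, c in enumerate(top, start=1)]
    return source + "\n" + "\n".join(lines)


def toy_comparator(source: str, a: str, b: str) -> bool:
    """Placeholder only: prefers the longer candidate. A real ranker is learned."""
    return len(a) >= len(b)


if __name__ == "__main__":
    cands = ["a", "abcd", "abc"]
    order = rank_by_pairwise_wins("q", cands, toy_comparator)
    print(order)
    print(build_fusion_input("q", cands, order, top_k=2))
```

With N candidates this makes N(N-1)/2 comparator calls, which is the cost of comparing every pair directly instead of scoring each candidate in isolation.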

