Making, not Taking, the Best of N

Abstract

Obtaining high-quality generations from modern LLMs has largely been framed as a selection problem: identifying a single winning generation from a diverse pool of N samples, the Best-of-N (BoN) approach. Yet this approach is inherently zero-sum, discarding diverse and potentially useful information from the pool. Instead, we explore a collaborative setup in which all candidates can potentially contribute to the final winning generation. To this end, we propose Fusion-of-N (FusioN): a method that uses a general LLM judge to synthesize the most informative elements of each sample into a single final answer. We compare FusioN to BoN in two settings: (i) test-time scaling, where we sample and aggregate from a single model at test time, and (ii) synthetic data generation, where we fuse samples from a pool of diverse teachers to improve a student model. We extensively benchmark both setups across 11 languages, 3 diverse tasks, and varying model scales. Across the benchmarks, FusioN consistently outperforms BoN, showing versatility and robustness both in test-time scaling and in downstream gains from synthetic data generation. We also perform extensive analysis of FusioN, which shows surprising strengths and robustness under challenging settings. These results show that we should shift how we think about evaluating and utilizing LLM generations, from a monolithic measure of quality to embracing their polylithic nature. This shift allows us to integrate diverse strengths, unlock latent potential, and achieve improvements that were previously inaccessible through selection alone.
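
To make the contrast between selection and fusion concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the prompt wording, and the `score` and `judge` callables are all hypothetical stand-ins for a reward model and an LLM judge, neither of which the abstract specifies.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Selection: keep the single highest-scoring candidate, discard the rest."""
    return max(candidates, key=score)


def build_fusion_prompt(prompt: str, candidates: Sequence[str]) -> str:
    """Show every candidate to the judge (hypothetical prompt wording)."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates, start=1))
    return (
        f"Task: {prompt}\n"
        f"Candidate answers:\n{numbered}\n"
        "Write one final answer that combines the most informative "
        "parts of the candidates."
    )


def fusion_of_n(
    prompt: str, candidates: Sequence[str], judge: Callable[[str], str]
) -> str:
    """Fusion: the judge writes a new answer after seeing all N candidates."""
    return judge(build_fusion_prompt(prompt, candidates))


if __name__ == "__main__":
    samples = ["Paris.", "The capital is Paris, on the Seine.", "It is Paris."]
    # Toy stand-ins (assumptions): length as a "score", an echo as the "judge".
    print(best_of_n(samples, score=len))
    print(fusion_of_n("What is the capital of France?", samples, judge=lambda p: p))
```

The structural difference is that `best_of_n` can only return one of its inputs, whereas `fusion_of_n` returns whatever the judge writes after reading all of them, so information from non-winning candidates can still reach the final answer.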

