
Making, not Taking, the Best of N

October 1, 2025
Authors: Ammar Khairi, Daniel D'souza, Marzieh Fadaee, Julia Kreutzer
cs.AI

Abstract

Obtaining high-quality generations in modern LLMs has largely been framed as a selection problem: identifying a single winning generation from a diverse pool of N samples, the Best-of-N (BoN). Yet this approach is inherently zero-sum, discarding diverse and potentially useful information from the pool. Instead, we explore a collaborative setup, where all candidates can potentially contribute to the final winning generation. To this end, we propose Fusion-of-N (FusioN): a method that uses a general LLM judge to synthesize the most informative elements of each sample into a single final answer. We compare FusioN to BoN in two settings: (i) test-time scaling, where we sample and aggregate from a single model at test time, and (ii) synthetic data generation, where we fuse samples from a pool of diverse teachers to improve a student model. We extensively benchmark both setups across 11 languages, 3 diverse tasks, and varying model scales. Across the benchmark, FusioN consistently outperforms BoN, showing versatility and robustness both in test-time scaling and in downstream gains from synthetic data generation. We also perform extensive analysis of FusioN, where it shows surprising strengths and robustness under challenging settings. These results show that we should shift how we think about evaluating and utilizing LLM generations, moving from a monolithic measure of quality to embracing their polylithic nature. This shift allows us to integrate diverse strengths, unlock latent potential, and achieve improvements that were previously inaccessible through selection alone.