When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
June 25, 2025
Authors: Ammar Khairi, Daniel D'souza, Ye Shen, Julia Kreutzer, Sara Hooker
cs.AI
Abstract
Recent advancements in large language models (LLMs) have shifted focus toward
scaling inference-time compute, improving performance without retraining the
model. A common approach is to sample multiple outputs in parallel and select
one of them as the final output. However, work to date has focused on English
and a handful of domains such as math and code. In contrast, we are most
interested in techniques that generalize across open-ended tasks, formally
verifiable tasks, and languages. In this work, we study how to robustly
scale inference-time compute for open-ended generative tasks in a multilingual,
multi-task setting.
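The parallel-sampling-plus-selection approach described above is often called best-of-n. A minimal sketch of the idea follows, assuming a hypothetical `sample_model` decoding call and a placeholder `score` selector (neither is from the paper; a real selector might be a reward model or an LLM judge):

```python
import random

def sample_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for one LLM decoding call."""
    # A real implementation would call an inference API here.
    return f"candidate (T={temperature:.2f}, r={random.random():.3f})"

def score(prompt: str, candidate: str) -> float:
    """Placeholder selector, e.g. a reward model or LLM-as-judge."""
    return random.random()

def best_of_n(prompt: str, n: int = 5, temperature: float = 0.7) -> str:
    """Draw n samples in parallel and keep the one the selector ranks highest."""
    candidates = [sample_model(prompt, temperature) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

if __name__ == "__main__":
    print(best_of_n("Translate 'thank you' into Yoruba.", n=5))
```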
Our findings show that both the temperature-based sampling strategy and the
selection strategy must be adapted to account for diverse domains and
varied language settings. We evaluate existing selection methods, revealing
that strategies effective in English often fail to generalize across languages.
We propose novel sampling and selection strategies specifically adapted for
multilingual and multi-task inference scenarios, and show they yield notable
gains across languages and tasks. In particular, our combined sampling and
selection methods lead to an average +6.8 jump in win-rates for our 8B models
on m-ArenaHard-v2.0 prompts, against proprietary models such as Gemini. At
larger scale, Command-A (a 111B model) equipped with our methods shows a +9.0
improvement in win-rates on the same benchmark with just five samples against
single-sample decoding, a substantial increase at minimal cost. Our results
underscore the need for language- and task-aware approaches to inference-time
compute, aiming to democratize performance improvements in underrepresented
languages.
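The abstract notes that the temperature-based sampling strategy itself must be adapted per domain and language. One simple form of temperature variation is to spread the n samples over a range of temperatures rather than reusing a single value; the schedule below (an even spread between `t_min` and `t_max`) is an illustrative assumption, not the paper's exact method:

```python
def temperature_schedule(n: int, t_min: float = 0.3, t_max: float = 1.1) -> list[float]:
    """Evenly spread n sampling temperatures over [t_min, t_max].

    Illustrative assumption: the paper varies temperature across samples,
    but this exact linear schedule is not taken from the abstract.
    """
    if n == 1:
        return [(t_min + t_max) / 2]
    step = (t_max - t_min) / (n - 1)
    return [t_min + i * step for i in range(n)]

# Each candidate is then drawn at its own temperature, e.g.:
# candidates = [sample_model(prompt, t) for t in temperature_schedule(5)]
print(temperature_schedule(5))  # approximately [0.3, 0.5, 0.7, 0.9, 1.1]
```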