When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs

Abstract

Recent advancements in large language models (LLMs) have shifted focus toward scaling inference-time compute, improving performance without retraining the model. A common approach is to sample multiple outputs in parallel and select one of them as the final output. However, work to date has focused on English and a handful of domains such as math and code. In contrast, we are most interested in techniques that generalize across open-ended tasks, formally verifiable tasks, and languages. In this work, we study how to robustly scale inference-time compute for open-ended generative tasks in a multilingual, multi-task setting. Our findings show that both the sampling strategy, which is based on temperature variation, and the selection strategy must be adapted to account for diverse domains and varied language settings. We evaluate existing selection methods and find that strategies effective in English often fail to generalize across languages. We propose novel sampling and selection strategies specifically adapted for multilingual and multi-task inference scenarios, and show that they yield notable gains across languages and tasks. In particular, our combined sampling and selection methods lead to an average +6.8 jump in win-rates for our 8B models on m-ArenaHard-v2.0 prompts against proprietary models such as Gemini. At larger scale, Command-A (a 111B model) equipped with our methods shows a +9.0 improvement in win-rates on the same benchmark with just five samples, compared with single-sample decoding, a substantial increase at minimal cost. Our results underscore the need for language- and task-aware approaches to inference-time compute, aiming to democratize performance improvements in underrepresented languages.
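The sample-then-select pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: `toy_generate` and `toy_score` are hypothetical stand-ins for an LLM call and a reward model or judge, and the five temperature values are illustrative assumptions. Only the overall shape (draw several candidates at varied temperatures, keep the highest-scoring one) comes from the text.

```python
import random
from typing import Callable, List, Sequence


def sample_candidates(
    generate: Callable[[str, float], str],
    prompt: str,
    temperatures: Sequence[float],
) -> List[str]:
    """Draw one candidate per temperature (hypothetical generator interface)."""
    return [generate(prompt, t) for t in temperatures]


def select_best(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate; ties go to the earliest one."""
    return max(candidates, key=score)


def toy_generate(prompt: str, temperature: float) -> str:
    """Stand-in for an LLM call: higher temperature appends more random words."""
    rng = random.Random(f"{prompt}|{temperature}")  # deterministic per input
    extra = int(temperature * 5)
    words = [rng.choice(["alpha", "beta", "gamma", "delta"]) for _ in range(extra)]
    return " ".join([prompt] + words)


def toy_score(text: str) -> float:
    """Stand-in for a reward model or judge: counts distinct words."""
    return float(len(set(text.split())))


if __name__ == "__main__":
    temps = [0.0, 0.3, 0.7, 1.0, 1.3]  # five samples; values are assumptions
    cands = sample_candidates(toy_generate, "hello", temps)
    print(select_best(cands, toy_score))
```

In a real setting, `generate` would wrap a model API and `score` would wrap whichever selection method is being evaluated; the abstract's point is that both choices may need to differ by language and task.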

