SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling

Abstract

Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods distribute resources uniformly across all reasoning sub-problems, so challenging sub-problems receive insufficient attention while routine operations consume disproportionate resources. This uniform allocation creates a performance bottleneck in which additional compute yields diminishing returns. Inspired by dual-process theory, we propose SCALE (Selective Resource Allocation), a framework that allocates computational resources selectively according to sub-problem difficulty. SCALE operates in four stages: (1) decomposing the problem into sequential reasoning sub-problems, (2) assessing the difficulty of each sub-problem to distinguish routine operations from computationally challenging ones, (3) assigning a processing mode, System 1 for simple sub-problems and System 2 for complex ones, and (4) executing the sub-problems sequentially with context propagation. By concentrating resources on challenging sub-problems while handling routine operations efficiently, SCALE achieves substantial performance improvements with better resource utilization. Extensive experiments show that SCALE significantly outperforms uniform scaling baselines, improving accuracy by up to 13.75 percentage points (from 57.50% to 71.25% on AIME25) while reducing computational cost by 33%-53%. These results represent a major advance in test-time scaling that addresses fundamental limitations of current approaches.
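The four stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `run_scale`, the callables `decompose`, `assess_difficulty`, `solve_fast`, and `solve_slow`, the numeric difficulty score, and the `threshold` value are all hypothetical names and assumptions introduced here for illustration. In practice each callable would presumably wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    sub_problem: str
    mode: str    # "system1" (fast) or "system2" (deliberate)
    answer: str


def run_scale(
    problem: str,
    decompose: Callable[[str], List[str]],           # stage 1 (hypothetical)
    assess_difficulty: Callable[[str, str], float],  # stage 2 (hypothetical)
    solve_fast: Callable[[str, str], str],           # System 1 (hypothetical)
    solve_slow: Callable[[str, str], str],           # System 2 (hypothetical)
    threshold: float = 0.5,                          # assumed cut-off
) -> List[StepResult]:
    """Route each sub-problem to a cheap or an expensive solver by difficulty."""
    context = problem
    results: List[StepResult] = []
    for sub in decompose(problem):                   # stage 1: decomposition
        score = assess_difficulty(sub, context)      # stage 2: difficulty
        if score >= threshold:                       # stage 3: mode assignment
            mode, answer = "system2", solve_slow(sub, context)
        else:
            mode, answer = "system1", solve_fast(sub, context)
        results.append(StepResult(sub, mode, answer))
        # Stage 4: propagate the solved step into the context of later steps.
        context += f"\n{sub} -> {answer}"
    return results


if __name__ == "__main__":
    # Toy stand-ins for the model calls, only to show the control flow.
    steps = run_scale(
        "toy problem",
        decompose=lambda p: ["add 2 and 3", "prove the bound holds"],
        assess_difficulty=lambda s, c: 0.9 if "prove" in s else 0.1,
        solve_fast=lambda s, c: "fast answer",
        solve_slow=lambda s, c: "slow answer",
    )
    for step in steps:
        print(step.mode, step.sub_problem)
```

Passing the solvers in as callables keeps the routing logic separate from any particular model or prompting strategy, so the extra compute is spent only in `solve_slow`.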
