Kinetics: Rethinking Test-Time Scaling Laws

Abstract

We rethink test-time scaling laws from a practical efficiency perspective and show that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks the critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-N, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when spent on models above a threshold size than on smaller ones. A key reason is that in test-time scaling (TTS), attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform their dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes in problem-solving accuracy on AIME, including in evaluations on state-of-the-art Mixture-of-Experts (MoE) models. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.
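To make the claim that attention, rather than parameter count, can dominate cost more concrete, the following is a rough back-of-the-envelope sketch in Python. It is not the cost model used in the paper. It assumes that weights are read once per decoding step and shared across all parallel samples, that each sample reads its own KV cache, and that prompt length is negligible. The `ModelSpec` fields, the example configuration, and the `kv_budget` parameter (a fixed cap on attended tokens, standing in for sparse attention) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical transformer configuration (illustrative numbers only)."""
    n_params: int             # total parameter count
    n_layers: int             # number of transformer layers
    n_kv_heads: int           # key/value heads per layer
    head_dim: int             # dimension of each head
    bytes_per_value: int = 2  # assumes 16-bit weights and KV cache


def kv_bytes_per_token(m: ModelSpec) -> int:
    """Bytes of KV cache stored per generated token (keys plus values)."""
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_value


def tokens_attended(gen_len: int, kv_budget: Optional[int] = None) -> int:
    """Total cached tokens read over a decode of gen_len steps.

    Dense attention: step t reads roughly t cached tokens, so the total is
    gen_len * (gen_len + 1) / 2. With a budget, step t reads min(t, kv_budget).
    """
    if kv_budget is None or gen_len <= kv_budget:
        return gen_len * (gen_len + 1) // 2
    ramp_up = kv_budget * (kv_budget + 1) // 2
    return ramp_up + (gen_len - kv_budget) * kv_budget


def decode_traffic(
    m: ModelSpec,
    gen_len: int,
    n_samples: int,
    kv_budget: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (weight_bytes, kv_bytes) read while decoding n_samples sequences."""
    # Weights are loaded once per step and reused for every sample in the batch.
    weight_bytes = m.n_params * m.bytes_per_value * gen_len
    # Each sample reads its own KV cache at every step.
    kv_bytes = n_samples * kv_bytes_per_token(m) * tokens_attended(gen_len, kv_budget)
    return weight_bytes, kv_bytes


if __name__ == "__main__":
    # Hypothetical small model; not taken from the paper.
    small = ModelSpec(n_params=600_000_000, n_layers=28, n_kv_heads=8, head_dim=128)
    for gen_len, n_samples, budget in [(128, 1, None), (16384, 8, None), (16384, 8, 1024)]:
        w, kv = decode_traffic(small, gen_len, n_samples, budget)
        print(f"len={gen_len} samples={n_samples} budget={budget} "
              f"weights={w:.3e}B kv={kv:.3e}B ratio={kv / w:.2f}")
```

Under these assumptions, KV-cache traffic grows quadratically with generation length and linearly with the number of samples, while weight traffic grows only linearly with length. For long generations and many parallel samples the attention term therefore overtakes the parameter term even for a small model, and capping the number of attended tokens brings the KV term back toward linear growth.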
