Kinetics: Rethinking Test-Time Scaling Laws
June 5, 2025
Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
cs.AI
Abstract
We rethink test-time scaling laws from a practical efficiency perspective,
revealing that the effectiveness of smaller models is significantly
overestimated. Prior work, grounded in compute-optimality, overlooks critical
memory access bottlenecks introduced by inference-time strategies (e.g.,
Best-of-N, long CoTs). Our holistic analysis, spanning models from 0.6B to
32B parameters, reveals a new Kinetics Scaling Law that better guides resource
allocation by incorporating both computation and memory access costs. The
Kinetics Scaling Law suggests that test-time compute is used more effectively
on models above a certain size threshold than on smaller ones. A key reason is
that in test-time scaling (TTS), attention, rather than parameter count,
emerges as the dominant cost factor.
Motivated by this, we propose a new scaling paradigm centered on sparse
attention, which lowers per-token cost and enables longer generations and more
parallel samples within the same resource budget. Empirically, we show that
sparse attention models consistently outperform dense counterparts, achieving
gains of over 60 points in low-cost regimes and over 5 points in high-cost
regimes in problem-solving accuracy on AIME, in evaluations that include
state-of-the-art MoEs. These results suggest that sparse attention is essential
for realizing the full potential of test-time scaling because, unlike training,
where parameter scaling saturates, test-time accuracy continues to improve
through increased generation. The code is available at
https://github.com/Infini-AI-Lab/Kinetics.
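
To make the cost argument concrete, the sketch below illustrates (under assumptions of ours, not the paper's actual cost model) why attention can dominate at test time: the parameter-compute term per decoded token is constant, while the KV-cache memory-access term grows with the generated context, so long chains of thought shift the bottleneck to attention, and sparse attention (attending to only a fraction of the cache) shrinks that term. The function name, the 2-FLOPs-per-parameter rule of thumb, and the 100 KB-per-token KV-cache figure are illustrative placeholders.

```python
# Minimal, illustrative sketch (assumed form, not the paper's formula) of a
# per-token decoding cost that combines parameter compute with KV-cache
# memory access. With long generations, the memory term overtakes the
# parameter term; sparse attention reduces it.

def per_token_cost(params, context_len, kv_bytes_per_token, sparsity=1.0):
    """Return (compute_flops, memory_bytes) for decoding one token.

    params             : model parameter count (compute-bound term)
    context_len        : number of tokens already in the KV cache
    kv_bytes_per_token : assumed KV-cache bytes stored per token
    sparsity           : fraction of the KV cache actually attended (1.0 = dense)
    """
    compute_flops = 2 * params                                    # ~2 FLOPs per parameter per token
    memory_bytes = sparsity * context_len * kv_bytes_per_token    # KV-cache reads for attention
    return compute_flops, memory_bytes


if __name__ == "__main__":
    # Hypothetical numbers: a 0.6B-parameter model decoding at a 32k-token
    # context with ~100 KB of KV cache per token, dense vs. 10% sparse attention.
    dense = per_token_cost(0.6e9, 32_000, 100_000)
    sparse = per_token_cost(0.6e9, 32_000, 100_000, sparsity=0.1)
    print("dense  (flops, bytes):", dense)
    print("sparse (flops, bytes):", sparse)
```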