SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling
November 29, 2025
Authors: Yang Xiao, Chunpu Xu, Ruifeng Yuan, Jiashuo Wang, Wenjie Li, Pengfei Liu
cs.AI
Abstract
Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods distribute resources uniformly across all reasoning sub-problems, so challenging sub-problems receive insufficient attention while routine operations consume a disproportionate share. This uniform allocation creates a performance bottleneck in which additional compute yields diminishing returns. Inspired by dual-process theory, we propose SCALE (Selective Resource Allocation), a framework that allocates computational resources selectively according to sub-problem difficulty. SCALE operates in four stages: (1) decomposing the problem into sequential reasoning sub-problems, (2) assessing the difficulty of each sub-problem to distinguish routine operations from computationally challenging ones, (3) assigning a processing mode, System 1 for simple sub-problems and System 2 for complex ones, and (4) executing the sub-problems sequentially with context propagation. By concentrating resources on challenging sub-problems while handling routine operations efficiently, SCALE achieves substantial performance improvements with better resource utilization. Extensive experiments show that SCALE significantly outperforms uniform scaling baselines, improving accuracy by up to 13.75 percentage points (from 57.50% to 71.25% on AIME25) while reducing computational costs by 33%-53%, a major advance in test-time scaling that addresses fundamental limitations of current approaches.
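The four stages named in the abstract can be illustrated with a small control-flow sketch. This is not the authors' implementation: the function names (`scale_pipeline`, `toy_decompose`, `toy_assess`, `toy_fast`, `toy_slow`), the numeric difficulty score, and the `threshold` parameter are all hypothetical, and the toy stand-ins replace what would presumably be LLM calls in a real system. The sketch only shows how decomposition, difficulty assessment, mode selection, and sequential execution with context propagation might fit together.

```python
from dataclasses import dataclass
from typing import Callable, List

# A solver takes the current sub-problem plus the accumulated context
# (earlier sub-problems and their answers) and returns an answer string.
Solver = Callable[[str, List[str]], str]


@dataclass
class StepResult:
    sub_problem: str
    difficulty: float
    mode: str  # "system1" (fast) or "system2" (deliberate)
    answer: str


def scale_pipeline(
    problem: str,
    decompose: Callable[[str], List[str]],
    assess: Callable[[str, List[str]], float],
    solve_fast: Solver,
    solve_slow: Solver,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[StepResult]:
    context: List[str] = []
    results: List[StepResult] = []
    # Stage 1: decompose the problem into sequential sub-problems.
    for sub in decompose(problem):
        # Stage 2: assess the difficulty of this sub-problem.
        difficulty = assess(sub, context)
        # Stage 3: choose the processing mode from the difficulty score.
        mode = "system2" if difficulty >= threshold else "system1"
        solver = solve_slow if mode == "system2" else solve_fast
        # Stage 4: execute, then propagate the result as context.
        answer = solver(sub, context)
        context.append(f"{sub} => {answer}")
        results.append(StepResult(sub, difficulty, mode, answer))
    return results


# Toy stand-ins so the sketch runs without any model.
def toy_decompose(problem: str) -> List[str]:
    return [part.strip() for part in problem.split(";") if part.strip()]


def toy_assess(sub: str, context: List[str]) -> float:
    # Assumption for illustration: anything that asks for a proof is "hard".
    return 0.9 if "prove" in sub.lower() else 0.1


def toy_fast(sub: str, context: List[str]) -> str:
    return f"fast({sub})"


def toy_slow(sub: str, context: List[str]) -> str:
    return f"slow({sub} | {len(context)} prior steps)"


if __name__ == "__main__":
    steps = scale_pipeline(
        "compute 2+3; prove the bound holds",
        toy_decompose, toy_assess, toy_fast, toy_slow,
    )
    for step in steps:
        print(step.mode, step.answer)
```

Passing the solvers and the assessor in as callables keeps the routing logic separate from whatever model or budget each mode uses, so the extra compute is spent only inside `solve_slow`.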