SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling

Abstract

Test-time compute scaling has emerged as a powerful paradigm for improving mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods distribute resources uniformly across all reasoning sub-problems, so challenging sub-problems receive insufficient attention while routine operations consume disproportionate resources. This uniform allocation creates a performance bottleneck: additional compute yields diminishing returns. Inspired by dual-process theory, we propose SCALE (Selective Resource Allocation), a framework that allocates computational resources selectively according to sub-problem difficulty. SCALE operates in four stages: (1) decomposing the problem into sequential reasoning sub-problems, (2) assessing the difficulty of each sub-problem to distinguish routine operations from computationally challenging ones, (3) assigning a processing mode, System 1 for simple sub-problems and System 2 for complex ones, and (4) executing the sub-problems sequentially with context propagation. By concentrating resources on challenging sub-problems while handling routine operations efficiently, SCALE achieves substantial performance improvements with better resource utilization. Extensive experiments show that SCALE significantly outperforms uniform scaling baselines, improving accuracy by up to 13.75 percentage points (from 57.50% to 71.25% on AIME25) while reducing computational cost by 33%-53%. These results represent a major advance in test-time scaling and address fundamental limitations of current approaches.
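The four stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `run_scale`, `StepResult`, the `threshold` parameter, the difficulty score in [0, 1], and the toy stand-in functions are all assumptions made for illustration; in a real system the decomposer, difficulty assessor, and both solvers would be LLM calls. The sketch also assumes that every sub-problem is scored before execution begins, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    step: str          # sub-problem text
    difficulty: float  # assumed score in [0, 1]; higher means harder
    mode: str          # "system1" (fast) or "system2" (deliberate)
    answer: str


def run_scale(
    problem: str,
    decompose: Callable[[str], List[str]],
    assess: Callable[[str], float],
    solve_fast: Callable[[str, str], str],
    solve_slow: Callable[[str, str], str],
    threshold: float = 0.5,  # hypothetical cut-off between the two modes
) -> List[StepResult]:
    # Stage 1: split the problem into sequential sub-problems.
    steps = decompose(problem)

    # Stage 2: score the difficulty of each sub-problem.
    scores = [assess(step) for step in steps]

    # Stage 3: assign a processing mode based on the score.
    modes = ["system2" if s >= threshold else "system1" for s in scores]

    # Stage 4: execute in order, appending each result to the context
    # so that later sub-problems can see earlier answers.
    context = problem
    trace: List[StepResult] = []
    for step, score, mode in zip(steps, scores, modes):
        solver = solve_slow if mode == "system2" else solve_fast
        answer = solver(step, context)
        context = f"{context}\n{step} => {answer}"
        trace.append(StepResult(step, score, mode, answer))
    return trace


# Toy stand-ins (hypothetical) so the sketch runs without a model.
def toy_decompose(problem: str) -> List[str]:
    return [part.strip() for part in problem.split(";") if part.strip()]


def toy_assess(step: str) -> float:
    return 0.9 if "prove" in step else 0.1


def toy_fast(step: str, context: str) -> str:
    return f"fast:{step}"


def toy_slow(step: str, context: str) -> str:
    return f"slow:{step}"


for result in run_scale(
    "add 2 and 3; prove the bound", toy_decompose, toy_assess, toy_fast, toy_slow
):
    print(result.mode, result.answer)
```

Only the sub-problems scored at or above the threshold reach the expensive solver, which is the sense in which compute is concentrated on the hard steps; the threshold itself is a placeholder for whatever decision rule the full method uses.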