LIMOPro: 효율적이고 효과적인 테스트 시간 스케일링을 위한 추론 정제

초록

대규모 언어 모델(LLMs)은 테스트 시점 확장 접근법을 통해 특히 더 강력한 대규모 추론 모델(LRMs)에서 추출한 사고 연쇄(CoT) 데이터로 미세 조정될 때 놀라운 추론 능력을 보여주었습니다. 그러나 이러한 추론 연쇄는 종종 인간의 문제 해결 과정을 반영하는 장황한 요소를 포함하며, 이는 점진적 추론(필수적인 해결 경로 개발)과 기능적 요소(검증 과정, 대체 해결 접근법, 오류 수정)로 분류됩니다. 점진적 추론은 중요하지만, 기능적 요소는 테스트 시점 추론 중 계산 요구량을 크게 증가시킵니다. 우리는 PIR(Perplexity-based Importance Refinement)을 소개합니다. 이는 각 추론 단계의 중요도를 답변 예측 신뢰도에 미치는 영향에 따라 정량적으로 평가하는 원칙 기반 프레임워크입니다. PIR은 체계적으로 중요도가 낮은 기능적 단계만을 선택적으로 제거하면서 점진적 추론 요소를 보존하여, 핵심 해결 경로의 무결성을 유지하면서 장황함을 줄인 최적화된 학습 데이터를 생성합니다. PIR로 최적화된 데이터로 미세 조정된 모델은 테스트 시점 확장 특성이 우수하며, 더 간결한 추론 연쇄를 생성하면서도 정확도를 향상(+0.9\% ~ +6.6\%)시키고 토큰 사용량을 크게 줄임(-3\% ~ -41\%)으로써 도전적인 추론 벤치마크(AIME, AMC, GPQA Diamond)에서 뛰어난 성능을 보입니다. 우리의 접근법은 다양한 모델 크기, 데이터 소스, 토큰 예산에 걸쳐 강력한 일반화 능력을 보여주며, 효율적인 테스트 시점 확장, 응답 시간, 계산 효율성이 중요한 제약 조건인 시나리오에서 추론 가능한 LLMs를 배포하기 위한 실용적인 해결책을 제공합니다.

English

Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9\% to +6.6\%) with significantly reduced token usage (-3\% to -41\%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.

LIMOPro: 효율적이고 효과적인 테스트 시간 스케일링을 위한 추론 정제

LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

초록

Support