Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning

Abstract

Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent work has tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BBAM (Bayesian Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and we introduce the E^3 metric to capture the trade-off between correctness and computational efficiency. Building on theoretical results from BBAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, -39% token reduction, and +187.5% improvement in E^3. Notably, it elevates a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget's ability to close performance gaps without retraining. Our code is available at anonymous.4open.science/r/P-and-B-6513/.
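The abstract describes allocating a token budget across sub-questions according to their estimated complexity, but it does not spell out the scheduling rule. As a rough illustration only, the sketch below assumes the simplest possible rule: each sub-question receives a share of the total budget proportional to its complexity score. The function name `allocate_budget`, its parameters, and the proportional rule itself are hypothetical and are not taken from the paper or its released code.

```python
from typing import List, Sequence


def allocate_budget(complexities: Sequence[float], total_budget: int) -> List[int]:
    """Split total_budget tokens across sub-questions.

    Hypothetical sketch: shares are proportional to the estimated
    complexity of each sub-question. This is an assumed rule, not the
    scheduling strategy defined in the paper.
    """
    if total_budget < 0:
        raise ValueError("total_budget must be non-negative")
    if any(c < 0 for c in complexities):
        raise ValueError("complexity scores must be non-negative")

    n = len(complexities)
    if n == 0:
        return []

    total = sum(complexities)
    # Fall back to a uniform split when every score is zero.
    weights = [1.0 / n] * n if total == 0 else [c / total for c in complexities]

    # Real-valued shares, then their integer parts.
    raw = [w * total_budget for w in weights]
    budgets = [int(r) for r in raw]

    # Hand out tokens lost to truncation, largest fractional part first,
    # so the integer budgets add up to total_budget.
    leftover = max(0, total_budget - sum(budgets))
    order = sorted(range(n), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets
```

For example, with assumed complexity scores of 1, 2, and 1 and a total budget of 400 tokens, `allocate_budget([1, 2, 1], 400)` returns `[100, 200, 100]`: the sub-question judged twice as hard receives twice the tokens. A fixed-budget baseline corresponds to passing equal scores, which is the case the abstract associates with underthinking on harder problems.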
