Hierarchical Budget Policy Optimization for Adaptive Reasoning

Abstract

Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet they are computationally inefficient because they apply a uniform reasoning strategy regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long outputs systematically bias models away from the long reasoning paths that some problems require. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, enabling efficient resource allocation while preventing capability degradation. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with problem complexity, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior in which models automatically adjust reasoning depth to problem complexity. Our results suggest that reasoning efficiency and capability are not inherently in conflict and can be optimized simultaneously through appropriately structured hierarchical training that preserves exploration diversity.
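As a rough illustration of partitioning rollouts into budget subgroups and scoring them with a budget-aware reward, consider the minimal sketch below. It is not the HBPO implementation: the budget values, the round-robin assignment, the linear deviation penalty, and the mean-centered advantage are all assumptions chosen for readability, and every function name is hypothetical.

```python
from statistics import mean

# Hypothetical token budgets, one per subgroup (illustrative values only).
BUDGETS = [512, 1024, 2048, 2560]


def partition_rollouts(n_rollouts, budgets):
    """Assign rollout indices round-robin so each subgroup gets its own budget."""
    groups = {b: [] for b in budgets}
    for i in range(n_rollouts):
        groups[budgets[i % len(budgets)]].append(i)
    return groups


def budget_aware_reward(correct, length, budget, penalty=0.5):
    """Assumed reward shape: 0 if wrong; otherwise 1 minus a linear penalty
    on the relative deviation of the response length from the subgroup budget."""
    if not correct:
        return 0.0
    deviation = min(abs(length - budget) / budget, 1.0)
    return 1.0 - penalty * deviation


def advantages(rewards):
    """Mean-centered advantages over all rollouts of one prompt (an assumption)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    groups = partition_rollouts(8, BUDGETS)
    # Toy rollouts: (is_correct, generated_length) indexed by rollout id.
    rollouts = [(True, 400), (True, 900), (False, 2000), (True, 2500),
                (True, 1500), (False, 300), (True, 2048), (True, 600)]
    rewards = [0.0] * len(rollouts)
    for budget, ids in groups.items():
        for i in ids:
            ok, length = rollouts[i]
            rewards[i] = budget_aware_reward(ok, length, budget)
    print(advantages(rewards))
```

Because each subgroup is rewarded relative to its own budget, long correct responses in the large-budget subgroups are not penalized simply for being long, which is the intuition behind keeping long reasoning paths in the exploration space.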
