Small RL Controller, Large Language Models: RL-Based Adaptive Sampling for Test-Time Scaling

Abstract

Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP) and train a lightweight sampling controller with reinforcement learning (RL) to jointly balance answer correctness, latency, and computation cost. At each round, the controller decides whether to stop sampling or to acquire additional samples. Because the controller relies only on statistics of the final answers, it can be trained and deployed on a CPU. We further show that the resulting framework can be interpreted as the Lagrangian relaxation of a constrained optimization problem with explicit budget constraints. Experiments against strong baselines such as ASC and ESC show that our method achieves improved trade-offs among answer correctness, number of sampling rounds, and total number of samples required.
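The stop-or-continue loop described above can be illustrated with a minimal sketch. It is not the implementation from this work: the feature set in `answer_stats`, the fixed `batch_size`, the hand-written `threshold_policy` (a stand-in for the trained RL controller), and the penalty weights in `penalized_reward` are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Stats = Tuple[int, float, float]


def answer_stats(answers: Sequence[str]) -> Stats:
    """Summarize final answers only: (sample count, top-answer share, margin)."""
    n = len(answers)
    top_two = Counter(answers).most_common(2)
    top = top_two[0][1] / n
    second = top_two[1][1] / n if len(top_two) > 1 else 0.0
    return n, top, top - second


def threshold_policy(stats: Stats) -> bool:
    """Hypothetical stand-in for the learned controller; True means stop."""
    n, _top_share, margin = stats
    return n >= 4 and margin >= 0.5


def adaptive_sample(
    sample_batch: Callable[[int], List[str]],
    policy: Callable[[Stats], bool],
    batch_size: int = 4,   # assumed: a fixed number of samples per round
    max_rounds: int = 8,   # assumed: hard cap on sampling rounds
) -> Tuple[str, int, int]:
    """Sample in rounds until the policy stops; return (answer, rounds, samples)."""
    answers: List[str] = []
    rounds = 0
    while rounds < max_rounds:
        answers.extend(sample_batch(batch_size))
        rounds += 1
        if policy(answer_stats(answers)):
            break
    majority = Counter(answers).most_common(1)[0][0]
    return majority, rounds, len(answers)


def penalized_reward(correct: bool, rounds: int, samples: int,
                     lam_rounds: float, lam_samples: float) -> float:
    """Lagrangian-style objective: correctness minus weighted latency and compute."""
    return float(correct) - lam_rounds * rounds - lam_samples * samples
```

In this sketch, the number of rounds serves as a proxy for latency and the number of samples as a proxy for computation cost; a trained controller would replace `threshold_policy` while leaving the loop unchanged.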