LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Abstract

Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose AutoTTS, an environment-driven framework that changes what researchers design: instead of individual TTS heuristics, they build environments in which TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals. Controllers decide when to branch, continue, probe, prune, or stop, and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable, and fine-grained execution trace feedback that improves discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, and the entire discovery process costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.
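To make the idea of evaluating a controller over pre-collected trajectories more concrete, the following is a minimal sketch and not the AutoTTS implementation. Every name in it (`Trajectory`, `majority_stop_controller`, `replay`, the fixed segment cost of 100 tokens) is a hypothetical chosen for illustration. It assumes each cached trajectory is cut into equal-length segments and that the answer a probe would return after each segment has been recorded in advance. Only the branch, continue, probe, and stop decisions are modeled; pruning and the beta parameterization are omitted.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trajectory:
    """One pre-collected reasoning trace (hypothetical layout)."""
    probe_answers: List[str]   # cached probe result after each segment
    segment_tokens: int = 100  # assumed token cost of one segment


def majority_stop_controller(probes: List[str],
                             min_agree: int = 2) -> Tuple[str, Optional[str]]:
    """Toy policy: stop once `min_agree` branches share a probe answer."""
    if probes:
        answer, count = Counter(probes).most_common(1)[0]
        if count >= min_agree:
            return "stop", answer
    return "continue", None


def replay(trajectories: List[Trajectory], controller, width: int = 3):
    """Replay a controller over cached traces; no LLM call is made."""
    active = trajectories[:width]  # "branch": open `width` cached traces
    max_depth = min(len(t.probe_answers) for t in active)
    cost = 0
    for depth in range(max_depth):
        # "continue": every active branch consumes one more segment.
        cost += sum(t.segment_tokens for t in active)
        # "probe": read the cached intermediate answers at this depth.
        probes = [t.probe_answers[depth] for t in active]
        action, answer = controller(probes)
        if action == "stop":
            return answer, cost
    # Depth exhausted: fall back to a majority vote over the last probes.
    last = [t.probe_answers[max_depth - 1] for t in active]
    return Counter(last).most_common(1)[0][0], cost


trajs = [
    Trajectory(["3", "7", "7"]),
    Trajectory(["5", "7", "7"]),
    Trajectory(["4", "4", "7"]),
]
print(replay(trajs, majority_stop_controller, width=3))  # ('7', 600)
```

In this toy run the controller stops after two of three segments, spending 600 tokens where running all three branches to completion would spend 900. Because the replay only reads cached data, many candidate controllers can be scored on the same traces at negligible cost, which is the property the abstract attributes to the discovery environment.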