Language Models can Self-Improve at State-Value Estimation for Better Search

Abstract

Collecting ground truth task completion rewards or human demonstrations for multi-step reasoning tasks is often cost-prohibitive and time-consuming, especially in interactive domains like web tasks. To address this bottleneck, we present self-taught lookahead, a self-supervised method that leverages state-transition dynamics to train a value model capable of effectively guiding language model-controlled search. We find that moderately sized (8 billion parameters) open-weight value models improved with self-taught lookahead can match the performance of using a frontier LLM such as gpt-4o as the value model. Furthermore, we find that self-taught lookahead improves performance by 20% while reducing costs 37x compared to previous LLM-based tree search, without relying on ground truth rewards.
