Stream of Search (SoS): Learning to Search in Language

Abstract

Language models are rarely shown fruitful mistakes during training. As a result, they struggle to look beyond the next token, suffer from snowballing errors, and have difficulty predicting the consequences of their actions several steps ahead. In this paper, we show how language models can be taught to search by representing the process of search in language as a flattened string, which we call a stream of search (SoS). We propose a unified language for search that captures an array of different symbolic search strategies. We demonstrate our approach using the simple yet difficult game of Countdown, where the goal is to combine input numbers with arithmetic operations to reach a target number. We pretrain a transformer-based language model from scratch on a dataset of streams of search generated by heuristic solvers. We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory. We further finetune this model with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers. Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
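To make the idea of a flattened search trace concrete, the following is a minimal Python sketch, not the paper's implementation. It runs a plain depth-first search on a Countdown instance and records every explored state, attempted operation, and dead end as one string. The trace vocabulary ("state", "try", "dead end, backtrack", "goal reached"), the function names, and the rule that every input number must be used are all assumptions made for illustration; the paper's actual search language and heuristic solvers may differ.

```python
from itertools import combinations


def search(numbers, target, trace):
    """Depth-first search over Countdown states (hypothetical helper).

    Every visited state, attempted operation, and dead end is appended
    to `trace`, so failed branches are recorded alongside the solution.
    Assumes positive integer inputs and that all numbers must be used.
    """
    trace.append(f"state {numbers} target {target}")
    if len(numbers) == 1:
        if numbers[0] == target:
            trace.append("goal reached")
            return True
        trace.append("dead end, backtrack")
        return False
    for i, j in combinations(range(len(numbers)), 2):
        a = max(numbers[i], numbers[j])
        b = min(numbers[i], numbers[j])
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        candidates = [("+", a + b), ("*", a * b)]
        if a - b > 0:  # keep intermediate values positive
            candidates.append(("-", a - b))
        if b != 0 and a % b == 0:  # allow exact integer division only
            candidates.append(("/", a // b))
        for op, value in candidates:
            trace.append(f"try {a}{op}{b}={value}")
            if search(rest + [value], target, trace):
                return True
    return False


def stream_of_search(numbers, target):
    """Return (solved, trace) where trace is one flattened string."""
    trace = []
    solved = search(list(numbers), target, trace)
    return solved, "\n".join(trace)


if __name__ == "__main__":
    solved, stream = stream_of_search([3, 5, 2], 13)
    print(solved)
    print(stream)
```

The point of the sketch is that the returned string contains the abandoned branches (for example, trying 5+3 first and backtracking) as well as the successful path, which is the kind of data the abstract contrasts with training on the optimal trajectory only.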
