Stream of Search (SoS): Learning to Search in Language

Abstract

Language models are rarely shown fruitful mistakes while training. They then struggle to look beyond the next token, suffering from a snowballing of errors and struggling to predict the consequence of their actions several steps ahead. In this paper, we show how language models can be taught to search by representing the process of search in language, as a flattened string -- a stream of search (SoS). We propose a unified language for search that captures an array of different symbolic search strategies. We demonstrate our approach using the simple yet difficult game of Countdown, where the goal is to combine input numbers with arithmetic operations to reach a target number. We pretrain a transformer-based language model from scratch on a dataset of streams of search generated by heuristic solvers. We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory. We further finetune this model with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers. Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
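
To make the idea of a flattened search trace concrete, the following is a minimal sketch of a depth-first Countdown solver that logs every state, attempted operation, and dead end into a single string. It is illustrative only: the trace vocabulary ("state", "try", "dead end, backtrack", "goal reached"), the function names `countdown_dfs` and `stream_of_search`, and the rule that every input number must be used are assumptions made for this example, not the format or solvers used in the paper.

```python
from itertools import combinations


def countdown_dfs(numbers, target, trace):
    """Depth-first search over Countdown states, logging each step to `trace`."""
    trace.append(f"state {sorted(numbers)} target {target}")
    if len(numbers) == 1:
        # Assumption: a solution must consume all input numbers.
        if numbers[0] == target:
            trace.append("goal reached")
            return True
        trace.append("dead end, backtrack")
        return False
    for i, j in combinations(range(len(numbers)), 2):
        # Order the pair so subtraction and division stay non-negative integers.
        a, b = max(numbers[i], numbers[j]), min(numbers[i], numbers[j])
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        candidates = [(a + b, "+"), (a * b, "*"), (a - b, "-")]
        if b != 0 and a % b == 0:
            candidates.append((a // b, "/"))
        for value, op in candidates:
            trace.append(f"try {a}{op}{b}={value}")
            if countdown_dfs(rest + [value], target, trace):
                return True
    return False


def stream_of_search(numbers, target):
    """Return (solved, trace) where trace is the search flattened into one string."""
    trace = []
    solved = countdown_dfs(list(numbers), target, trace)
    return solved, " | ".join(trace)


if __name__ == "__main__":
    solved, stream = stream_of_search([2, 3, 4], 24)
    print(solved)
    print(stream)
```

The returned string includes the failed branches as well as the successful path, which is the contrast the abstract draws with training on the optimal trajectory alone.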
