Stream of Search (SoS): Learning to Search in Language

Abstract

Language models are rarely shown fruitful mistakes while training. They then struggle to look beyond the next token, suffering from a snowballing of errors and struggling to predict the consequence of their actions several steps ahead. In this paper, we show how language models can be taught to search by representing the process of search in language, as a flattened string -- a stream of search (SoS). We propose a unified language for search that captures an array of different symbolic search strategies. We demonstrate our approach using the simple yet difficult game of Countdown, where the goal is to combine input numbers with arithmetic operations to reach a target number. We pretrain a transformer-based language model from scratch on a dataset of streams of search generated by heuristic solvers. We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory. We further finetune this model with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers. Our results indicate that language models can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
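
To make the idea of a flattened search string concrete, here is a minimal sketch of a depth-first Countdown solver that records every step it takes, including dead ends and backtracking, as a single string. The function names (`countdown_dfs`, `stream_of_search`), the trace wording, the ` | ` separator, and the rule that every input number must be used are assumptions made for illustration; they are not the search language or the heuristic solvers used in the paper.

```python
from itertools import combinations


def countdown_dfs(numbers, target, trace):
    """Depth-first search over Countdown states, logging each step to `trace`."""
    if len(numbers) == 1:
        hit = numbers[0] == target
        trace.append("goal reached" if hit else "dead end, backtrack")
        return hit
    # Pick any two remaining numbers and combine them with one operation.
    for i, j in combinations(range(len(numbers)), 2):
        a, b = max(numbers[i], numbers[j]), min(numbers[i], numbers[j])
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        candidates = [("+", a + b), ("*", a * b), ("-", a - b)]
        if b != 0 and a % b == 0:  # only allow exact integer division
            candidates.append(("/", a // b))
        for op, result in candidates:
            trace.append(f"try {a}{op}{b}={result}, remaining {rest + [result]}")
            if countdown_dfs(rest + [result], target, trace):
                return True
    return False


def stream_of_search(numbers, target):
    """Return (solved, trace), where trace is the whole search as one flat string."""
    trace = [f"start {list(numbers)}, target {target}"]
    solved = countdown_dfs(list(numbers), target, trace)
    return solved, " | ".join(trace)


if __name__ == "__main__":
    solved, stream = stream_of_search([4, 2, 3], 24)
    print(solved)
    print(stream)
```

The point of the sketch is that the string keeps the failed branches (`dead end, backtrack`) rather than only the winning sequence of operations, which is the contrast the abstract draws between training on streams of search and training on optimal trajectories alone.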
