Thinking LLMs: General Instruction Following with Thought Generation
October 14, 2024
作者: Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
cs.AI
Abstract
LLMs are typically trained to answer user questions or follow instructions
similarly to how human experts respond. However, in the standard alignment
framework they lack the basic ability to think explicitly before answering.
Thinking is important for complex questions that require reasoning and planning
-- but can be applied to any task. We propose a training method for equipping
existing LLMs with such thinking abilities for general instruction following
without the use of additional human data. We achieve this through an iterative search
and optimization procedure that explores the space of possible thought
generations, allowing the model to learn how to think without direct
supervision. For each instruction, the thought candidates are scored using a
judge model to evaluate their responses only, and then optimized via preference
optimization. We show that this procedure leads to superior performance on
AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning
categories such as marketing, health and general knowledge, in addition to more
traditional reasoning & problem-solving tasks.
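The training loop described in the abstract can be sketched roughly as follows. This is a minimal illustration of our reading of the procedure (sample thought candidates, score only their responses with a judge, build preference pairs, optimize), not the authors' released code; the function names (generate_thought_and_response, judge_score, dpo_update), the number of candidates, and the example instruction are hypothetical placeholders.

```python
import random

# Hypothetical stubs standing in for the policy model, judge model, and optimizer.
def generate_thought_and_response(model, instruction):
    """Sample one (internal thought, final response) pair from the model."""
    thought = f"[{model}] reasoning about: {instruction}"
    response = f"[{model}] answer to: {instruction}"
    return thought, response

def judge_score(instruction, response):
    """Judge model scores the response only; the thought is never shown to it."""
    return random.random()  # placeholder score in [0, 1]

def dpo_update(model, preference_pairs):
    """Preference-optimization step (e.g. DPO) on (chosen, rejected) output pairs."""
    return model  # placeholder: return the 'updated' model

def thought_preference_iteration(model, instructions, num_candidates=4):
    """One iteration: sample thought candidates, score them via their responses,
    form best-vs-worst preference pairs, and run a preference-optimization update."""
    preference_pairs = []
    for instruction in instructions:
        candidates = [generate_thought_and_response(model, instruction)
                      for _ in range(num_candidates)]
        # Each candidate is scored by judging its response alone.
        scored = sorted(
            ((judge_score(instruction, resp), thought, resp)
             for thought, resp in candidates),
            key=lambda item: item[0], reverse=True)
        best, worst = scored[0], scored[-1]
        # The preference pair keeps the full (thought + response) outputs.
        preference_pairs.append(((best[1], best[2]), (worst[1], worst[2])))
    return dpo_update(model, preference_pairs)

if __name__ == "__main__":
    model = "base-llm"
    instructions = ["Write a marketing tagline for a reusable water bottle."]
    for _ in range(3):  # iterative search-and-optimize loop
        model = thought_preference_iteration(model, instructions)
```

The key point the sketch tries to capture is that no direct supervision is given on the thoughts themselves: only the downstream response is judged, and the thought is optimized indirectly through the preference pairs.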