Thinking LLMs: General Instruction Following with Thought Generation
October 14, 2024
Authors: Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
cs.AI
Abstract
LLMs are typically trained to answer user questions or follow instructions
similarly to how human experts respond. However, in the standard alignment
framework they lack the basic ability of explicit thinking before answering.
Thinking is important for complex questions that require reasoning and planning
-- but can be applied to any task. We propose a training method for equipping
existing LLMs with such thinking abilities for general instruction following
without use of additional human data. We achieve this by an iterative search
and optimization procedure that explores the space of possible thought
generations, allowing the model to learn how to think without direct
supervision. For each instruction, the thought candidates are scored using a
judge model to evaluate their responses only, and then optimized via preference
optimization. We show that this procedure leads to superior performance on
AlpacaEval and Arena-Hard, and shows gains from thinking on non-reasoning
categories such as marketing, health and general knowledge, in addition to more
traditional reasoning & problem-solving tasks.
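To make the described procedure more concrete, below is a minimal sketch of one search-and-scoring iteration, assuming the setup in the abstract: the model generates thought-plus-response candidates, a judge scores only the responses, and the best/worst pairs feed a preference-optimization step. The names `generate_fn`, `judge_fn`, `THOUGHT_PROMPT`, and `collect_preference_pairs` are hypothetical stand-ins, not the authors' released code; the actual prompts, judge, and optimizer in the paper may differ.

```python
# Sketch (assumed interfaces) of collecting preference pairs over thoughts,
# where candidates are ranked by judging the final responses only.
from typing import Callable, List, Tuple


# Assumed prompt template asking the model to think before answering.
THOUGHT_PROMPT = (
    "Respond to the instruction below. First write out your internal "
    "thoughts, then write the final response.\n\nInstruction: {instruction}"
)


def collect_preference_pairs(
    instructions: List[str],
    generate_fn: Callable[[str], Tuple[str, str]],  # returns (thought, response)
    judge_fn: Callable[[str, str], float],          # scores (instruction, response)
    num_candidates: int = 8,
) -> List[dict]:
    """For each instruction, sample thought+response candidates, score only the
    responses with the judge, and keep the best/worst pair as a preference pair
    for the next optimization iteration (e.g. DPO-style training)."""
    pairs = []
    for instruction in instructions:
        prompt = THOUGHT_PROMPT.format(instruction=instruction)
        candidates = [generate_fn(prompt) for _ in range(num_candidates)]
        # The judge never sees the thoughts -- it ranks candidates by their
        # final responses alone, so thinking is only supervised indirectly.
        scored = sorted(
            candidates, key=lambda tr: judge_fn(instruction, tr[1]), reverse=True
        )
        (best_thought, best_resp), (worst_thought, worst_resp) = scored[0], scored[-1]
        pairs.append({
            "prompt": prompt,
            "chosen": best_thought + "\n" + best_resp,      # preferred thought+response
            "rejected": worst_thought + "\n" + worst_resp,  # dispreferred thought+response
        })
    return pairs
```

In this reading, the preference pairs are built over the full thought-plus-response texts even though scoring ignores the thoughts, which is how the model can learn useful thinking without direct supervision on the thoughts themselves.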