Thinking LLMs: General Instruction Following with Thought Generation

Abstract

LLMs are typically trained to answer user questions or follow instructions in the way human experts would respond. However, under the standard alignment framework they lack the basic ability to think explicitly before answering. Thinking is important for complex questions that require reasoning and planning, but it can be applied to any task. We propose a training method that equips existing LLMs with such thinking abilities for general instruction following, without the use of additional human data. We achieve this through an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored by a judge model that evaluates only their responses, and are then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, with gains from thinking on non-reasoning categories such as marketing, health, and general knowledge, in addition to more traditional reasoning and problem-solving tasks.
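
The scoring step described in the abstract can be illustrated with a minimal sketch of one search round for a single instruction. Everything named here is an assumption made for illustration: `generate` and `judge` are hypothetical callables standing in for the LLM and the judge model, the toy versions below are placeholders, and pairing the highest-scored candidate against the lowest-scored one is only one plausible way to form a preference pair (the abstract does not say how pairs are built).

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Candidate:
    thought: str   # internal reasoning produced before the answer
    response: str  # the final answer, the only part the judge sees


def build_preference_pair(
    instruction: str,
    generate: Callable[[str], Candidate],
    judge: Callable[[str, str], float],
    num_samples: int = 4,
) -> Tuple[Candidate, Candidate]:
    """Sample candidates, score their responses only, return (chosen, rejected)."""
    candidates = [generate(instruction) for _ in range(num_samples)]
    # The thought is deliberately withheld from the judge: only the
    # response is scored, so thoughts are supervised indirectly.
    ranked = sorted(candidates, key=lambda c: judge(instruction, c.response))
    # Assumption: pair the best-scored candidate with the worst-scored one.
    return ranked[-1], ranked[0]


def make_toy_generator(seed: int = 0) -> Callable[[str], Candidate]:
    """Hypothetical stand-in for sampling thought + response from an LLM."""
    rng = random.Random(seed)

    def generate(instruction: str) -> Candidate:
        n = rng.randint(1, 5)
        return Candidate(
            thought=f"plan an answer in {n} steps",
            response=" ".join(["word"] * n),
        )

    return generate


def toy_judge(instruction: str, response: str) -> float:
    """Hypothetical stand-in for a judge model: longer responses score higher."""
    return float(len(response.split()))


if __name__ == "__main__":
    chosen, rejected = build_preference_pair(
        "Write a slogan for a bakery.", make_toy_generator(seed=0), toy_judge
    )
    print("chosen:  ", chosen)
    print("rejected:", rejected)
```

In a full training loop, pairs like this would be collected over many instructions, passed to a preference optimization step, and the updated model would be used to sample candidates for the next iteration.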
