Effectively Controlling Reasoning Models through Thinking Intervention

Abstract

Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps before producing a final answer, which helps them excel at complex problem-solving. In this paper, we demonstrate that this emerging generation framework offers a unique opportunity for more fine-grained control over model behavior. We propose Thinking Intervention, a novel paradigm that explicitly guides the internal reasoning process of an LLM by strategically inserting or revising specific thinking tokens. We conduct comprehensive evaluations across multiple tasks: instruction following on IFEval, instruction hierarchy on SEP, and safety alignment on XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention significantly outperforms baseline prompting approaches, achieving up to 6.7% accuracy gains in instruction-following scenarios, a 15.4% improvement in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs.
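To make "inserting or revising specific thinking tokens" concrete, the following is a minimal sketch and not the authors' implementation. It assumes the model wraps its reasoning in an R1-style `<think>` block and that generation can be continued from a caller-supplied prefix. The function names, the trigger phrase, and the intervention sentence are all hypothetical, and the model's chat template is deliberately omitted.

```python
# Minimal sketch of a thinking intervention as plain string manipulation.
# Assumption: reasoning is delimited by an R1-style "<think>" tag and the
# serving stack lets the caller supply the start of the model's output.

THINK_OPEN = "<think>"  # assumed delimiter, not taken from the abstract


def insert_at_start(user_prompt: str, intervention: str) -> str:
    """Build a generation prefix whose reasoning block opens with `intervention`.

    The model would then continue reasoning from the end of this prefix.
    """
    return f"{user_prompt}\n{THINK_OPEN}\n{intervention.strip()}\n"


def insert_after_trigger(partial_thinking: str, trigger: str, intervention: str) -> str:
    """Revise partial reasoning by inserting `intervention` after `trigger`.

    If the trigger does not occur, the reasoning is returned unchanged.
    """
    idx = partial_thinking.find(trigger)
    if idx == -1:
        return partial_thinking
    cut = idx + len(trigger)
    return partial_thinking[:cut] + " " + intervention.strip() + partial_thinking[cut:]


if __name__ == "__main__":
    # Hypothetical intervention text for an instruction-following task.
    prefix = insert_at_start(
        "Summarize the article in one sentence.",
        "I must follow the length constraint in the instruction exactly.",
    )
    print(prefix)
```

In this sketch the intervention lives inside the reasoning block rather than in the user prompt, which is the distinction the abstract draws against baseline prompting approaches. Where and how often to intervene is a design choice the abstract does not specify.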

