CopT: Contrastive On-Policy Thinking in Continuous Space for General and Agentic Reasoning

Abstract

Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes on-policy thinking conditioned on that draft for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that, under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.
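To make the contrastive check concrete, the following is a minimal sketch and not the released implementation. It assumes that per-token log-probabilities of the same draft-answer tokens are already available under the two input modes, and that the sequence-level estimate is the mean log-ratio between them. The direction of the ratio, the averaging, the function names, and the threshold value are all illustrative assumptions. The second half checks numerically, on a toy distribution, the standard identity behind the mutual-information claim: if the continuous-input distribution is taken to be the marginal over a latent state Z, then the expected KL from p(y|z) to p(y) equals I(Z;Y).

```python
import math


def sequence_kl_estimate(logp_discrete, logp_continuous):
    """Mean log-ratio over the same generated tokens (assumed estimator form).

    logp_discrete[t]   : log-prob of token t under discrete-token inputs
    logp_continuous[t] : log-prob of the same token under continuous-embedding inputs
    """
    if not logp_discrete or len(logp_discrete) != len(logp_continuous):
        raise ValueError("need two non-empty lists of equal length")
    diffs = [d - c for d, c in zip(logp_discrete, logp_continuous)]
    return sum(diffs) / len(diffs)


def needs_more_thinking(estimate, threshold=0.5):
    """Hypothetical gate: a large estimate is read as an unreliable draft."""
    return estimate > threshold


def marginal(p_z, p_y_given_z):
    """p(y) = sum_z p(z) p(y|z)."""
    n_y = len(p_y_given_z[0])
    return [sum(pz * row[y] for pz, row in zip(p_z, p_y_given_z)) for y in range(n_y)]


def expected_kl_to_marginal(p_z, p_y_given_z):
    """E_z[ KL( p(y|z) || p(y) ) ]."""
    p_y = marginal(p_z, p_y_given_z)
    total = 0.0
    for pz, row in zip(p_z, p_y_given_z):
        for y, pyz in enumerate(row):
            if pyz > 0:
                total += pz * pyz * math.log(pyz / p_y[y])
    return total


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def mutual_information(p_z, p_y_given_z):
    """I(Z;Y) = H(Y) - H(Y|Z), computed by a separate route as a cross-check."""
    h_y_given_z = sum(pz * entropy(row) for pz, row in zip(p_z, p_y_given_z))
    return entropy(marginal(p_z, p_y_given_z)) - h_y_given_z


if __name__ == "__main__":
    # Toy log-probabilities for a three-token draft answer (made-up values).
    est = sequence_kl_estimate([-0.1, -0.2, -0.1], [-0.9, -1.4, -0.7])
    print("estimate:", est, "think more:", needs_more_thinking(est))

    # Toy latent state with two values and a binary answer token.
    p_z = [0.5, 0.5]
    p_y_given_z = [[0.9, 0.1], [0.2, 0.8]]
    print(expected_kl_to_marginal(p_z, p_y_given_z), mutual_information(p_z, p_y_given_z))
```

When the answer distribution does not depend on the latent state (identical rows in `p_y_given_z`), both quantities are zero, which matches the claim that the estimate tracks answer-relevant uncertainty rather than uncertainty in the latent state as such.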