CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

May 19, 2026
Authors: Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia, Wen Xiao, Wenke Lee
cs.AI

Abstract

Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.
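
The abstract describes the verifier only in words, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes we already have per-token log-probabilities for the same generated answer tokens under the two input modes, and it takes the estimate to be the sum of per-token log-ratios (discrete minus continuous). The sign convention, the function names, and the `threshold` value are all hypothetical. The second half checks the mutual-information claim on a toy model, under the additional assumption that the continuous-embedding input behaves like the mixture of the discrete-input predictions weighted by the latent-state probabilities.

```python
import math


def sequence_kl_estimate(logp_discrete, logp_continuous):
    """Sum of per-token log-ratios over the same generated tokens.

    Assumption: this is a single-sample, sequence-level KL estimate;
    the direction (discrete minus continuous) is a choice made here.
    """
    if len(logp_discrete) != len(logp_continuous):
        raise ValueError("both inputs must score the same generated tokens")
    return sum(d - c for d, c in zip(logp_discrete, logp_continuous))


def needs_more_thinking(logp_discrete, logp_continuous, threshold=0.5):
    """Hypothetical gate: a large estimate marks the draft answer as unreliable."""
    return sequence_kl_estimate(logp_discrete, logp_continuous) > threshold


def expected_estimate(p_z, p_y_given_z):
    """Exact expectation of the one-token estimator in a toy model.

    p_z[i] is the probability of latent state i; p_y_given_z[i][y] is the
    prediction under discrete input i. The continuous-input prediction is
    ASSUMED to be the mixture p(y) = sum_i p_z[i] * p_y_given_z[i][y].
    """
    n_y = len(p_y_given_z[0])
    p_y = [sum(pz * row[y] for pz, row in zip(p_z, p_y_given_z)) for y in range(n_y)]
    total = 0.0
    for pz, row in zip(p_z, p_y_given_z):
        for y, pyz in enumerate(row):
            if pz > 0 and pyz > 0:
                total += pz * pyz * (math.log(pyz) - math.log(p_y[y]))
    return total


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(x * math.log(x) for x in p if x > 0)


def mutual_information(p_z, p_y_given_z):
    """I(Z; Y) computed independently as H(Y) - H(Y | Z)."""
    n_y = len(p_y_given_z[0])
    p_y = [sum(pz * row[y] for pz, row in zip(p_z, p_y_given_z)) for y in range(n_y)]
    h_y_given_z = sum(pz * entropy(row) for pz, row in zip(p_z, p_y_given_z))
    return entropy(p_y) - h_y_given_z


if __name__ == "__main__":
    # Latent state changes the answer: the expected estimate is positive.
    relevant = ([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]])
    # Latent state is uncertain but irrelevant to the answer: the estimate vanishes.
    irrelevant = ([0.5, 0.5], [[0.3, 0.7], [0.3, 0.7]])
    for name, (p_z, table) in (("relevant", relevant), ("irrelevant", irrelevant)):
        print(name, expected_estimate(p_z, table), mutual_information(p_z, table))
```

In the toy model the two quantities agree, and the "irrelevant" case illustrates the abstract's point: the latent state can be maximally uncertain while the estimate stays at zero, because that uncertainty does not affect the answer token. In a real setup the two log-probability lists would come from two forward passes of the same model, which this sketch does not attempt to reproduce.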