CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

May 19, 2026
Authors: Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia, Wen Xiao, Wenke Lee
cs.AI

Abstract

Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.
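
The contrast between discrete-token and continuous-embedding inputs can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact estimator, its KL direction, or its normalization. The sketch assumes the draft tokens were sampled under the discrete-token pathway, uses a plain Monte Carlo log-ratio average as a stand-in, and treats a larger value as a sign of a less reliable draft. The names `log_ratio_estimate` and `needs_further_thinking` and the threshold value are hypothetical.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def log_ratio_estimate(
    discrete_logits: Sequence[Sequence[float]],
    continuous_logits: Sequence[Sequence[float]],
    token_ids: Sequence[int],
) -> float:
    """Average log-ratio of the generated tokens under the two input modes.

    Assumption (not from the source): the tokens were sampled under the
    discrete-token pathway, so the mean of
    log p_discrete(y_t) - log p_continuous(y_t)
    is a Monte Carlo estimate of a KL divergence between the two pathways.
    """
    if not (len(discrete_logits) == len(continuous_logits) == len(token_ids)):
        raise ValueError("all inputs must cover the same number of positions")
    if not token_ids:
        raise ValueError("need at least one generated token")
    total = 0.0
    for d_row, c_row, tok in zip(discrete_logits, continuous_logits, token_ids):
        total += log_softmax(d_row)[tok] - log_softmax(c_row)[tok]
    return total / len(token_ids)


def needs_further_thinking(estimate: float, threshold: float = 0.5) -> bool:
    """Hypothetical gate: a large disagreement triggers further thinking."""
    return estimate > threshold
```

When both pathways assign identical probabilities to the draft tokens, the estimate is zero and the hypothetical gate leaves the draft answer as is. When the discrete pathway supports the draft tokens more strongly than the continuous pathway, the estimate grows and the gate requests further on-policy thinking.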