LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning
December 5, 2025
Authors: Ömer Faruk Akgül, Yusuf Hakan Kalaycı, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna
cs.AI
Abstract
Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often "overthink": continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., "hmm", "wait") during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy–efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40–65%; on MATH-500 it improves accuracy by up to 12 points with roughly 35–60% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.
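The split conformal wrapper described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the choice of the raw probe score as the nonconformity measure, and the calibration setup (scoring cue points where a forced exit would have produced a wrong answer) are all assumptions made for the sketch.

```python
import numpy as np

def calibrate_exit_threshold(cal_scores, cal_correct, alpha=0.1):
    """Split conformal calibration for an early-exit probe.

    cal_scores:  probe confidence at calibration cue tokens (one per cue point).
    cal_correct: whether a forced exit at that cue point yields a correct answer.
    alpha:       user-chosen bound on the premature-exit rate.

    Returns a threshold such that exiting only when the probe score exceeds it
    keeps the rate of premature (wrong-answer) exits at most ~alpha, under the
    usual exchangeability assumption of split conformal prediction.
    """
    scores = np.asarray(cal_scores, dtype=float)
    correct = np.asarray(cal_correct, dtype=bool)
    # Nonconformity scores: probe confidence at "unsafe" cue points, i.e.,
    # points where stopping early would have been a mistake.
    unsafe = scores[~correct]
    n = len(unsafe)
    # Finite-sample-corrected conformal quantile level.
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(unsafe, q, method="higher"))

def should_exit(probe_score, threshold):
    """Online stopping rule applied at each reasoning cue token."""
    return probe_score > threshold
```

At inference time the generator would call `should_exit` whenever a cue token ("hmm", "wait") is emitted, stopping the chain of thought the first time the probe clears the calibrated threshold; raising `alpha` trades stricter confidence control for fewer saved tokens.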