
EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs

January 11, 2026
Authors: Jewon Yeom, Jaewon Sok, Seonghyeon Park, Jeongjae Park, Taesup Kim
cs.AI

Abstract

Improving the reasoning abilities of large language models (LLMs) has largely relied on iterative self-training with model-generated data. While effective at boosting accuracy, existing approaches primarily reinforce successful reasoning paths, incurring a substantial calibration cost: models become overconfident and lose the ability to represent uncertainty. This failure has been characterized as a form of model collapse in alignment, where predictive distributions degenerate toward low-variance point estimates. We address this issue by reframing reasoning training as an epistemic learning problem, in which models must learn not only how to reason, but also when their reasoning should be trusted. We propose epistemically-calibrated reasoning (EpiCaR) as a training objective that jointly optimizes reasoning performance and calibration, and instantiate it within an iterative supervised fine-tuning framework using explicit self-evaluation signals. Experiments on Llama-3 and Qwen-3 families demonstrate that our approach achieves Pareto-superiority over standard baselines in both accuracy and calibration, particularly in models with sufficient reasoning capacity (e.g., 3B+). This framework generalizes effectively to OOD mathematical reasoning (GSM8K) and code generation (MBPP). Ultimately, our approach enables a 3X reduction in inference compute, matching the K=30 performance of STaR with only K=10 samples in capable models.
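The abstract describes EpiCaR as a training objective that jointly optimizes reasoning performance and calibration, instantiated in iterative supervised fine-tuning with explicit self-evaluation signals. Below is a minimal sketch of what such a joint objective could look like, assuming a standard SFT cross-entropy term on verified self-generated traces plus a Brier-style calibration penalty on a self-reported confidence; the function name `epicar_loss`, the `calib_weight` trade-off, and the single-confidence-logit formulation are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def epicar_loss(reasoning_logits, reasoning_targets,
                confidence_logit, was_correct, calib_weight=0.5):
    """Illustrative joint objective (assumed form, not the paper's definition):
    cross-entropy on a verified self-generated reasoning trace, plus a
    Brier-style penalty tying the model's explicit self-evaluation to the
    trace's actual correctness.

    reasoning_logits: (seq_len, vocab) token logits for the reasoning trace
    reasoning_targets: (seq_len,) target token ids for the same trace
    confidence_logit: scalar logit from the model's self-evaluation signal
    was_correct: 1.0 if the sampled answer was verified correct, else 0.0
    """
    # Reasoning term: imitate the self-generated trace, as in STaR-style SFT.
    sft_loss = F.cross_entropy(reasoning_logits, reasoning_targets)

    # Calibration term: self-reported confidence should match the empirical
    # correctness of the accompanying trace (Brier score).
    confidence = torch.sigmoid(confidence_logit)
    calibration_loss = (confidence - was_correct) ** 2

    return sft_loss + calib_weight * calibration_loss
```

Under this reading, training only on successful traces drives `confidence` toward 1 regardless of difficulty (the overconfidence the abstract describes), while the calibration term keeps probability mass on "I may be wrong" whenever verified correctness is mixed.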