Does Inference Scaling Improve Reasoning Faithfulness? A Multi-Model Analysis of Self-Consistency Tradeoffs
January 10, 2026
Author: Deep Mehta
cs.AI
Abstract
Self-consistency has emerged as a popular technique for improving large language model accuracy on reasoning tasks. The approach is straightforward: generate multiple reasoning paths and select the most common answer through majority voting. While this reliably boosts accuracy, it remains unclear whether these gains reflect genuine improvements in reasoning quality. We investigate a fundamental question that has not been studied before: does inference scaling improve reasoning faithfulness?
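To make the procedure concrete, here is a minimal sketch of majority-voting self-consistency. It assumes a hypothetical `sample_answer` callable that draws one chain-of-thought sample and returns its final answer string; this is an illustration under those assumptions, not the paper's released implementation.

```python
from collections import Counter

def self_consistency(sample_answer, prompt: str, n: int = 5) -> str:
    """Majority vote over n independently sampled reasoning paths.

    `sample_answer` is a hypothetical callable (not from the paper's
    released code) that runs one chain-of-thought sample at nonzero
    temperature and returns the final answer as a string.
    """
    answers = [sample_answer(prompt) for _ in range(n)]
    # most_common sorts by count; ties fall back to first-seen order.
    return Counter(answers).most_common(1)[0][0]
```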
We conduct a comprehensive empirical study across four frontier models (GPT-5.2, Claude Opus 4.5, Gemini-3-flash-preview, and DeepSeek-v3.2) on 100 GSM8K mathematical reasoning problems. Our analysis employs bootstrap confidence intervals, McNemar's tests for paired comparisons, and Cohen's d effect sizes to quantify the effects rigorously. The results reveal striking differences across models that challenge common assumptions about self-consistency.
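For readers who want to see how these statistics fit together, the sketch below shows one plausible way to compute them from paired per-problem 0/1 correctness arrays (`base` for single-sample decoding, `treat` for self-consistency). The bootstrap settings and the choice of Cohen's d variant are assumptions; the abstract does not specify them.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for accuracy,
    given a 0/1 array of per-problem correctness."""
    correct = np.asarray(correct)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    means = correct[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def mcnemar_p(base, treat):
    """Exact McNemar test on paired 0/1 outcomes: a two-sided
    binomial test over the discordant pairs only."""
    base, treat = np.asarray(base), np.asarray(treat)
    b = int(np.sum((base == 1) & (treat == 0)))  # regressions
    c = int(np.sum((base == 0) & (treat == 1)))  # improvements
    return binomtest(b, b + c, 0.5).pvalue if b + c else 1.0

def cohens_d_paired(base, treat):
    """Cohen's d on paired differences (one common variant;
    the paper may use a different formulation)."""
    diff = np.asarray(treat, float) - np.asarray(base, float)
    return diff.mean() / diff.std(ddof=1)
```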
GPT-5.2 shows the expected pattern: accuracy improves from 78% to 90% at N=5, with faithfulness remaining relatively stable (0.540 to 0.510). Claude Opus 4.5 tells a completely different story. Its accuracy actually drops from 78% to 74.3% while faithfulness jumps dramatically from 0.270 to 0.891 at N=5. DeepSeek-v3.2, already at 98% accuracy, shows ceiling effects with modest faithfulness gains (0.440 to 0.541). Gemini-3-flash improves from 81% to 86% accuracy with a slight faithfulness decrease (0.260 to 0.212).
Problem difficulty analysis reveals that under self-consistency GPT-5.2 solves 82% of hard problems while breaking (flipping from correct to incorrect) only 13% of easy ones. Claude, in contrast, breaks 23% of easy problems, which explains its accuracy decrease. These findings matter for practitioners: self-consistency is not universally beneficial, and teams should test their specific models before deployment. We release our code and provide practical recommendations for navigating these tradeoffs.