Does Inference Scaling Improve Reasoning Faithfulness? A Multi-Model Analysis of Self-Consistency Tradeoffs
January 10, 2026
Author: Deep Mehta
cs.AI
Abstract
Self-consistency has emerged as a popular technique for improving large language model accuracy on reasoning tasks. The approach is straightforward: generate multiple reasoning paths and select the most common answer through majority voting. While this reliably boosts accuracy, it remains unclear whether these gains reflect genuine improvements in reasoning quality. We investigate a fundamental question that has not been studied before: does inference scaling improve reasoning faithfulness?
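To make the voting procedure concrete, the sketch below shows a minimal self-consistency loop. This is not the code released with this study; sample_fn is a hypothetical stand-in for any stochastic (temperature > 0) model call that returns a reasoning trace and a final answer, and n_paths corresponds to the N in our experiments.

```python
from collections import Counter

def self_consistency_answer(prompt, sample_fn, n_paths=5):
    """Sample n_paths chain-of-thought completions and majority-vote the final answers.

    sample_fn(prompt) is an assumed callable returning (reasoning_text, final_answer)
    from a single stochastic model call; it is not part of any specific API.
    """
    answers = []
    for _ in range(n_paths):
        _reasoning, answer = sample_fn(prompt)
        answers.append(answer)
    # Majority vote over final answers; ties break by the first answer encountered.
    return Counter(answers).most_common(1)[0][0]
```

Tie-breaking by first occurrence is only one of several reasonable choices; any deterministic rule yields the same qualitative behavior.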
We conduct a comprehensive empirical study across four frontier models (GPT-5.2, Claude Opus 4.5, Gemini-3-flash-preview, and DeepSeek-v3.2) on 100 GSM8K mathematical reasoning problems. Our analysis employs bootstrap confidence intervals, McNemar's tests for paired comparisons, and Cohen's d effect sizes to quantify the effects rigorously. The results reveal striking differences across models that challenge common assumptions about self-consistency.
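For readers who want to run the same style of analysis on their own per-problem correctness data, the sketch below shows one standard way to compute a percentile bootstrap confidence interval, a continuity-corrected McNemar's test, and a paired Cohen's d. Function names and parameters are illustrative assumptions, not the interfaces from our released code.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean accuracy over a 0/1 correctness vector."""
    correct = np.asarray(correct, dtype=float)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    means = correct[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def mcnemar_p(correct_a, correct_b):
    """Continuity-corrected McNemar's chi-square test on paired 0/1 outcomes."""
    a, b = np.asarray(correct_a, bool), np.asarray(correct_b, bool)
    n01 = np.sum(a & ~b)   # correct under condition A only
    n10 = np.sum(~a & b)   # correct under condition B only
    if n01 + n10 == 0:
        return 1.0
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return chi2.sf(stat, df=1)

def cohens_d_paired(scores_a, scores_b):
    """Paired Cohen's d: mean of the differences over the SD of the differences."""
    diff = np.asarray(scores_b, float) - np.asarray(scores_a, float)
    return diff.mean() / diff.std(ddof=1)
```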
GPT-5.2 shows the expected pattern: accuracy improves from 78% to 90% at N=5, while faithfulness remains relatively stable (0.540 to 0.510). Claude Opus 4.5 tells a completely different story: its accuracy drops from 78% to 74.3%, while faithfulness jumps from 0.270 to 0.891 at N=5. DeepSeek-v3.2, already at 98% accuracy, shows ceiling effects with only modest faithfulness gains (0.440 to 0.541). Gemini-3-flash improves from 81% to 86% accuracy with a slight decrease in faithfulness (0.260 to 0.212).
Problem difficulty analysis reveals that GPT-5.2 solves 82% of hard problems while breaking (flipping from correct to incorrect) only 13% of easy ones. Claude, in contrast, breaks 23% of easy problems, which explains its accuracy decrease. These findings matter for practitioners: self-consistency is not universally beneficial, and teams should test their specific models before deployment. We release our code and provide practical recommendations for navigating these tradeoffs.