推論スケーリングは推論の忠実性を向上させるか？自己一貫性のトレードオフに関するマルチモデル分析

要旨

自己一貫性（Self-consistency）は、推論タスクにおける大規模言語モデルの精度向上を図る手法として広く用いられるようになってきた。この手法は単純明快で、複数の推論経路を生成し、多数決によって最も一般的な回答を選択する。これは確実に精度を向上させるが、この精度向上が真の推論品質の改善を反映しているかどうかは不明なままである。本研究では、これまで検討されてこなかった根本的な疑問、すなわち「推論スケーリングは推論の忠実性（faithfulness）を向上させるのか」を検証する。我々は、4つの先進モデル（GPT-5.2、Claude Opus 4.5、Gemini-3-flash-preview、DeepSeek-v3.2）を用い、GSM8Kの数学的推論問題100問に対して包括的な実証研究を行った。分析には、ブートストラップ信頼区間、対応のある比較のためのマクネマー検定、効果量の定量化のためのコーエンのdを採用し、効果を厳密に評価した。結果は、自己一貫性に関する通念に疑問を投げかける顕著なモデル間差を示した。 GPT-5.2は期待通りのパターンを示した：精度は78%から90%（N=5時）に向上し、忠実性は比較的安定（0.540から0.510）していた。Claude Opus 4.5の結果は全く異なる：精度は78%から74.3%に低下した一方で、忠実性はN=5で0.270から0.891へと劇的に上昇した。精度が既に98%のDeepSeek-v3.2は天井効果を示し、忠実性の向上は小幅（0.440から0.541）だった。Gemini-3-flashは精度が81%から86%に向上したが、忠実性はわずかに低下（0.260から0.212）した。問題の難易度分析によれば、GPT-5.2は難問の82%を解決する一方、容易な問題の誤答は13%のみであった。対照的に、Claudeは容易な問題の23%で誤答しており、これが精度低下の原因と考えられる。これらの知見は実務家にとって重要である：自己一貫性は必ずしも万能ではなく、チームは導入前に特定のモデルをテストすべきである。我々はコードを公開し、これらのトレードオフに対処するための実践的な提言を行う。

English

Self-consistency has emerged as a popular technique for improving large language model accuracy on reasoning tasks. The approach is straightforward: generate multiple reasoning paths and select the most common answer through majority voting. While this reliably boosts accuracy, it remains unclear whether these gains reflect genuine improvements in reasoning quality. We investigate a fundamental question that has not been studied before: does inference scaling improve reasoning faithfulness? We conduct a comprehensive empirical study across four frontier models (GPT-5.2, Claude Opus 4.5, Gemini-3-flash-preview, and DeepSeek-v3.2) on 100 GSM8K mathematical reasoning problems. Our analysis employs bootstrap confidence intervals, McNemar's tests for paired comparisons, and Cohen's d effect sizes to quantify the effects rigorously. The results reveal striking differences across models that challenge common assumptions about self-consistency. GPT-5.2 shows the expected pattern: accuracy improves from 78% to 90% at N=5, with faithfulness remaining relatively stable (0.540 to 0.510). Claude Opus 4.5 tells a completely different story. Its accuracy actually drops from 78% to 74.3% while faithfulness jumps dramatically from 0.270 to 0.891 at N=5. DeepSeek-v3.2, already at 98% accuracy, shows ceiling effects with modest faithfulness gains (0.440 to 0.541). Gemini-3-flash improves from 81% to 86% accuracy with a slight faithfulness decrease (0.260 to 0.212). Problem difficulty analysis reveals that GPT-5.2 solves 82% of hard problems while breaking only 13% of easy ones. Claude, in contrast, breaks 23% of easy problems, explaining its accuracy decrease. These findings matter for practitioners: self-consistency is not universally beneficial, and teams should test their specific models before deployment. We release our code and provide practical recommendations for navigating these tradeoffs.

推論スケーリングは推論の忠実性を向上させるか？自己一貫性のトレードオフに関するマルチモデル分析

Does Inference Scaling Improve Reasoning Faithfulness? A Multi-Model Analysis of Self-Consistency Tradeoffs

要旨

Support