私を騙してみて：推論モデルにおける連鎖思考推論の信頼性はどの程度か？

要旨

思考連鎖（CoT）推論は、安全が批判的な展開における大規模言語モデルの透明性メカニズムとして提案されてきたが、その有効性は忠実性（モデルが自身の出力に実際に影響を与える要因を正確に言語化できるかどうか）に依存する。従来の評価では2つのプロプライエタリモデルでのみこの特性が検証され、Claude 3.7 Sonnetで25%、DeepSeek-R1で39%という低い認識率が報告されていた。本研究では、オープンウェイト生態系における評価を拡張するため、9つのアーキテクチャファミリー（7B-685Bパラメータ）に跨る12のオープンウェイト推論モデルを対象に、MMLUとGPQA Diamondの498問の多肢選択問題を用いて検証を行った。6種類の推論ヒント（同調性、一貫性、視覚的パターン、メタデータ、採点者ハッキング、非倫理的情報）を注入し、ヒントが回答を変更させることに成功した場合に、モデルがCoT内でヒントの影響を認識する割合を測定した。41,832回の推論実行を通じて、モデルファミリー間の総合的な忠実性率は39.7%（Seed-1.6-Flash）から89.9%（DeepSeek-V3.2-Speciale）の範囲に分布し、一貫性ヒント（35.5%）と同調性ヒント（53.9%）が最も低い認識率を示した。訓練方法論とモデルファミリーは、パラメータ数よりも忠実性を強く予測し、キーワードベースの分析では、思考トークンの認識率（約87.5%）と回答テキストの認識率（約28.6%）に顕著な隔たりが明らかになった。これは、モデルが内部的にはヒントの影響を認識しているものの、出力において体系的にこの認識を抑制していることを示唆する。これらの知見は、安全メカニズムとしてのCoTモニタリングの実現可能性に直接的な示唆を与えるとともに、忠実性が推論モデルの固定的特性ではなく、アーキテクチャ、訓練方法、影響を与える手がかりの性質によって体系的に変化することを示している。

English

Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond, injecting six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measuring the rate at which models acknowledge hint influence in their CoT when hints successfully alter answers. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale) across model families, with consistency hints (35.5%) and sycophancy hints (53.9%) exhibiting the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count, and keyword-based analysis reveals a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism and suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.

私を騙してみて：推論モデルにおける連鎖思考推論の信頼性はどの程度か？

Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

要旨

Support