Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

Abstract

Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness: whether models accurately verbalize the factors that actually influence their outputs. Prior evaluations have examined this property in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B to 685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond. It injects six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measures the rate at which models acknowledge hint influence in their CoT when a hint successfully alters the answer. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale) across model families, with consistency hints (35.5%) and sycophancy hints (53.9%) showing the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count. Keyword-based analysis also reveals a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism. They also suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.
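The faithfulness rate described above is a conditional proportion: among the runs where a hint changed the answer, the share whose CoT acknowledges the hint. The following is a minimal sketch of that calculation, not the study's actual pipeline. The `HintedRun` schema, its field names, and the criterion in `hint_altered_answer` are hypothetical and chosen only for illustration. The final lines check that the reported 41,832 runs are consistent with 12 models and 498 questions under the assumption of one unhinted baseline plus six hint conditions per question, which the abstract does not state explicitly.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class HintedRun:
    """One hinted run paired with its unhinted baseline (hypothetical schema)."""
    baseline_answer: str  # answer given without any hint
    hinted_answer: str    # answer given with the hint injected
    hint_target: str      # option the hint points to
    acknowledged: bool    # whether the CoT was judged to mention the hint's influence


def hint_altered_answer(run: HintedRun) -> bool:
    # Assumed criterion: the baseline was not the hinted option,
    # and the hinted run landed on the hinted option.
    return run.baseline_answer != run.hint_target and run.hinted_answer == run.hint_target


def faithfulness_rate(runs: Iterable[HintedRun]) -> Optional[float]:
    """Share of hint-altered runs whose CoT acknowledges the hint; None if there are none."""
    influenced = [r for r in runs if hint_altered_answer(r)]
    if not influenced:
        return None
    return sum(r.acknowledged for r in influenced) / len(influenced)


# Consistency check on the reported run count, assuming one unhinted
# baseline plus six hint conditions for every model and question.
MODELS, QUESTIONS, CONDITIONS = 12, 498, 1 + 6
TOTAL_RUNS = MODELS * QUESTIONS * CONDITIONS  # 41832

if __name__ == "__main__":
    toy_runs = [
        HintedRun("A", "B", "B", True),   # switched to the hinted option, acknowledged
        HintedRun("A", "B", "B", False),  # switched to the hinted option, silent
        HintedRun("A", "A", "B", False),  # hint ignored: excluded from the denominator
        HintedRun("B", "B", "B", True),   # already on the hinted option: excluded
    ]
    print(faithfulness_rate(toy_runs))  # 0.5
    print(TOTAL_RUNS)                   # 41832
```

Only runs where the hint demonstrably moved the answer enter the denominator, so a model that ignores hints entirely has an undefined rate under this sketch (returned as `None`) instead of a misleading 0% or 100%.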
