The Chain Holds, but the Answer Collapses: Reasoning Trace-Answer Dissociation in Reasoning Models under Adversarial Pressure

Abstract

Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from the first turn to the last while the emitted answer flips to a wrong one. We call this unfaithful capitulation (UC) and isolate it with a 2×2 latent-versus-behavioral framework, capturing what flip-rate metrics and single-turn faithfulness probes both miss. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% under no_think, which gives paired, within-model causal evidence that reasoning creates the gap. Across models the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows the answer-slot argmax is correct in 84% of UC cells; and a naive trace-anchored defense backfires. We release all trajectories, traces, and judge labels.
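To make the 2×2 latent-versus-behavioral framing concrete, the following is a minimal sketch of how a turn could be assigned to a cell and how a latent-correct rate at the behavioral flip could be computed. It is an illustration under stated assumptions, not the paper's implementation: the `Turn` type, the per-turn boolean labels, the cell names other than UC, and the choice of "first correct-to-wrong transition" as the flip point are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    # Hypothetical per-turn labels; the abstract does not specify how they are produced.
    trace_correct: bool   # latent: the reasoning trace still supports the right answer
    answer_correct: bool  # behavioral: the emitted answer is right


def cell(turn: Turn) -> str:
    """Map a turn to one of the four latent-vs-behavioral cells (names are assumptions)."""
    if turn.trace_correct and turn.answer_correct:
        return "hold"                   # latent correct, behavior correct
    if turn.trace_correct and not turn.answer_correct:
        return "UC"                     # unfaithful capitulation
    if not turn.trace_correct and turn.answer_correct:
        return "latent_wrong_only"      # latent wrong, behavior correct
    return "faithful_capitulation"      # latent wrong, behavior wrong


def first_flip(trajectory: List[Turn]) -> Optional[int]:
    """Index of the first turn whose answer goes from correct to wrong, if any."""
    for i in range(1, len(trajectory)):
        if trajectory[i - 1].answer_correct and not trajectory[i].answer_correct:
            return i
    return None


def latent_correct_rate_at_flip(trajectories: List[List[Turn]]) -> Optional[float]:
    """Among trajectories that flip, the fraction whose trace is still correct at the flip."""
    flipped = []
    for traj in trajectories:
        idx = first_flip(traj)
        if idx is not None:
            flipped.append(traj[idx])
    if not flipped:
        return None  # no behavioral flips, so the rate is undefined
    return sum(t.trace_correct for t in flipped) / len(flipped)


# Toy data (invented for illustration only).
toy = [
    [Turn(True, True), Turn(True, False)],   # flips while the trace stays correct: UC
    [Turn(True, True), Turn(False, False)],  # flips and the trace also goes wrong
    [Turn(True, True), Turn(True, True)],    # never flips
]
rate = latent_correct_rate_at_flip(toy)
```

Under this reading, trajectories that never flip are excluded from the denominator, so the rate describes only what the trace says at the moment the answer changes.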