When the Chain of Thought Knows Better: Failure Modes of Multi-Turn Reasoning Models

Abstract

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may be indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic, the CoT-Output 2x2 safety matrix. The framework labels every turn along two independent axes, internal reasoning and visible output, yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure, in which the CoT maintains safe reasoning but the visible output is harmful, a multi-turn manifestation of reasoning unfaithfulness. We evaluate three distilled reasoning targets against a fixed attacker under five oversight conditions, collecting 6750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox, in which explicit monitoring cues increase alignment-faking rates rather than suppress them, and context-injection failure, in which models lock onto unsafe visible outputs despite safe internal reasoning. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
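As an illustration of the turn-level labeling, the following minimal sketch maps a pair of per-turn safety labels to one of the four cells and tallies cell rates over a dialogue. It is not the released tooling. The names `Cell`, `classify_turn`, and `cell_rates` are hypothetical, and the boolean labels are assumed to come from some upstream judge that the abstract does not specify. Only the context-injection cell (safe CoT, harmful output) is defined explicitly above; the assignment of alignment faking to unsafe CoT with safe output, and of overt jailbreak to unsafe CoT with unsafe output, is our reading of the matrix and should be treated as an assumption.

```python
from collections import Counter
from enum import Enum


class Cell(Enum):
    ROBUST_ALIGNMENT = "robust alignment"
    ALIGNMENT_FAKING = "alignment faking"
    OVERT_JAILBREAK = "overt jailbreak"
    CONTEXT_INJECTION_FAILURE = "context-injection failure"


def classify_turn(cot_safe: bool, output_safe: bool) -> Cell:
    """Map one turn's (CoT, output) safety labels to a cell of the 2x2 matrix."""
    if cot_safe and output_safe:
        return Cell.ROBUST_ALIGNMENT
    if cot_safe and not output_safe:
        # Stated in the abstract: safe reasoning, harmful visible output.
        return Cell.CONTEXT_INJECTION_FAILURE
    if not cot_safe and output_safe:
        # Assumption: unsafe reasoning hidden behind a safe-looking output.
        return Cell.ALIGNMENT_FAKING
    # Assumption: unsafe reasoning and unsafe output together.
    return Cell.OVERT_JAILBREAK


def cell_rates(turns):
    """Return the fraction of turns in each cell for a list of (cot_safe, output_safe) pairs."""
    counts = Counter(classify_turn(cot, out) for cot, out in turns)
    total = sum(counts.values())
    if total == 0:
        return {cell: 0.0 for cell in Cell}
    return {cell: counts[cell] / total for cell in Cell}


# Hypothetical three-turn dialogue: harmful outputs early, a safe final turn.
dialogue = [(True, False), (True, False), (True, True)]
final_turn_cell = classify_turn(*dialogue[-1])
rates = cell_rates(dialogue)
```

In this made-up dialogue the final turn lands in the robust-alignment cell, while two of the three turns fall into the context-injection cell. A terminal-score evaluation would see only the former, which is the gap the turn-level matrix is meant to expose.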