Monitoring the Inner Monologue: Reasoning Dynamics Revealed by Probe Trajectories

Abstract

Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, which undermines its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory: the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling pass. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.
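
As a rough illustration of the probe-trajectory idea, the sketch below applies a linear probe to every token's hidden state and then summarizes the resulting probability sequence. It is a minimal sketch under stated assumptions, not the procedure used in this work: the logistic-regression probe form, the function names `probe_trajectory` and `trajectory_features`, the `tail` parameter, and the three specific feature definitions are all hypothetical choices made here for illustration, and the hidden states are random placeholders rather than real model activations.

```python
import numpy as np


def probe_trajectory(hidden_states, w, b):
    """Per-token probe probability: sigmoid(h_t . w + b) for each token t.

    hidden_states: array of shape (num_tokens, hidden_dim)
    w, b: weights and bias of an (assumed) linear probe
    """
    logits = hidden_states @ w + b
    return 1.0 / (1.0 + np.exp(-logits))


def trajectory_features(traj, tail=8):
    """Illustrative summary features of a probe trajectory.

    These definitions are assumptions; the abstract names the feature
    families (volatility, trend, steady state) but not their formulas.
    """
    steps = np.arange(len(traj))
    return {
        # Volatility: spread of token-to-token changes in probability.
        "volatility": float(np.std(np.diff(traj))),
        # Trend: slope of a least-squares line fitted to the trajectory.
        "trend": float(np.polyfit(steps, traj, 1)[0]),
        # Steady state: mean probability over the last `tail` tokens.
        "steady_state": float(np.mean(traj[-tail:])),
    }


# Placeholder data standing in for real activations and a trained probe.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(64, 16))   # 64 tokens, 16-dim hidden states
w = rng.normal(size=16)              # hypothetical probe weights
traj = probe_trajectory(hidden, w, 0.0)
features = trajectory_features(traj)
```

In a setting like the one described, a feature dictionary of this kind could be fed to a downstream classifier in place of a single static probe score; which features and which classifier work best is not something this sketch can establish.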