Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

Abstract

Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon that we term the refusal cliff: many poorly aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but their refusal scores drop sharply at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that negatively contribute to refusal behavior. Ablating just 3% of these heads can reduce attack success rates to below 10%. Building on these mechanistic insights, we propose Cliff-as-a-Judge, a novel data selection method that identifies the training examples exhibiting the largest refusal cliff in order to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.
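
To make the probing and selection ideas above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a linear probe (hypothetical weights `probe_w`, `probe_b`) has already been trained on hidden states, that the "cliff" can be measured as the drop from the mean refusal score over the thinking tokens to the mean over the last few tokens (the `tail` window is an illustrative choice), and that selection simply keeps the examples with the largest drop. The abstract does not specify any of these details.

```python
import numpy as np

def refusal_scores(hidden_states, probe_w, probe_b):
    """Apply a linear probe to per-token hidden states.

    hidden_states: array of shape (num_tokens, hidden_dim).
    Returns one refusal probability per token position.
    """
    logits = hidden_states @ probe_w + probe_b
    return 1.0 / (1.0 + np.exp(-logits))

def cliff_size(scores, tail=2):
    """Assumed cliff measure: mean score over the thinking tokens
    minus mean score over the final `tail` tokens."""
    plateau = scores[:-tail].mean()
    final = scores[-tail:].mean()
    return float(plateau - final)

def select_by_cliff(all_scores, fraction):
    """Return indices of the examples with the largest cliff,
    keeping roughly `fraction` of the dataset (at least one)."""
    cliffs = np.array([cliff_size(s) for s in all_scores])
    k = max(1, int(round(fraction * len(cliffs))))
    return np.argsort(-cliffs)[:k].tolist()

# Toy per-token refusal scores for three hypothetical prompts.
toy_scores = [
    np.array([0.9, 0.9, 0.9, 0.1, 0.1]),  # sharp drop at the end
    np.array([0.9, 0.9, 0.9, 0.9, 0.9]),  # no drop
    np.array([0.8, 0.8, 0.8, 0.5, 0.5]),  # moderate drop
]
selected = select_by_cliff(toy_scores, fraction=0.34)
```

In this toy setup the first prompt shows the sharpest drop, so it is the one retained when only about a third of the examples are kept. A real pipeline would obtain the hidden states from the model itself and would need the paper's actual cliff definition.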
