The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

Abstract

Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants, which measure task-aligned geometric stability, predict linear steerability with near-perfect accuracy (ρ = 0.89-0.97) across 35-69 embedding models and three NLP tasks, and they capture unique variance beyond class separability (partial ρ = 0.62-0.76). A critical dissociation emerges: unsupervised stability fails entirely to predict steering on real-world tasks (ρ ≈ 0.10), which shows that task alignment is essential for controllability prediction. Unsupervised stability nevertheless excels at drift detection: it measures nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama), provides earlier warning in 73% of models, and maintains a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
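To make "consistency of a representation's pairwise distance structure" concrete, the following is a minimal sketch and not the Shesha implementation. It assumes one possible reading of unsupervised stability: split the feature dimensions into two random halves, compute pairwise Euclidean distances in each half, and average the Spearman correlation between the two distance lists. The function names (`split_half_stability`, `rdm_drift`), the split-half scheme, and the choice of Euclidean distance are all illustrative assumptions. The drift score is likewise a hypothetical stand-in: one minus the rank correlation between distance structures before and after a model update.

```python
import math
import random


def pairwise_distances(points):
    """Euclidean distances for all unordered pairs of points, as a flat list."""
    n = len(points)
    return [math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)]


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(a, b):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)


def split_half_stability(embeddings, n_splits=20, seed=0):
    """Assumed stability score: mean rank agreement between the distance
    structures induced by two random halves of the feature dimensions."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    scores = []
    for _ in range(n_splits):
        idx = list(range(dim))
        rng.shuffle(idx)
        half_a, half_b = idx[: dim // 2], idx[dim // 2:]
        da = pairwise_distances([[row[k] for k in half_a] for row in embeddings])
        db = pairwise_distances([[row[k] for k in half_b] for row in embeddings])
        scores.append(spearman(da, db))
    return sum(scores) / len(scores)


def rdm_drift(before, after):
    """Assumed drift score: 1 - rank agreement of the pairwise distances of
    the same inputs embedded before and after a model update."""
    return 1.0 - spearman(pairwise_distances(before), pairwise_distances(after))


if __name__ == "__main__":
    rng = random.Random(1)
    latent = [rng.gauss(0, 1) for _ in range(30)]
    # Every dimension carries the same latent signal plus small noise.
    structured = [[z + rng.gauss(0, 0.1) for _ in range(16)] for z in latent]
    # Independent noise in every dimension: no shared distance structure.
    noise = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(30)]
    print("structured:", split_half_stability(structured))
    print("noise:     ", split_half_stability(noise))
```

Under these assumptions, a representation whose dimensions redundantly encode the same structure scores close to 1, while unstructured noise scores near 0. A supervised, task-aligned variant of the kind the abstract describes would additionally use labels, which this sketch does not attempt.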
