The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability
April 20, 2026
Author: Prashant C. Raju
cs.AI
Abstract
Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy (ρ = 0.89-0.97) across 35-69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62-0.76). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama) while providing earlier warning in 73% of models and maintaining a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
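To make the core quantity concrete: geometric stability compares the pairwise distance structure of two representations of the same inputs. The sketch below is an illustrative implementation under a simple reading of that definition (rank correlation of condensed distance matrices); the paper's actual Shesha metrics and their supervised, task-aligned variants may differ in detail.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def geometric_stability(X_a, X_b):
    """Illustrative stability score between two representations of the
    same n inputs (rows aligned across X_a and X_b).

    Computes the Spearman correlation between the two condensed pairwise
    Euclidean distance vectors: 1.0 means the distance structure is
    perfectly preserved; values near 0 mean it has been scrambled.
    """
    d_a = pdist(X_a)  # condensed vector of n*(n-1)/2 pairwise distances
    d_b = pdist(X_b)
    rho, _ = spearmanr(d_a, d_b)
    return rho

# A representation is maximally stable against itself, and against any
# orthogonal rotation of itself, since rotations preserve all distances.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
print(geometric_stability(X, X))      # exactly 1.0
print(geometric_stability(X, X @ Q))  # ~1.0 up to float error
```

In a drift-monitoring setting, `X_a` and `X_b` would be embeddings of a fixed probe set taken from a model before and after post-training, with a drop in the score flagging geometric change.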