
The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

April 20, 2026
作者: Prashant C. Raju
cs.AI

Abstract

Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy (ρ = 0.89–0.97) across 35–69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62–0.76). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama) while providing earlier warning in 73% of models and maintaining a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
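The abstract does not specify how the Shesha variants are computed, but the underlying quantity it defines, the consistency of a representation's pairwise distance structure, can be sketched directly. The function below is a hypothetical illustration (not the paper's method): it compares the rank order of pairwise distances between two representations of the same inputs, so a score near 1 means the geometry is preserved and a low score indicates drift.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def geometric_stability(X, Y):
    """Rank correlation of pairwise distance structure between two
    representations X, Y (n_samples x dim) of the same inputs.
    Illustrative proxy only; not the paper's Shesha metric."""
    dx = pdist(X)  # condensed pairwise Euclidean distances
    dy = pdist(Y)
    rho, _ = spearmanr(dx, dy)
    return rho

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))

# An orthogonal rotation plus uniform scaling preserves distance ranks,
# so stability should be ~1.0.
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
Y_stable = 2.0 * X @ Q

# Heavy additive noise disrupts the distance structure (simulated drift).
Y_drifted = X + 5.0 * rng.normal(size=X.shape)

print(geometric_stability(X, Y_stable))   # close to 1.0
print(geometric_stability(X, Y_drifted))  # much lower
```

In this unsupervised form the score uses no labels, matching the drift-detection setting; a task-aligned (supervised) variant would instead restrict or weight the comparison by task structure, which is what the abstract argues is necessary for steerability prediction.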