
The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

April 20, 2026
作者: Prashant C. Raju
cs.AI

Abstract

Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy (ρ = 0.89–0.97) across 35–69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62–0.76). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama) while providing earlier warning in 73% of models and maintaining a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
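The abstract does not specify how the Shesha variants are computed, but the underlying quantity it defines, the consistency of a representation's pairwise distance structure, can be sketched directly. The function below is a hypothetical illustration (not the paper's method): it compares the rank order of pairwise distances between two representations of the same inputs, so a score near 1 means the geometry is preserved and a low score indicates drift.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def geometric_stability(X, Y):
    """Rank correlation of pairwise distance structure between two
    representations X, Y (n_samples x dim) of the same inputs.
    Illustrative proxy only; not the paper's Shesha metric."""
    dx = pdist(X)  # condensed pairwise Euclidean distances
    dy = pdist(Y)
    rho, _ = spearmanr(dx, dy)
    return rho

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))

# An orthogonal rotation plus uniform scaling preserves distance ranks,
# so stability should be ~1.0.
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
Y_stable = 2.0 * X @ Q

# Heavy additive noise disrupts the distance structure (simulated drift).
Y_drifted = X + 5.0 * rng.normal(size=X.shape)

print(geometric_stability(X, Y_stable))   # close to 1.0
print(geometric_stability(X, Y_drifted))  # much lower
```

In this unsupervised form the score uses no labels, matching the drift-detection setting; a task-aligned (supervised) variant would instead restrict or weight the comparison by task structure, which is what the abstract argues is necessary for steerability prediction.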