The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

Abstract

Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants, which measure task-aligned geometric stability, predict linear steerability with near-perfect accuracy (ρ = 0.89-0.97) across 35-69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62-0.76). A critical dissociation emerges: unsupervised stability fails entirely to predict steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. Unsupervised stability does, however, excel at drift detection: it measures nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama), provides earlier warning in 73% of models, and maintains a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: the former for pre-deployment controllability assessment, the latter for post-deployment monitoring.
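
As a rough illustration of what "consistency of a representation's pairwise distance structure" could mean operationally, the sketch below scores a set of embeddings by splitting the feature dimensions into two random halves, computing pairwise distances in each half, and rank-correlating the two distance vectors. This split-half construction, the function names, and the synthetic data are assumptions made for illustration only; the abstract does not specify how the Shesha variants are computed, and this is not presented as that definition.

```python
import numpy as np


def pairwise_distances(x):
    """Return the Euclidean distances between all distinct row pairs of x."""
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    upper = np.triu_indices(x.shape[0], k=1)
    # Clip tiny negative values caused by floating-point error before sqrt.
    return np.sqrt(np.clip(d2[upper], 0.0, None))


def rank_correlation(a, b):
    """Spearman-style correlation; assumes no ties (continuous distances)."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


def split_half_stability(x, n_splits=20, seed=0):
    """Hypothetical unsupervised stability score (illustrative assumption).

    Averages, over random splits of the feature dimensions, the rank
    correlation between the pairwise distances computed in each half.
    """
    rng = np.random.default_rng(seed)
    dim = x.shape[1]
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(dim)
        half_a, half_b = perm[: dim // 2], perm[dim // 2:]
        scores.append(
            rank_correlation(
                pairwise_distances(x[:, half_a]),
                pairwise_distances(x[:, half_b]),
            )
        )
    return float(np.mean(scores))


# Synthetic data: embeddings driven by a shared 3-dimensional latent
# structure versus embeddings made of independent noise dimensions.
rng = np.random.default_rng(42)
latent = rng.normal(size=(60, 3))
structured = latent @ rng.normal(size=(3, 64)) + 0.1 * rng.normal(size=(60, 64))
noise = rng.normal(size=(60, 64))

structured_score = split_half_stability(structured)
noise_score = split_half_stability(noise)
print(f"structured: {structured_score:.3f}, noise: {noise_score:.3f}")
```

Under these assumptions, embeddings whose dimensions share a common latent geometry should score much higher than embeddings of independent noise, since both halves of the structured features encode the same distance relationships. A supervised, task-aligned variant would additionally need class labels, which this sketch does not use.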
