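Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer-wise Pooled Representations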

Abstract

Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains such as education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI), a novel geometric and prompt-invariant metric that empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding-invariant tool for behavior-agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and its ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
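The abstract names four cluster-validity indices but does not say how AQI combines them or how activations are pooled. As a rough illustration only, the sketch below computes the four indices from their standard textbook definitions on synthetic Gaussian points that stand in for "safe" and "unsafe" activations. The names `cluster_indices` and `toy_activations`, the NumPy-only implementation, and the toy data are all assumptions made for this example; it is not the authors' implementation and it does not compute an AQI score.

```python
import numpy as np


def cluster_indices(X, y):
    """Standard cluster-validity indices for labeled points.

    X: (n, d) array of points; y: (n,) array of cluster labels.
    Higher CHI and DI, and lower XBI and DBS, indicate better separation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)
    n, k = len(X), len(labels)
    groups = [X[y == lab] for lab in labels]
    cents = np.stack([g.mean(axis=0) for g in groups])
    overall = X.mean(axis=0)

    # Within-cluster (W) and between-cluster (B) sums of squares.
    W = sum(((g - c) ** 2).sum() for g, c in zip(groups, cents))
    B = sum(len(g) * ((c - overall) ** 2).sum() for g, c in zip(groups, cents))

    # Pairwise centroid distances; `off` selects the off-diagonal entries.
    cd = np.linalg.norm(cents[:, None] - cents[None, :], axis=-1)
    off = ~np.eye(k, dtype=bool)

    chi = (B / (k - 1)) / (W / (n - k))      # Calinski-Harabasz
    xbi = W / (n * (cd[off] ** 2).min())     # Xie-Beni (crisp labels)

    # Davies-Bouldin: mean over clusters of the worst (s_i + s_j) / d_ij,
    # where s_i is the mean distance of a cluster's points to its centroid.
    s = np.array([np.linalg.norm(g - c, axis=1).mean()
                  for g, c in zip(groups, cents)])
    ratio = (s[:, None] + s[None, :]) / np.where(off, cd, np.inf)
    dbs = ratio.max(axis=1).mean()

    # Dunn: smallest between-cluster point distance over largest diameter.
    def pdist(a, b):
        return np.linalg.norm(a[:, None] - b[None, :], axis=-1)

    min_between = min(pdist(groups[i], groups[j]).min()
                      for i in range(k) for j in range(i + 1, k))
    max_diameter = max(pdist(g, g).max() for g in groups)
    dunn = min_between / max_diameter

    return {"CHI": float(chi), "XBI": float(xbi),
            "DBS": float(dbs), "DI": float(dunn)}


def toy_activations(gap, n=100, d=16, seed=0):
    """Hypothetical stand-in for activations: two Gaussian blobs whose
    means differ by `gap` along the first coordinate."""
    rng = np.random.default_rng(seed)
    safe = rng.normal(size=(n, d))
    unsafe = rng.normal(size=(n, d))
    unsafe[:, 0] += gap
    X = np.vstack([safe, unsafe])
    y = np.array([0] * n + [1] * n)
    return X, y


if __name__ == "__main__":
    # Compare a well-separated configuration with a heavily overlapping one.
    for gap in (12.0, 0.5):
        X, y = toy_activations(gap)
        print(gap, {name: round(v, 3)
                    for name, v in cluster_indices(X, y).items()})
```

In this toy setting the four indices agree on which configuration is better separated, which is the property a geometric alignment diagnostic would rely on. Real use would replace `toy_activations` with hidden states extracted from a model, a step this sketch does not attempt.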