Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer-wise Pooled Representations
June 16, 2025
作者: Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das
cs.AI
Abstract
Alignment is no longer a luxury; it is a necessity. As large language models
(LLMs) enter high-stakes domains like education, healthcare, governance, and
law, their behavior must reliably reflect human-aligned values and safety
constraints. Yet current evaluations rely heavily on behavioral proxies such as
refusal rates, G-Eval scores, and toxicity classifiers, all of which have
critical blind spots. Aligned models are often vulnerable to jailbreaking,
stochasticity of generation, and alignment faking.
To address this issue, we introduce the Alignment Quality Index (AQI). This
novel geometric and prompt-invariant metric empirically assesses LLM alignment
by analyzing the separation of safe and unsafe activations in latent space. By
combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI),
Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various
formulations, AQI captures clustering quality to detect hidden misalignments
and jailbreak risks, even when outputs appear compliant. AQI also serves as an
early warning signal for alignment faking, offering a robust, decoding-invariant
tool for behavior-agnostic safety auditing.
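The cluster-quality indices named above have standard off-the-shelf implementations. A minimal sketch of the underlying idea, using toy Gaussian vectors as stand-ins for pooled safe/unsafe LLM activations (the composite formula here is purely illustrative and is not the paper's AQI definition):

```python
# Sketch: measure how well "safe" vs "unsafe" activations separate in latent
# space using two of the indices mentioned in the abstract (DBS and CHI).
# The activation data is synthetic; real usage would pool LLM hidden states.
import numpy as np
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Toy 64-dimensional "activations": two well-separated Gaussian clusters.
safe = rng.normal(loc=0.0, scale=1.0, size=(200, 64))
unsafe = rng.normal(loc=4.0, scale=1.0, size=(200, 64))

X = np.vstack([safe, unsafe])
labels = np.array([0] * 200 + [1] * 200)

dbs = davies_bouldin_score(X, labels)      # lower  = better separation
chi = calinski_harabasz_score(X, labels)   # higher = better separation

# Illustrative composite only: map both indices onto [0, 1] so that larger
# values indicate cleaner safe/unsafe separation, then average them.
composite = 0.5 * (1.0 / (1.0 + dbs)) + 0.5 * (chi / (chi + 1.0))
print(f"DBS={dbs:.3f}  CHI={chi:.1f}  composite={composite:.3f}")
```

In this setup a low Davies-Bouldin score and a high Calinski-Harabasz score indicate that the two activation clusters are geometrically well separated; a misaligned or jailbroken model would be expected to show weaker separation even when its outputs look compliant.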
Additionally, we propose the LITMUS dataset to facilitate robust evaluation
under these challenging conditions. Empirical tests on LITMUS across different
models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's
correlation with external judges and ability to reveal vulnerabilities missed
by refusal metrics. We make our implementation publicly available to foster
future research in this area.