

Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer-wise Pooled Representations

June 16, 2025
作者: Abhilekh Borah, Chhavi Sharma, Danush Khanna, Utkarsh Bhatt, Gurpreet Singh, Hasnat Md Abdullah, Raghav Kaushik Ravi, Vinija Jain, Jyoti Patel, Shubham Singh, Vasu Sharma, Arpita Vats, Rahul Raja, Aman Chadha, Amitava Das
cs.AI

Abstract

Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains such as education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots: aligned models often remain vulnerable to jailbreaking, generation stochasticity, and alignment faking. To address this, we introduce the Alignment Quality Index (AQI), a novel geometric, prompt-invariant metric that empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignment and jailbreak risk even when outputs appear compliant. AQI also serves as an early-warning signal for alignment faking, offering a robust, decoding-invariant tool for behavior-agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across models trained under DPO, GRPO, and RLHF demonstrate that AQI correlates with external judges and reveals vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
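The four clustering-quality indices named in the abstract can all be computed from pooled safe/unsafe activations once they are extracted as arrays. The sketch below is a minimal illustration under stated assumptions: `aqi_sketch`, `dunn_index`, and `xie_beni_index` are hypothetical helper names, the activations are assumed to be pre-extracted NumPy matrices, and the paper's actual aggregation of the four indices into a single AQI score is not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score


def dunn_index(X, labels):
    """Dunn Index: min inter-cluster distance / max intra-cluster diameter.
    Higher values indicate better-separated, more compact clusters."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    inter = min(cdist(a, b).min()
                for i, a in enumerate(clusters) for b in clusters[i + 1:])
    intra = max(cdist(c, c).max() for c in clusters)
    return inter / intra


def xie_beni_index(X, labels):
    """Xie-Beni Index: within-cluster compactness over n times the minimum
    squared centroid separation. Lower values indicate cleaner clustering."""
    uniq = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in uniq])
    compact = sum(((X[labels == c] - centroids[i]) ** 2).sum()
                  for i, c in enumerate(uniq))
    sep = cdist(centroids, centroids)
    np.fill_diagonal(sep, np.inf)  # ignore zero self-distances
    return compact / (len(X) * sep.min() ** 2)


def aqi_sketch(safe_acts, unsafe_acts):
    """Score the latent separation of safe vs. unsafe activations using the
    four cluster-validity indices the abstract lists (DBS/DI/XBI/CHI)."""
    X = np.vstack([safe_acts, unsafe_acts])
    labels = np.array([0] * len(safe_acts) + [1] * len(unsafe_acts))
    return {
        "DBS": davies_bouldin_score(X, labels),     # lower = better
        "DI": dunn_index(X, labels),                # higher = better
        "XBI": xie_beni_index(X, labels),           # lower = better
        "CHI": calinski_harabasz_score(X, labels),  # higher = better
    }
```

On well-separated activations (e.g. two distant Gaussian blobs standing in for safe and unsafe representations), DI and CHI are large while DBS and XBI are near zero; a model whose "safe" and "unsafe" activations collapse into one cluster drags all four indices toward their degenerate values, which is the failure mode an intrinsic diagnostic like AQI is meant to expose.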