Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
May 20, 2025
Authors: Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
cs.AI
Abstract
Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
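To make the aggregation step concrete, the sketch below shows one way a model's pairwise choices across dilemmas could be turned into a value-priority ranking. The Bradley-Terry-style fit, the value names, and the sample outcomes are illustrative assumptions for this sketch, not the paper's actual LitmusValues implementation.

```python
from collections import defaultdict


def value_priorities(choices, values, iters=100):
    """Rank value classes from pairwise dilemma outcomes.

    choices: list of (winner_value, loser_value) tuples, one per dilemma,
             where the model chose to uphold winner_value over loser_value.
    values:  list of all value-class names under consideration.
    """
    wins = defaultdict(float)         # total wins per value
    pair_counts = defaultdict(float)  # how often each pair of values met
    for winner, loser in choices:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0

    # Bradley-Terry strengths, fitted with the standard MM update:
    #   p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
    strength = {v: 1.0 for v in values}
    for _ in range(iters):
        updated = {}
        for v in values:
            denom = sum(
                pair_counts[frozenset((v, u))] / (strength[v] + strength[u])
                for u in values
                if u != v and strength[v] + strength[u] > 0
            )
            updated[v] = wins[v] / denom if denom > 0 else strength[v]
        total = sum(updated.values()) or 1.0
        strength = {v: s / total for v, s in updated.items()}  # normalize

    return sorted(strength.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative outcomes only: the value names echo the paper's categories,
# but the data and the aggregation method are assumptions for this sketch.
sample_choices = [
    ("Care", "Truthfulness"),
    ("Care", "Autonomy"),
    ("Truthfulness", "Autonomy"),
    ("Care", "Truthfulness"),
]
print(value_priorities(sample_choices, values=["Care", "Truthfulness", "Autonomy"]))
```

Under this reading, a ranking aggregated from many dilemmas gives the self-consistent value priorities the abstract describes; which values a model consistently deprioritizes in safety-relevant scenarios is then what serves as the risk signal.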