Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
May 20, 2025
Authors: Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
cs.AI
Abstract
Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
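To make the aggregation step concrete, the sketch below shows one way a model's pairwise choices across dilemmas could be turned into a value-priority ranking. The Bradley-Terry-style fit, the value names, and the sample outcomes are illustrative assumptions for this sketch, not the paper's actual LitmusValues implementation.

```python
from collections import defaultdict


def value_priorities(choices, values, iters=100):
    """Rank value classes from pairwise dilemma outcomes.

    choices: list of (winner_value, loser_value) tuples, one per dilemma,
             where the model chose to uphold winner_value over loser_value.
    values:  list of all value-class names under consideration.
    """
    wins = defaultdict(float)         # total wins per value
    pair_counts = defaultdict(float)  # how often each pair of values met
    for winner, loser in choices:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0

    # Bradley-Terry strengths, fitted with the standard MM update:
    #   p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
    strength = {v: 1.0 for v in values}
    for _ in range(iters):
        updated = {}
        for v in values:
            denom = sum(
                pair_counts[frozenset((v, u))] / (strength[v] + strength[u])
                for u in values
                if u != v and strength[v] + strength[u] > 0
            )
            updated[v] = wins[v] / denom if denom > 0 else strength[v]
        total = sum(updated.values()) or 1.0
        strength = {v: s / total for v, s in updated.items()}  # normalize

    return sorted(strength.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative outcomes only: the value names echo the paper's categories,
# but the data and the aggregation method are assumptions for this sketch.
sample_choices = [
    ("Care", "Truthfulness"),
    ("Care", "Autonomy"),
    ("Truthfulness", "Autonomy"),
    ("Care", "Truthfulness"),
]
print(value_priorities(sample_choices, values=["Care", "Truthfulness", "Autonomy"]))
```

Under this reading, a ranking aggregated from many dilemmas gives the self-consistent value priorities the abstract describes; which values a model consistently deprioritizes in safety-relevant scenarios is then what serves as the risk signal.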