
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas

May 20, 2025
Authors: Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
cs.AI

Abstract

Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can serve as an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
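
The abstract does not specify how aggregate choices are turned into a value ranking, but the idea can be illustrated with a minimal sketch: treat each dilemma as a pairwise contest between two value classes and rank values by how often each is prioritized. The value names, record format, and win-rate aggregation below are illustrative assumptions, not the paper's actual taxonomy or method.

```python
from collections import defaultdict

# Hypothetical dilemma outcomes: each entry records that, in one dilemma,
# the model's choice favored `winner` over `loser`. (Value names are
# illustrative placeholders, not the paper's full set of value classes.)
choices = [
    ("Care", "Truthfulness"),
    ("Truthfulness", "Obedience"),
    ("Care", "Obedience"),
]

def rank_values(choices):
    """Rank value classes by win rate across pairwise dilemma choices."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for winner, loser in choices:
        wins[winner] += 1
        totals[winner] += 1
        totals[loser] += 1
    # Win rate = fraction of a value's dilemmas in which it was prioritized.
    return sorted(
        ((v, wins[v] / totals[v]) for v in totals),
        key=lambda pair: pair[1],
        reverse=True,
    )

for value, rate in rank_values(choices):
    print(f"{value}: prioritized in {rate:.0%} of its dilemmas")
```

A simple win rate like this yields one self-consistent ordering over many dilemmas; a pairwise-comparison model (e.g., Bradley-Terry) would be a natural refinement, though which aggregation the authors use is not stated here.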
