Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report
July 22, 2025
Authors: Shanghai AI Lab, Xiaoyang Chen, Yunhao Chen, Zeren Chen, Zhiyun Chen, Hanyun Cui, Yawen Duan, Jiaxuan Guo, Qi Guo, Xuhao Hu, Hong Huang, Lige Huang, Chunxiao Li, Juncheng Li, Qihao Lin, Dongrui Liu, Xinmin Liu, Zicheng Liu, Chaochao Lu, Xiaoya Lu, Jingjing Qu, Qibing Ren, Jing Shao, Jingwei Shi, Jingwei Sun, Peng Wang, Weibing Wang, Jia Xu, Lewen Yan, Xiao Yu, Yi Yu, Boxuan Zhang, Jie Zhang, Weichen Zhang, Zhijie Zheng, Tianyi Zhou, Bowen Zhou
cs.AI
Abstract
To understand and identify the unprecedented risks posed by rapidly advancing
artificial intelligence (AI) models, this report presents a comprehensive
assessment of their frontier risks. Drawing on the E-T-C analysis (deployment
environment, threat source, enabling capability) from the Frontier AI Risk
Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks
in seven areas: cyber offense, biological and chemical risks, persuasion and
manipulation, uncontrolled autonomous AI R&D, strategic deception and
scheming, self-replication, and collusion. Guided by the "AI-45° Law,"
we evaluate these risks using "red lines" (intolerable thresholds) and "yellow
lines" (early warning indicators) to define risk zones: green (manageable risk
for routine deployment and continuous monitoring), yellow (requiring
strengthened mitigations and controlled deployment), and red (necessitating
suspension of development and/or deployment). Experimental results show that
all recent frontier AI models reside in green and yellow zones, without
crossing red lines. Specifically, no evaluated model crosses the yellow line for
cyber offense or uncontrolled AI R&D risks. For self-replication and for
strategic deception and scheming, most models remain in the green zone, except
for certain reasoning models in the yellow zone. In persuasion and
manipulation, most models are in the yellow zone due to their effective
influence on humans. For biological and chemical risks, we are unable to rule
out the possibility of most models residing in the yellow zone, although
detailed threat modeling and in-depth assessment are required to make further
claims. This work reflects our current understanding of AI frontier risks and
urges collective action to mitigate these challenges.