Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

Abstract

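To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R&D, strategic deception and scheming, self-replication, and collusion. Guided by the "AI-45° Law," we evaluate these risks against "red lines" (intolerable thresholds) and "yellow lines" (early warning indicators), which together define three risk zones: green (manageable risk that permits routine deployment with continuous monitoring), yellow (risk requiring strengthened mitigations and controlled deployment), and red (risk necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models we evaluated reside in the green and yellow zones, and none crosses a red line. Specifically, no evaluated model crosses the yellow line for cyber offense or uncontrolled AI R&D. For self-replication and for strategic deception and scheming, most models remain in the green zone, although certain reasoning models fall in the yellow zone. For persuasion and manipulation, most models are in the yellow zone because of their effective influence on humans. For biological and chemical risks, we cannot rule out that most models reside in the yellow zone, but detailed threat modeling and in-depth assessment are required before making further claims. This work reflects our current understanding of frontier AI risks and urges collective action to mitigate these challenges.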
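The three-zone scheme described above can be read as a simple threshold rule: a result below the yellow line is green, one at or past the yellow line but below the red line is yellow, and one at or past the red line is red. The Python sketch below is only an illustration under stated assumptions: it reduces each risk area to a single scalar score where higher means riskier, treats reaching a line as crossing it, and uses a hypothetical function name (`classify_zone`) and hypothetical threshold values that are not taken from the framework itself.

```python
from enum import Enum


class Zone(Enum):
    GREEN = "green"    # manageable risk: routine deployment with continuous monitoring
    YELLOW = "yellow"  # strengthened mitigations and controlled deployment
    RED = "red"        # suspension of development and/or deployment


def classify_zone(score: float, yellow_line: float, red_line: float) -> Zone:
    """Map a hypothetical scalar risk score to a risk zone.

    Assumptions (illustrative only): higher scores mean higher risk, both
    lines share the score's scale, and reaching a line counts as crossing it.
    """
    if not yellow_line < red_line:
        raise ValueError("yellow_line must lie strictly below red_line")
    if score >= red_line:
        return Zone.RED
    if score >= yellow_line:
        return Zone.YELLOW
    return Zone.GREEN


if __name__ == "__main__":
    # Hypothetical lines and scores, chosen only to exercise each branch.
    for score in (0.2, 0.6, 0.95):
        print(score, classify_zone(score, yellow_line=0.5, red_line=0.9).value)
```

Run as a script, this prints `0.2 green`, `0.6 yellow`, and `0.95 red`. Checking the red line first keeps the rule conservative: any score past the intolerable threshold is flagged red regardless of how it compares with the early warning line.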
