
Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation

January 29, 2025
Authors: Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, Sergio Segura
cs.AI

Abstract

Large Language Models (LLMs) have become an integral part of our daily lives. However, they pose certain risks, including those that can harm individuals' privacy, perpetuate biases, and spread misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure their responsible deployment. Safety of LLMs is a key property that needs to be thoroughly tested before the model is deployed and made accessible to general users. This paper reports the external safety testing experience conducted by researchers from Mondragon University and the University of Seville on OpenAI's new o3-mini LLM as part of OpenAI's early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI's latest LLM.
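
For readers unfamiliar with this style of automated safety testing, the following is a minimal sketch of the loop the abstract describes: generate unsafe prompts for each safety category, execute them against the model under test, flag responses that an automated classifier deems unsafe, and queue flagged cases for manual verification. The function names (generate_unsafe_prompts, query_model, classify_response) and the categories used are hypothetical placeholders for illustration, not ASTRAL's actual API or the paper's exact category list.

```python
# Hypothetical sketch of an automated LLM safety-testing loop; the helper
# functions below are placeholders, not ASTRAL's real implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class TestCase:
    category: str   # safety category under test, e.g. "misinformation"
    prompt: str     # generated unsafe test input
    response: str   # output from the model under test
    verdict: str    # automated classification: "safe" or "unsafe"


def generate_unsafe_prompts(category: str, n: int) -> List[str]:
    # Placeholder: a real generator would produce up-to-date unsafe prompts
    # for the given safety category (e.g. with the help of another LLM).
    return [f"[{category}] unsafe test input #{i}" for i in range(n)]


def query_model(prompt: str) -> str:
    # Placeholder: send the prompt to the model under test (e.g. an o3-mini beta).
    return "model response to: " + prompt


def classify_response(response: str) -> str:
    # Placeholder: an automated safety classifier over the model output.
    return "safe"


def run_safety_suite(categories: List[str], prompts_per_category: int) -> List[TestCase]:
    flagged = []
    for category in categories:
        for prompt in generate_unsafe_prompts(category, prompts_per_category):
            response = query_model(prompt)
            verdict = classify_response(response)
            if verdict == "unsafe":
                # Flagged cases are kept for manual verification,
                # mirroring the workflow described in the abstract.
                flagged.append(TestCase(category, prompt, response, verdict))
    return flagged


if __name__ == "__main__":
    suspects = run_safety_suite(["privacy", "bias", "misinformation"],
                                prompts_per_category=5)
    print(f"{len(suspects)} responses flagged for manual review")
```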
