Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation

Abstract

Large Language Models (LLMs) have become an integral part of our daily lives. However, they pose certain risks, including harming individuals' privacy, perpetuating biases, and spreading misinformation. These risks highlight the need for robust safety mechanisms, ethical guidelines, and thorough testing to ensure responsible deployment. Safety is a key property of LLMs that needs to be thoroughly tested before a model is deployed and made accessible to general users. This paper reports the external safety testing experience conducted by researchers from Mondragon University and the University of Seville on OpenAI's new o3-mini LLM as part of OpenAI's early access for safety testing program. In particular, we apply our tool, ASTRAL, to automatically and systematically generate up-to-date unsafe test inputs (i.e., prompts) that help us test and assess different safety categories of LLMs. We automatically generate and execute a total of 10,080 unsafe test inputs on an early o3-mini beta version. After manually verifying the test cases classified as unsafe by ASTRAL, we identify a total of 87 actual instances of unsafe LLM behavior. We highlight key insights and findings uncovered during the pre-deployment external testing phase of OpenAI's latest LLM.