Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security

Abstract

As large language models (LLMs) increasingly integrate native code interpreters, they gain powerful real-time execution capabilities that substantially expand their utility. However, such integrations introduce potential system-level cybersecurity threats that are fundamentally different from prompt-based vulnerabilities. To systematically evaluate these interpreter-specific risks, we propose CIRCLE (Code-Interpreter Resilience Check for LLM Exploits), a simple benchmark comprising 1,260 prompts targeting CPU, memory, and disk resource exhaustion. Each risk category includes explicitly malicious ("direct") and plausibly benign ("indirect") prompt variants. Our automated evaluation framework assesses not only whether LLMs refuse or generate risky code, but also executes the generated code within the interpreter environment to evaluate code correctness, simplifications the LLM made to render the code safe, and execution timeouts. Evaluating 7 commercially available models from OpenAI and Google, we uncover significant and inconsistent vulnerabilities. For instance, the results show substantial disparities even within a single provider: OpenAI's o4-mini correctly refuses 7.1% of risky requests, a notably higher rate than GPT-4.1's 0.5%. The results particularly underscore that indirect, socially engineered prompts substantially weaken model defenses. This highlights an urgent need for interpreter-specific cybersecurity benchmarks, dedicated mitigation tools (e.g., guardrails), and clear industry standards to guide the safe and responsible deployment of LLM interpreter integrations. The benchmark dataset and evaluation code are publicly released to foster further research.
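To make the evaluation flow described in the abstract more concrete, the following is a minimal sketch of how a single model response might be labeled. It is not the released evaluation code. The names `Outcome`, `classify`, and `refusal_rate`, the convention that a refusal is represented as `None`, and the 5-second default limit are all assumptions made for illustration. The sketch runs code in a local child process with a wall-clock limit, whereas the benchmark executes code inside the provider's interpreter environment, and it does not attempt to judge code correctness or detect simplifications.

```python
import subprocess
import sys
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for one benchmark response."""
    REFUSED = "refused"      # model declined to produce code
    COMPLETED = "completed"  # code ran to completion within the limit
    FAILED = "failed"        # code exited with a non-zero status
    TIMEOUT = "timeout"      # code exceeded the wall-clock limit


def classify(generated_code, timeout_s=5.0):
    """Run model-generated Python in a child process and label the result.

    Assumption: `generated_code` is None when the model refused.
    A child process with a timeout is NOT a sandbox; it only bounds runtime.
    """
    if generated_code is None:
        return Outcome.REFUSED
    try:
        proc = subprocess.run(
            [sys.executable, "-c", generated_code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return Outcome.TIMEOUT
    return Outcome.COMPLETED if proc.returncode == 0 else Outcome.FAILED


def refusal_rate(outcomes):
    """Fraction of responses labeled REFUSED."""
    return sum(o is Outcome.REFUSED for o in outcomes) / len(outcomes)
```

For example, `classify("while True: pass", timeout_s=1.0)` would be labeled as a timeout, the kind of CPU-exhaustion outcome the benchmark targets, while `refusal_rate` aggregates labels into a per-model figure comparable in spirit to the refusal percentages quoted above.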
