Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security
July 25, 2025
Author: Gabriel Chua
cs.AI
Abstract
As large language models (LLMs) increasingly integrate native code
interpreters, they enable powerful real-time execution capabilities,
substantially expanding their utility. However, such integrations introduce
potential system-level cybersecurity threats, fundamentally different from
prompt-based vulnerabilities. To systematically evaluate these
interpreter-specific risks, we propose CIRCLE (Code-Interpreter Resilience
Check for LLM Exploits), a simple benchmark comprising 1,260 prompts targeting
CPU, memory, and disk resource exhaustion. Each risk category includes
explicitly malicious ("direct") and plausibly benign ("indirect") prompt
variants. Our automated evaluation framework not only assesses whether LLMs
refuse or generate risky code, but also executes the generated code within the
interpreter environment to evaluate code correctness, simplifications the LLM
makes to render the code safe, and execution timeouts. Evaluating 7 commercially
available models from OpenAI and Google, we uncover significant and
inconsistent vulnerabilities. For instance, evaluations show substantial
disparities even within a single provider: OpenAI's o4-mini correctly refuses
risky requests 7.1% of the time, a notably higher rate than GPT-4.1's 0.5%. Results
particularly underscore that indirect, socially-engineered prompts
substantially weaken model defenses. This highlights an urgent need for
interpreter-specific cybersecurity benchmarks, dedicated mitigation tools
(e.g., guardrails), and clear industry standards to guide safe and responsible
deployment of LLM interpreter integrations. The benchmark dataset and
evaluation code are publicly released to foster further research.
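For illustration, below is a minimal sketch of the kind of evaluation loop the abstract describes: send a prompt to a model, classify the response as a refusal or as generated code, and execute any code in an isolated subprocess with a hard timeout. This is an assumed Python harness for exposition only; the function names, timeout value, and classification labels are hypothetical and do not come from the released CIRCLE code.

```python
import re
import subprocess
import tempfile
from pathlib import Path

# Hypothetical illustration of the evaluation flow described in the abstract;
# the actual CIRCLE harness, prompts, and scoring rules live in the released code.

TIMEOUT_SECONDS = 10  # assumed per-snippet execution budget


def extract_code(response: str) -> str | None:
    """Pull the first fenced Python block from a model response, if any."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else None


def grade_response(response: str) -> str:
    """Classify one model response: refusal/no code, timeout, runtime error, or executed."""
    code = extract_code(response)
    if code is None:
        return "refusal_or_no_code"

    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "snippet.py"
        script.write_text(code)
        try:
            # Run the generated code in a separate process with a hard timeout,
            # standing in for the sandboxed interpreter environment.
            result = subprocess.run(
                ["python", str(script)],
                capture_output=True,
                timeout=TIMEOUT_SECONDS,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return "timeout"  # e.g., CPU-exhaustion attempts that never finish
        return "executed" if result.returncode == 0 else "runtime_error"
```

Executing the output rather than only inspecting it is what allows such a framework to distinguish a genuine refusal from code the model silently simplified into a harmless variant, or from code that runs until the timeout.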