ChatPaper.aiChatPaper

OpenRT:面向多模态大语言模型的开源红队测试框架

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

January 4, 2026
作者: Xin Wang, Yunhao Chen, Juncheng Li, Yixu Wang, Yang Yao, Tianle Gu, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang, Xia Hu
cs.AI

摘要

多模态大语言模型(MLLMs)在关键应用中的快速集成正日益受到持续性安全漏洞的阻碍。然而,现有的红队测试基准往往碎片化,局限于单轮文本交互,且缺乏系统化评估所需的可扩展性。为此,我们推出OpenRT——一个统一、模块化、高吞吐的红队测试框架,旨在全面评估MLLM安全性。该框架的核心是通过引入对抗内核实现了自动化红队测试的范式革新,将模型集成、数据管理、攻击策略、评判方法和评估指标五个关键维度进行模块化分离。通过标准化攻击接口,它将对抗逻辑与高吞吐异步运行时解耦,实现了跨多样模型的系统化扩展。我们的框架整合了37种攻击方法,涵盖白盒梯度攻击、多模态扰动及复杂多智能体进化策略。通过对20个先进模型(包括GPT-5.2、Claude 4.5和Gemini 3 Pro)的大规模实证研究,我们揭示了关键安全缺陷:即使前沿模型也未能泛化至所有攻击范式,领先模型的平均攻击成功率高达49.14%。值得注意的是,研究发现推理模型并不天然具备针对复杂多轮越狱的更强鲁棒性。通过开源OpenRT,我们提供了一个可持续、可扩展且持续维护的基础设施,以加速AI安全领域的研发与标准化进程。
English
The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high-throughput red-teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high-throughput asynchronous runtime, enabling systematic scaling across diverse models. Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies. Through an extensive empirical study on 20 advanced models (including GPT-5.2, Claude 4.5, and Gemini 3 Pro), we expose critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models exhibiting average Attack Success Rates as high as 49.14%. Notably, our findings reveal that reasoning models do not inherently possess superior robustness against complex, multi-turn jailbreaks. By open-sourcing OpenRT, we provide a sustainable, extensible, and continuously maintained infrastructure that accelerates the development and standardization of AI safety.
PDF31January 8, 2026