RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
January 7, 2026
Authors: Quy-Anh Dang, Chris Ngo, Truong-Son Hy
cs.AI
Abstract
As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: https://github.com/knoveleng/redeval
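As a concrete illustration of how an aggregated red-teaming dataset of this kind might be consumed, the minimal sketch below loads prompts, keeps only attack samples, and tallies them per risk category and domain. The file name and every field name (prompt, prompt_type, risk_category, domain) are illustrative assumptions, not the actual RedBench schema; consult the repository at https://github.com/knoveleng/redeval for the real format and evaluation code.

```python
# Minimal sketch, assuming a JSONL export with one prompt per line and
# hypothetical fields: prompt, prompt_type ("attack"/"refusal"),
# risk_category, domain. These names are illustrative, not the official schema.
import json
from collections import Counter

def load_samples(path: str) -> list[dict]:
    """Read one JSON object per line (an attack or refusal prompt)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def summarize(samples: list[dict]) -> None:
    """Print sample counts overall and per risk category / domain."""
    by_risk = Counter(s["risk_category"] for s in samples)
    by_domain = Counter(s["domain"] for s in samples)
    print(f"{len(samples)} samples, "
          f"{len(by_risk)} risk categories, {len(by_domain)} domains")
    for risk, n in by_risk.most_common(5):
        print(f"  {risk}: {n}")

if __name__ == "__main__":
    samples = load_samples("redbench.jsonl")  # hypothetical file name
    attacks = [s for s in samples if s["prompt_type"] == "attack"]
    summarize(attacks)
```

A per-category breakdown like this is the kind of view a standardized taxonomy (22 risk categories, 19 domains) makes possible across all 37 source benchmarks, since every sample carries the same labels regardless of its original dataset.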