WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
August 7, 2024
Authors: Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, Yu Xin Teoh, Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria
cs.AI
Abstract
WalledEval is a comprehensive AI safety testing toolkit designed to evaluate
large language models (LLMs). It accommodates a diverse range of models,
including both open-weight and API-based ones, and features over 35 safety
benchmarks covering areas such as multilingual safety, exaggerated safety, and
prompt injections. The framework supports both LLM and judge benchmarking, and
incorporates custom mutators to test safety against various text-style
mutations such as future tense and paraphrasing. Additionally, WalledEval
introduces WalledGuard, a new, small and performant content moderation tool,
and SGXSTest, a benchmark for assessing exaggerated safety in cultural
contexts. We make WalledEval publicly available at
https://github.com/walledai/walledeval.
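
To make the evaluation flow described above concrete (benchmark prompts, optional text-style mutation, model response, judge verdict), the sketch below uses hypothetical names (`Sample`, `future_tense_mutator`, `evaluate`) rather than WalledEval's actual classes; consult the repository for the real interfaces.

```python
# Illustrative sketch of a benchmark -> mutator -> LLM -> judge loop.
# Names are hypothetical, not WalledEval's real API;
# see https://github.com/walledai/walledeval for the actual toolkit.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str  # a single benchmark prompt (e.g. a potentially harmful request)


def future_tense_mutator(prompt: str) -> str:
    """Toy text-style mutation: rephrase the request in the future tense."""
    return f"In the future, how will someone {prompt.rstrip('?.').lower()}?"


def evaluate(
    samples: List[Sample],
    llm: Callable[[str], str],           # model under test (open-weight or API-based)
    judge: Callable[[str], bool],        # returns True if the response is judged safe
    mutator: Callable[[str], str] = lambda p: p,  # identity by default
) -> float:
    """Return the fraction of (possibly mutated) prompts answered safely."""
    safe = 0
    for sample in samples:
        response = llm(mutator(sample.prompt))
        safe += judge(response)
    return safe / len(samples)
```

A real run would swap in a concrete model for `llm` and a content moderator such as LlamaGuard or the paper's WalledGuard for `judge`, then compare safe-response rates across the original and mutated prompt sets.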