WalledEval：用于大型语言模型的全面安全评估工具包

摘要

WalledEval是一个全面的人工智能安全测试工具包，旨在评估大型语言模型（LLMs）。它适用于各种模型，包括开放权重和基于API的模型，并涵盖超过35个安全基准，涵盖多语言安全、夸张安全和提示注入等领域。该框架支持LLM和评测基准，并整合了自定义变异器，用于针对各种文本风格变异（如将来时和释义）测试安全性。此外，WalledEval引入了WalledGuard，一个新的、小巧且高效的内容管理工具，以及SGXSTest，一个用于评估文化背景下夸张安全性的基准。我们将WalledEval公开发布在https://github.com/walledai/walledevalA。

English

WalledEval is a comprehensive AI safety testing toolkit designed to evaluate large language models (LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones, and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections. The framework supports both LLM and judge benchmarking, and incorporates custom mutators to test safety against various text-style mutations such as future tense and paraphrasing. Additionally, WalledEval introduces WalledGuard, a new, small and performant content moderation tool, and SGXSTest, a benchmark for assessing exaggerated safety in cultural contexts. We make WalledEval publicly available at https://github.com/walledai/walledevalA.

WalledEval：用于大型语言模型的全面安全评估工具包

WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

摘要

Support