WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

Abstract

WalledEval is a comprehensive AI safety testing toolkit designed to evaluate large language models (LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones, and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections. The framework supports both LLM and judge benchmarking, and incorporates custom mutators to test safety against various text-style mutations such as future tense and paraphrasing. Additionally, WalledEval introduces WalledGuard, a new, small, and performant content moderation tool, and SGXSTest, a benchmark for assessing exaggerated safety in cultural contexts. We make WalledEval publicly available at https://github.com/walledai/walledevalA.

