WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

Abstract

WalledEval is a comprehensive AI safety testing toolkit designed to evaluate large language models (LLMs). It accommodates a diverse range of models, including both open-weight and API-based ones, and features over 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections. The framework supports both LLM and judge benchmarking, and incorporates custom mutators to test safety against various text-style mutations such as future tense and paraphrasing. Additionally, WalledEval introduces WalledGuard, a new, small, and performant content moderation tool, and SGXSTest, a benchmark for assessing exaggerated safety in cultural contexts. We make WalledEval publicly available at https://github.com/walledai/walledeval.
