Introducing v0.5 of the AI Safety Benchmark from MLCommons
April 18, 2024
Authors: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, Ram Gandikota, Agasthya Gangavarapu, Ananya Gangavarapu, James Gealy, Rajat Ghosh, James Goel, Usman Gohar, Sujata Goswami, Scott A. Hale, Wiebke Hutiri, Joseph Marvin Imperial, Surgan Jandial, Nick Judd, Felix Juefei-Xu, Foutse Khomh, Bhavya Kailkhura, Hannah Rose Kirk, Kevin Klyman, Chris Knotz, Michael Kuchnik, Shachi H. Kumar, Chris Lengerich, Bo Li, Zeyi Liao, Eileen Peters Long, Victor Lu, Yifan Mai, Priyanka Mary Mammen, Kelvin Manyeki, Sean McGregor, Virendra Mehta, Shafee Mohammed, Emanuel Moss, Lama Nachman, Dinesh Jinenhally Naganna, Amin Nikanjam, Besmira Nushi, Luis Oala, Iftach Orr, Alicia Parrish, Cigdem Patlak, William Pietri, Forough Poursabzi-Sangdeh, Eleonora Presani, Fabrizio Puletti, Paul Röttger, Saurav Sahay, Tim Santos, Nino Scherrer, Alice Schoenauer Sebag, Patrick Schramowski, Abolfazl Shahbazi, Vin Sharma, Xudong Shen, Vamsi Sistla, Leonard Tang, Davide Testuggine, Vithursan Thangarasa, Elizabeth Anne Watkins, Rebecca Weiss, Chris Welty, Tyler Wilbers, Adina Williams, Carole-Jean Wu, Poonam Yadav, Xianjun Yang, Yi Zeng, Wenhui Zhang, Fedor Zhdanov, Jiacheng Zhu, Percy Liang, Peter Mattson, Joaquin Vanschoren
cs.AI
Abstract
This paper introduces v0.5 of the AI Safety Benchmark, which has been created
by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been
designed to assess the safety risks of AI systems that use chat-tuned language
models. We introduce a principled approach to specifying and constructing the
benchmark, which for v0.5 covers only a single use case (an adult chatting to a
general-purpose assistant in English), and a limited set of personas (i.e.,
typical users, malicious users, and vulnerable users). We created a new
taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark.
We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024.
The v1.0 benchmark will provide meaningful insights into the safety of AI
systems. However, the v0.5 benchmark should not be used to assess the safety of
AI systems. We have sought to fully document the limitations, flaws, and
challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes
(1) a principled approach to specifying and constructing the benchmark, which
comprises use cases, types of systems under test (SUTs), language and context,
personas, tests, and test items; (2) a taxonomy of 13 hazard categories with
definitions and subcategories; (3) tests for seven of the hazard categories,
each comprising a unique set of test items, i.e., prompts. There are 43,090
test items in total, which we created with templates; (4) a grading system for
AI systems against the benchmark; (5) an openly available platform and
downloadable tool, called ModelBench, that can be used to evaluate the safety of
AI systems on the benchmark; (6) an example evaluation report which benchmarks
the performance of over a dozen openly available chat-tuned language models;
(7) a test specification for the benchmark.
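
The abstract notes that the 43,090 test items in item (3) were created with templates. As an illustration only, the following Python sketch shows one plausible shape for template-based prompt generation; the template strings, placeholder values, and hazard label here are hypothetical stand-ins, not the benchmark's actual templates or hazard wordings.

```python
from itertools import product

# Hypothetical sentence templates standing in for the benchmark's real
# templates; the actual v0.5 templates differ.
TEMPLATES = [
    "How do I {action}?",
    "Explain the easiest way to {action}.",
]

# Hypothetical placeholder values, keyed by a made-up hazard category name.
ACTIONS = {
    "example_hazard": ["do X", "do Y"],
}

def generate_test_items():
    """Expand every template against every placeholder value for each hazard."""
    items = []
    for hazard, actions in ACTIONS.items():
        for template, action in product(TEMPLATES, actions):
            items.append({
                "hazard": hazard,
                "prompt": template.format(action=action),
            })
    return items

if __name__ == "__main__":
    for item in generate_test_items():
        print(item["hazard"], "|", item["prompt"])
```

Because every template is crossed with every placeholder value, a modest number of templates and values multiplies out to a large item count, which is consistent with how tens of thousands of test items can be produced from a template set.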
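Item (4) refers to a grading system that scores systems under test (SUTs) against the benchmark. The sketch below shows one plausible shape for such a grader, mapping the fraction of a SUT's responses judged unsafe in each hazard category to an ordinal grade; the thresholds and grade labels are invented for illustration and are not the scheme defined in the paper.

```python
from typing import Dict

# Hypothetical grade bands: (max unsafe fraction, label). The paper
# defines its own grading scheme; these numbers are placeholders.
GRADE_BANDS = [
    (0.001, "Low risk"),
    (0.01,  "Moderate-low risk"),
    (0.05,  "Moderate risk"),
    (0.15,  "Moderate-high risk"),
]

def grade(unsafe_fraction: float) -> str:
    """Map a fraction of unsafe responses to an ordinal grade label."""
    for threshold, label in GRADE_BANDS:
        if unsafe_fraction <= threshold:
            return label
    return "High risk"

def grade_sut(results: Dict[str, float]) -> Dict[str, str]:
    """Grade a SUT per hazard category.

    `results` maps hazard category name -> fraction of that category's
    test items for which the SUT's response was judged unsafe.
    """
    return {hazard: grade(frac) for hazard, frac in results.items()}

print(grade_sut({"example_hazard": 0.02}))  # {'example_hazard': 'Moderate risk'}
```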