
MLCommons launches its AI Safety Benchmark, starting with v0.5.

Introducing v0.5 of the AI Safety Benchmark from MLCommons

April 18, 2024
作者: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, Ram Gandikota, Agasthya Gangavarapu, Ananya Gangavarapu, James Gealy, Rajat Ghosh, James Goel, Usman Gohar, Sujata Goswami, Scott A. Hale, Wiebke Hutiri, Joseph Marvin Imperial, Surgan Jandial, Nick Judd, Felix Juefei-Xu, Foutse Khomh, Bhavya Kailkhura, Hannah Rose Kirk, Kevin Klyman, Chris Knotz, Michael Kuchnik, Shachi H. Kumar, Chris Lengerich, Bo Li, Zeyi Liao, Eileen Peters Long, Victor Lu, Yifan Mai, Priyanka Mary Mammen, Kelvin Manyeki, Sean McGregor, Virendra Mehta, Shafee Mohammed, Emanuel Moss, Lama Nachman, Dinesh Jinenhally Naganna, Amin Nikanjam, Besmira Nushi, Luis Oala, Iftach Orr, Alicia Parrish, Cigdem Patlak, William Pietri, Forough Poursabzi-Sangdeh, Eleonora Presani, Fabrizio Puletti, Paul Röttger, Saurav Sahay, Tim Santos, Nino Scherrer, Alice Schoenauer Sebag, Patrick Schramowski, Abolfazl Shahbazi, Vin Sharma, Xudong Shen, Vamsi Sistla, Leonard Tang, Davide Testuggine, Vithursan Thangarasa, Elizabeth Anne Watkins, Rebecca Weiss, Chris Welty, Tyler Wilbers, Adina Williams, Carole-Jean Wu, Poonam Yadav, Xianjun Yang, Yi Zeng, Wenhui Zhang, Fedor Zhdanov, Jiacheng Zhu, Percy Liang, Peter Mattson, Joaquin Vanschoren
cs.AI

Abstract

This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
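
The abstract notes that all 43,090 test items were generated from templates spanning the three personas and the seven tested hazard categories. Below is a minimal sketch of what such template expansion could look like; the persona and hazard names follow the paper's v0.5 scope, but the template skeletons, sentence fragments, and the `expand` helper are hypothetical illustrations, not the working group's actual generation code:

```python
from itertools import product

# Personas and the seven tested hazard categories come from the paper;
# everything below them is an illustrative stand-in.
PERSONAS = ["typical user", "malicious user", "vulnerable user"]
HAZARDS = [
    "violent crimes", "non-violent crimes", "sex-related crimes",
    "child sexual exploitation", "indiscriminate weapons (CBRNE)",
    "suicide & self-harm", "hate",
]

# Hypothetical sentence skeletons, each with a {fragment} slot.
TEMPLATES = [
    "{fragment}",
    "Tell me how to {fragment}.",
    "Is it ever acceptable to {fragment}?",
]

# Hypothetical hazard-specific fragments; the real benchmark pairs many
# such fragments with each hazard category.
FRAGMENTS = {hazard: [f"do something that falls under {hazard}"]
             for hazard in HAZARDS}

def expand(templates, personas, fragments):
    """Yield one test item (prompt plus metadata) per combination.

    The persona is recorded as metadata here; in a fuller version it
    would also shape which templates and phrasings are used.
    """
    for tmpl, persona, (hazard, frags) in product(
            templates, personas, fragments.items()):
        for frag in frags:
            yield {
                "persona": persona,
                "hazard": hazard,
                "prompt": tmpl.format(fragment=frag),
            }

items = list(expand(TEMPLATES, PERSONAS, FRAGMENTS))
print(len(items))  # 3 templates x 3 personas x 7 hazards x 1 fragment = 63
```

Scaling this cross-product over many more fragments and interaction templates is how a small set of patterns can yield tens of thousands of test items, which is presumably why the paper reports a total as large as 43,090.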
