Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle

July 18, 2024
作者: Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, Jianwen Zhang, Hiteshi Sharma, Blake Bullwinkel, Martin Pouliot, Amanda Minnich, Shiven Chawla, Solianna Herrera, Shahed Warreth, Maggie Engler, Gary Lopez, Nina Chikanov, Raja Sekhar Rao Dheekonda, Bolor-Erdene Jagdagdorj, Roman Lutz, Richard Lundeen, Tori Westerhoff, Pete Bryan, Christian Seifert, Ram Shankar Siva Kumar, Andrew Berkley, Alex Kessler
cs.AI

Abstract

Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety aligning the Phi-3 series of language models. We utilized a "break-fix" cycle, performing multiple rounds of dataset curation, safety post-training, benchmarking, red teaming, and vulnerability identification to cover a variety of harm areas in both single and multi-turn scenarios. Our results indicate that this approach iteratively improved the performance of the Phi-3 models across a wide range of responsible AI benchmarks.
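The iterative process described above can be sketched as a simple loop. This is a minimal, hypothetical illustration of a "break-fix" cycle; the function parameters (`curate`, `post_train`, `benchmark`, `red_team`) are illustrative placeholders, not the actual Phi-3 pipeline.

```python
def break_fix_cycle(model, curate, post_train, benchmark, red_team, max_rounds=3):
    """Iteratively harden `model` until red teaming finds no new issues.

    Each round mirrors the stages named in the abstract: dataset curation,
    safety post-training, benchmarking, and red teaming / vulnerability
    identification. All stage implementations are caller-supplied stubs.
    """
    results = []
    for round_no in range(max_rounds):
        data = curate(model)             # dataset curation
        model = post_train(model, data)  # safety post-training
        scores = benchmark(model)        # responsible-AI benchmarks
        vulns = red_team(model)          # red teaming -> vulnerability list
        results.append({"round": round_no, "scores": scores, "vulns": vulns})
        if not vulns:                    # nothing left to "break": stop early
            break
    return model, results
```

A toy invocation with stand-in callables shows the loop terminating once red teaming reports no vulnerabilities; real curation, training, and red-teaming stages would replace the lambdas.

```python
model, log = break_fix_cycle(
    model=0,
    curate=lambda m: ["example prompt"],
    post_train=lambda m, d: m + 1,
    benchmark=lambda m: {"safety": m},
    red_team=lambda m: [] if m >= 2 else ["jailbreak"],
)
```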

