Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle

July 18, 2024
Authors: Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, Jianwen Zhang, Hiteshi Sharma, Blake Bullwinkel, Martin Pouliot, Amanda Minnich, Shiven Chawla, Solianna Herrera, Shahed Warreth, Maggie Engler, Gary Lopez, Nina Chikanov, Raja Sekhar Rao Dheekonda, Bolor-Erdene Jagdagdorj, Roman Lutz, Richard Lundeen, Tori Westerhoff, Pete Bryan, Christian Seifert, Ram Shankar Siva Kumar, Andrew Berkley, Alex Kessler
cs.AI

Abstract

Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety aligning the Phi-3 series of language models. We utilized a "break-fix" cycle, performing multiple rounds of dataset curation, safety post-training, benchmarking, red teaming, and vulnerability identification to cover a variety of harm areas in both single and multi-turn scenarios. Our results indicate that this approach iteratively improved the performance of the Phi-3 models across a wide range of responsible AI benchmarks.
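The abstract describes the break-fix cycle only at a high level. The sketch below shows one way such an iterative loop could be orchestrated; it is a minimal illustration under stated assumptions, not the authors' pipeline, and every helper in it (curate_safety_dataset, safety_post_train, run_rai_benchmarks, red_team) is a hypothetical placeholder standing in for a stage named in the abstract.

```python
"""Minimal sketch of a "break-fix" safety post-training loop.

All helper functions here are hypothetical placeholders; the report
does not publish its tooling, only the stages of the cycle.
"""

from dataclasses import dataclass, field


@dataclass
class CycleReport:
    """Results gathered in one break-fix round."""
    round_id: int
    benchmark_scores: dict = field(default_factory=dict)
    vulnerabilities: list = field(default_factory=list)


def curate_safety_dataset(vulnerabilities):
    # Placeholder: turn red-team findings into training examples that
    # cover the exposed harm areas (single- and multi-turn prompts).
    return [{"prompt": v, "response": "<safe completion or refusal>"}
            for v in vulnerabilities]


def safety_post_train(model, dataset):
    # Placeholder: fine-tune or preference-align the model on the
    # curated safety data; returns the "fixed" model.
    return model


def run_rai_benchmarks(model):
    # Placeholder: score the model on responsible-AI benchmarks.
    return {"example_harm_benchmark": 0.0}


def red_team(model):
    # Placeholder: adversarial probing that "breaks" the model and
    # returns newly discovered failure-inducing prompts.
    return []


def break_fix_cycle(model, max_rounds=5):
    """Alternate fixing (training) and breaking (red teaming)."""
    vulnerabilities = red_team(model)  # initial "break"
    for round_id in range(max_rounds):
        if not vulnerabilities:
            break  # converged: red team found nothing new this round
        dataset = curate_safety_dataset(vulnerabilities)  # 1. curate
        model = safety_post_train(model, dataset)         # 2. fix
        scores = run_rai_benchmarks(model)                # 3. measure
        vulnerabilities = red_team(model)                 # 4. break again
        print(CycleReport(round_id, scores, vulnerabilities))
    return model
```

In this framing, the loop's exit condition (no new vulnerabilities, or a round budget) is what makes the process iterative rather than a one-shot alignment pass, matching the abstract's claim of progressive improvement across rounds.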
