

Self-Boosting Large Language Models with Synthetic Preference Data

October 9, 2024
Authors: Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei
cs.AI

Abstract

Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
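The abstract describes SynPO only at a high level: a self-prompt generator synthesizes diverse prompts, a response improver refines the model's own drafts, and the model is then preference-optimized on the resulting synthetic pairs, iteratively. The Python sketch below illustrates one plausible shape of a single such iteration. The callable names, the pairing of the refined response as "chosen" against the draft as "rejected", and the DPO-style update are illustrative assumptions, not details confirmed by the paper.

```python
from typing import Callable, Dict, List

def synpo_iteration(
    generate_prompts: Callable[[int], List[str]],       # self-prompt generator
    generate_response: Callable[[str], str],            # current policy model
    refine_response: Callable[[str, str], str],         # response improver
    preference_optimize: Callable[[List[Dict[str, str]]], None],  # e.g. a DPO-style update (assumption)
    num_prompts: int = 1000,
) -> List[Dict[str, str]]:
    """One hypothetical self-boosting round: synthesize prompts, build
    preference pairs from draft vs. refined responses, update the model."""
    pairs: List[Dict[str, str]] = []
    prompts = generate_prompts(num_prompts)
    for prompt in prompts:
        draft = generate_response(prompt)            # model's own output
        refined = refine_response(prompt, draft)     # improver's revision
        # Assumption: the refined response is treated as "chosen" and the
        # original draft as "rejected" in the synthetic preference pair.
        pairs.append({"prompt": prompt, "chosen": refined, "rejected": draft})
    preference_optimize(pairs)                       # update the policy, then repeat next iteration
    return pairs
```

In this reading, each iteration both broadens the prompt distribution and sharpens the reward signal the model learns from its own outputs, which is what removes the need for large-scale human-written prompts and preference labels.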
