Self-Boosting Large Language Models with Synthetic Preference Data

October 9, 2024
Authors: Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei
cs.AI

Abstract

Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
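The abstract outlines an iterative loop: a self-prompt generator synthesizes diverse prompts, the current model drafts responses, a response improver refines those drafts, and the resulting (draft, refined) pairs serve as synthetic preferences for alignment training. The sketch below is a minimal, hypothetical Python illustration of that loop; the function names (generate_prompts, respond, improve_response, preference_update), the toy model dictionary, and the stubbed generation logic are assumptions made for illustration, not the authors' implementation, and the update step merely stands in for a DPO-style preference-optimization step.

```python
"""Hypothetical sketch of a SynPO-style self-boosting loop.

All components are stubs: real usage would call an actual LLM for prompt
generation, response drafting, and refinement, and run a preference-
optimization update (e.g., DPO) instead of the placeholder below.
"""
import random


def generate_prompts(model, num_prompts):
    # Self-prompt generator: the model synthesizes diverse instructions.
    # Stubbed with canned templates for illustration.
    tasks = ["summarize a news article", "explain a math concept", "draft an email"]
    return [f"Please {random.choice(tasks)} (variant {i})." for i in range(num_prompts)]


def respond(model, prompt):
    # Current policy produces a draft response (stubbed).
    return f"[draft answer by {model['name']} to: {prompt}]"


def improve_response(model, prompt, draft):
    # Response improver refines the draft into a preferred response (stubbed).
    return f"[refined answer to: {prompt}]"


def preference_update(model, pairs, lr=0.1):
    # Placeholder for a preference-optimization step on
    # (prompt, chosen, rejected) triples; here it only tracks a counter.
    model["updates"] += len(pairs) * lr
    return model


def synpo(model, iterations=4, prompts_per_iter=8):
    for t in range(iterations):
        prompts = generate_prompts(model, prompts_per_iter)
        pairs = []
        for p in prompts:
            rejected = respond(model, p)                    # model's own output
            chosen = improve_response(model, p, rejected)   # refined output
            pairs.append((p, chosen, rejected))             # synthetic preference pair
        model = preference_update(model, pairs)             # align on synthetic data
        print(f"iteration {t + 1}: trained on {len(pairs)} synthetic pairs")
    return model


if __name__ == "__main__":
    synpo({"name": "toy-llm", "updates": 0.0})
```

The key design point the abstract emphasizes is that both sides of each preference pair come from the model itself (its draft versus its refined output), so the loop needs no externally annotated prompts or human preference labels.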
