Self-Boosting Large Language Models with Synthetic Preference Data

Abstract

Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is resource-intensive and demands creativity, especially when LLMs must be improved continually. We introduce SynPO, a self-boosting paradigm that uses synthetic preference data for model alignment. SynPO employs an iterative mechanism in which a self-prompt generator creates diverse prompts and a response improver progressively refines model responses. This approach trains LLMs to autonomously learn generative rewards for their own outputs and removes the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significantly better instruction following, with win-rate improvements of over 22.1% on AlpacaEval 2.0 and ArenaHard. SynPO also improves general LLM performance across a range of tasks, as shown by an average score increase of 3.2 to 5.0 on the well-recognized Open LLM leaderboard.
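The iterative mechanism described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how prompts are seeded, how the response improver is built, or which preference-optimization objective is used. The names `build_pairs`, `self_boost`, `PreferencePair`, the `seeds` argument, and all `toy_*` functions are hypothetical, and the training step is a no-op placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A "policy" here is just a function from prompt text to response text.
Policy = Callable[[str], str]


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # refined response, treated as the preferred one
    rejected: str  # the model's own unrefined draft


def build_pairs(
    seeds: List[str],
    generate_prompt: Callable[[str], str],
    policy: Policy,
    improve: Callable[[str, str], str],
) -> List[PreferencePair]:
    """Build one round of synthetic preference pairs (hypothetical helper)."""
    pairs = []
    for seed in seeds:
        prompt = generate_prompt(seed)    # self-prompt generator
        draft = policy(prompt)            # current model response
        refined = improve(prompt, draft)  # response improver
        if refined != draft:              # identical texts carry no preference signal
            pairs.append(PreferencePair(prompt, refined, draft))
    return pairs


def self_boost(
    seeds: List[str],
    generate_prompt: Callable[[str], str],
    policy: Policy,
    improve: Callable[[str, str], str],
    train: Callable[[Policy, List[PreferencePair]], Policy],
    iterations: int = 4,
) -> Tuple[Policy, List[List[PreferencePair]]]:
    """Alternate between building synthetic pairs and updating the policy."""
    history = []
    for _ in range(iterations):
        pairs = build_pairs(seeds, generate_prompt, policy, improve)
        policy = train(policy, pairs)  # placeholder for preference optimization
        history.append(pairs)
    return policy, history


# Toy stand-ins so the sketch runs without any model (assumptions, not real components).
def toy_generate_prompt(seed: str) -> str:
    return f"Explain {seed} in one sentence."


def toy_policy(prompt: str) -> str:
    return "Draft answer to: " + prompt


def toy_improve(prompt: str, draft: str) -> str:
    return draft + " (refined)"


def toy_train(policy: Policy, pairs: List[PreferencePair]) -> Policy:
    return policy  # no-op: a real setup would update model weights here


final_policy, history = self_boost(
    ["recursion", "hashing"], toy_generate_prompt, toy_policy, toy_improve, toy_train
)
```

In a real setup, `toy_policy`, `toy_improve`, and `toy_train` would be replaced by model calls and a preference-optimization step. The default of four iterations mirrors the number of SynPO iterations reported in the abstract.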
