sDPO: Don't Use Your Data All at Once

March 28, 2024
Authors: Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, Chanjun Park
cs.AI

Abstract

As development of large language models (LLM) progresses, aligning them with human preferences has become increasingly important. We propose stepwise DPO (sDPO), an extension of the recently popularized direct preference optimization (DPO) for alignment tuning. This approach involves dividing the available preference datasets and utilizing them in a stepwise manner, rather than employing them all at once. We demonstrate that this method facilitates the use of more precisely aligned reference models within the DPO training framework. Furthermore, sDPO trains the final model to be more performant, even outperforming other popular LLMs with more parameters.
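
The stepwise procedure described in the abstract can be pictured as a training loop: the preference data is split into chunks, each chunk is used for one DPO step, and the model aligned in that step becomes the reference model for the next step. The following is a minimal Python sketch of that idea only; `run_dpo` is a hypothetical stand-in for a single DPO training call (e.g. a DPO trainer from an alignment library), not the authors' released code.

```python
# Minimal sketch of stepwise DPO (sDPO): instead of running DPO once over the
# full preference dataset, split it into chunks and run DPO on each chunk in
# turn, using the model aligned in the previous step as the reference model
# for the next step. `run_dpo` is a hypothetical callable supplied by the user.

from copy import deepcopy
from typing import Any, Callable, List, Sequence


def sdpo_train(
    model: Any,
    preference_data: Sequence[Any],
    num_steps: int,
    run_dpo: Callable[[Any, Any, Sequence[Any]], Any],
) -> Any:
    """Align `model` by consuming `preference_data` in `num_steps` DPO steps."""
    # Split the preference dataset into roughly equal chunks, one per step.
    chunk_size = (len(preference_data) + num_steps - 1) // num_steps
    chunks: List[Sequence[Any]] = [
        preference_data[i : i + chunk_size]
        for i in range(0, len(preference_data), chunk_size)
    ]

    # The initial (SFT) model serves as the reference model for the first step.
    ref_model = deepcopy(model)
    for chunk in chunks:
        # Standard DPO on this chunk against the current reference model.
        model = run_dpo(model, ref_model, chunk)
        # The freshly aligned model becomes the more precisely aligned
        # reference model for the next step; this is the key difference
        # from plain DPO, which keeps the SFT model as reference throughout.
        ref_model = deepcopy(model)
    return model
```

With `num_steps = 1` this reduces to ordinary DPO on the whole dataset, which is why sDPO is described as an extension of DPO rather than a replacement.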
