

Enable Language Models to Implicitly Learn Self-Improvement From Data

October 2, 2023
Authors: Ziqi Wang, Le Hou, Tianjian Lu, Yuexin Wu, Yunxuan Li, Hongkun Yu, Heng Ji
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks. However, the inherent open-ended nature of these tasks implies that there is always room for improvement in the quality of model responses. To address this challenge, various approaches have been proposed to enhance the performance of LLMs. There has been a growing focus on enabling LLMs to self-improve their response quality, thereby reducing the reliance on extensive human annotation efforts for collecting diverse and high-quality training data. Recently, prompting-based methods have been widely explored among self-improvement methods owing to their effectiveness, efficiency, and convenience. However, these methods usually require explicitly and thoroughly written rubrics as inputs to LLMs. It is expensive and challenging to manually derive and provide all necessary rubrics for a complex real-world improvement goal (e.g., being more helpful and less harmful). To this end, we propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data. PIT only requires the preference data that are used to train reward models, without extra human effort. Specifically, we reformulate the training objective of reinforcement learning from human feedback (RLHF) -- instead of maximizing response quality for a given input, we maximize the quality gap of the response conditioned on a reference response. In this way, PIT is implicitly trained with the improvement goal of better aligning with human preferences. Experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods.
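
The reformulation described in the abstract can be sketched roughly as follows; the notation (policy \pi, data distribution \mathcal{D}, reward model r, reference response y_ref, and gap reward r_gap) is assumed here for illustration and is not defined in the abstract itself:

\[
\text{RLHF:}\qquad \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big]
\]
\[
\text{PIT (sketch):}\qquad \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x,\, y_{\text{ref}})}\big[\, r_{\text{gap}}(x, y, y_{\text{ref}}) \,\big]
\]

Here r_gap is meant to score how much a generated response y improves over the reference response y_ref under the learned human-preference reward, rather than scoring y in isolation.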