Enable Language Models to Implicitly Learn Self-Improvement From Data
October 2, 2023
Authors: Ziqi Wang, Le Hou, Tianjian Lu, Yuexin Wu, Yunxuan Li, Hongkun Yu, Heng Ji
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in
open-ended text generation tasks. However, the inherent open-ended nature of
these tasks implies that there is always room for improvement in the quality of
model responses. To address this challenge, various approaches have been
proposed to enhance the performance of LLMs. There has been a growing focus on
enabling LLMs to self-improve their response quality, thereby reducing the
reliance on extensive human annotation efforts for collecting diverse and
high-quality training data. Recently, prompting-based methods have been widely
explored among self-improvement methods owing to their effectiveness,
efficiency, and convenience. However, those methods usually require explicitly
and thoroughly written rubrics as inputs to LLMs. It is expensive and
challenging to manually derive and provide all necessary rubrics with a
real-world complex goal for improvement (e.g., being more helpful and less
harmful). To this end, we propose an ImPlicit Self-ImprovemenT (PIT) framework
that implicitly learns the improvement goal from human preference data. PIT
only requires preference data that are used to train reward models without
extra human efforts. Specifically, we reformulate the training objective of
reinforcement learning from human feedback (RLHF) -- instead of maximizing
response quality for a given input, we maximize the quality gap of the response
conditioned on a reference response. In this way, PIT is implicitly trained
with the improvement goal of better aligning with human preferences.
Experiments on two real-world datasets and one synthetic dataset show that our
method significantly outperforms prompting-based methods.
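To make the reformulated objective above more concrete, here is a minimal sketch, not the authors' code, of the "quality gap" idea: rather than maximizing a reward r(x, y) for a response on its own, the policy is rewarded for how much its response improves on a reference response. The function names and the simple subtraction form are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of a gap-conditioned reward, assuming a preference-trained
# reward model r(x, y) is available. Names are illustrative.

def reward_model(prompt: str, response: str) -> float:
    """Stand-in for a reward model r(x, y) trained on human preference data."""
    # In practice this is a learned scorer; here it is only a placeholder.
    raise NotImplementedError

def gap_reward(prompt: str, response: str, reference: str) -> float:
    """Quality gap of `response` over `reference` for the same prompt.

    Standard RLHF (roughly):  maximize  r(x, y)
    Gap-style objective:      maximize  r(x, y) - r(x, y_ref)

    The policy is rewarded only insofar as it improves on the reference
    response, so the improvement goal is encoded implicitly by the
    preference-trained reward model rather than by hand-written rubrics.
    """
    return reward_model(prompt, response) - reward_model(prompt, reference)
```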