Reasons to Reject? Aligning Language Models with Judgments
December 22, 2023
Authors: Weiwen Xu, Deng Cai, Zhisong Zhang, Wai Lam, Shuming Shi
cs.AI
Abstract
As humans, we consistently engage in interactions with our peers and receive feedback in the form of natural language. This language feedback allows us to reflect on our actions, maintain appropriate behavior, and rectify our errors. A natural question arises: can we use language feedback to align large language models (LLMs)? In contrast to previous research that aligns LLMs with reward or preference data, we present the first systematic exploration of alignment through the lens of language feedback (i.e., judgments). We begin with an in-depth investigation of potential methods that can be adapted for aligning LLMs with judgments, revealing that these methods are unable to fully capitalize on the judgments. To make more effective use of judgments, we propose a novel framework, Contrastive Unlikelihood Training (CUT), which allows for fine-grained detection and correction of inappropriate content based on judgments. Our offline alignment results show that, with merely 1317 off-the-shelf judgment samples, CUT (LLaMA2-13b) can beat the 175B DaVinci003 and surpass the best baseline by 52.34 points on AlpacaEval. The online alignment results demonstrate that CUT can align LLMs (LLaMA2-chat-13b) iteratively using model-specific judgment data, with steady performance improvement from 81.09 to 91.36 points on AlpacaEval. Our analysis further suggests that judgments hold greater potential than rewards for LLM alignment and warrant future research.
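For readers unfamiliar with unlikelihood training, which CUT builds on, the sketch below illustrates the general idea in PyTorch: a standard maximum-likelihood term over tokens judged appropriate is combined with an unlikelihood penalty over tokens flagged as inappropriate. The function name, tensor shapes, and the assumption that a token-level mask is already available are illustrative, not the paper's actual CUT implementation.

```python
import torch
import torch.nn.functional as F

def judgment_guided_loss(logits: torch.Tensor,
                         target_ids: torch.Tensor,
                         inappropriate_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative loss: MLE on appropriate tokens, unlikelihood on flagged ones.

    logits:             (seq_len, vocab_size) model outputs for one response
    target_ids:         (seq_len,) token ids of the response
    inappropriate_mask: (seq_len,) bool, True where a token is flagged as inappropriate
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Per-token log-probability of the reference tokens.
    token_log_p = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

    keep = (~inappropriate_mask).float()
    flag = inappropriate_mask.float()

    # Maximum-likelihood term: raise the probability of appropriate tokens.
    mle_loss = -(token_log_p * keep).sum()

    # Unlikelihood term: lower the probability of flagged tokens via log(1 - p),
    # clamped for numerical stability when p is close to 1.
    token_p = token_log_p.exp()
    unlikelihood_loss = -(torch.log1p(-token_p).clamp(min=-20.0) * flag).sum()

    return (mle_loss + unlikelihood_loss) / target_ids.numel()
```

In the paper, the fine-grained signal about which spans are inappropriate is derived from the judgments themselves; here the mask is simply taken as a given input to keep the sketch self-contained.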