Reasons to Reject? Aligning Language Models with Judgments

December 22, 2023
Authors: Weiwen Xu, Deng Cai, Zhisong Zhang, Wai Lam, Shuming Shi
cs.AI

Abstract

As humans, we consistently engage in interactions with our peers and receive feedback in the form of natural language. This language feedback allows us to reflect on our actions, maintain appropriate behavior, and rectify our errors. The question arises naturally: can we use language feedback to align large language models (LLMs)? In contrast to previous research that aligns LLMs with reward or preference data, we present the first systematic exploration of alignment through the lens of language feedback (i.e., judgments). We commence with an in-depth investigation of potential methods that can be adapted for aligning LLMs with judgments, revealing that these methods are unable to fully capitalize on the judgments. To facilitate more effective utilization of judgments, we propose a novel framework, Contrastive Unlikelihood Training (CUT), that allows for fine-grained inappropriate-content detection and correction based on judgments. Our offline alignment results show that, with merely 1317 off-the-shelf judgment samples, CUT (LLaMA2-13b) can beat the 175B DaVinci003 and surpass the best baseline by 52.34 points on AlpacaEval. The online alignment results demonstrate that CUT can align LLMs (LLaMA2-chat-13b) in an iterative fashion using model-specific judgment data, with a steady performance improvement from 81.09 to 91.36 points on AlpacaEval. Our analysis further suggests that judgments exhibit greater potential than rewards for LLM alignment and warrant future research.
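The abstract's central mechanism, fine-grained detection and correction of inappropriate content via unlikelihood training, can be illustrated with a short sketch. The code below is a minimal, hypothetical rendering of a CUT-style loss, not the authors' implementation: it assumes the model is scored twice, with and without the judgment appended to the input, flags gold tokens whose probability rises under the (negative) judgment as the content being criticized, penalizes those tokens with an unlikelihood term, and applies ordinary maximum likelihood to the rest. The function name cut_style_loss, the tensor shapes, and the ratio_threshold detection rule are illustrative assumptions.

```python
# Illustrative sketch only: the detection rule, threshold, and weighting here
# are assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def cut_style_loss(logits_plain: torch.Tensor,
                   logits_judged: torch.Tensor,
                   targets: torch.Tensor,
                   ratio_threshold: float = 1.0) -> torch.Tensor:
    """Contrastive unlikelihood loss over one response.

    logits_plain:  (T, V) logits for p(y_t | x)            -- no judgment
    logits_judged: (T, V) logits for p(y_t | x, judgment)  -- judgment appended
    targets:       (T,)   gold response token ids
    """
    logp_plain = F.log_softmax(logits_plain, dim=-1)
    logp_judged = F.log_softmax(logits_judged, dim=-1)

    # Log-probability of each gold token under both conditions.
    tgt_plain = logp_plain.gather(1, targets[:, None]).squeeze(1)
    tgt_judged = logp_judged.gather(1, targets[:, None]).squeeze(1)

    # Detection: tokens the negative judgment "explains" -- i.e. whose
    # probability rises once the judgment is given -- are flagged as
    # inappropriate.
    flagged = (tgt_judged - tgt_plain) > torch.log(torch.tensor(ratio_threshold))

    # Appropriate tokens: standard maximum likelihood.
    mle_loss = -tgt_plain[~flagged].sum()

    # Inappropriate tokens: unlikelihood, i.e. push p(y_t | x) down
    # by maximizing log(1 - p).
    p_flagged = tgt_plain[flagged].exp().clamp(max=1.0 - 1e-6)
    unlikelihood_loss = -torch.log1p(-p_flagged).sum()

    return (mle_loss + unlikelihood_loss) / targets.numel()
```

In the paper's online setting, a loss of this shape would be applied over repeated rounds: collect model-specific judgments on the current model's outputs, retrain with the contrastive objective, and repeat.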