Reasons to Reject? Aligning Language Models with Judgments

Abstract

As humans, we constantly interact with our peers and receive feedback in the form of natural language. This language feedback allows us to reflect on our actions, maintain appropriate behavior, and rectify our errors. A question arises naturally: can we use language feedback to align large language models (LLMs)? In contrast to previous research that aligns LLMs with reward or preference data, we present the first systematic exploration of alignment through the lens of language feedback (i.e., judgment). We begin with an in-depth investigation of existing methods that can be adapted for aligning LLMs with judgments, and find that these methods cannot fully capitalize on the judgments. To use judgments more effectively, we propose a novel framework, Contrastive Unlikelihood Training (CUT), that allows for fine-grained detection and correction of inappropriate content based on judgments. Our offline alignment results show that, with only 1317 off-the-shelf judgment examples, CUT (LLaMA2-13b) can beat the 175B DaVinci003 and surpass the best baseline by 52.34 points on AlpacaEval. Our online alignment results demonstrate that CUT can align LLMs (LLaMA2-chat-13b) iteratively using model-specific judgment data, with performance on AlpacaEval improving steadily from 81.09 to 91.36 points. Our analysis further suggests that judgments hold greater potential than rewards for LLM alignment and warrant future research.
