

Suppressing Pink Elephants with Direct Principle Feedback

February 12, 2024
Authors: Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman
cs.AI

Abstract

Existing methods for controlling language models, such as RLHF and Constitutional AI, involve determining which LLM behaviors are desirable and training them into a language model. However, in many cases, it is desirable for LLMs to be controllable at inference time, so that they can be used in multiple contexts with diverse needs. We illustrate this with the Pink Elephant Problem: instructing an LLM to avoid discussing a certain entity (a "Pink Elephant"), and instead discuss a preferred entity ("Grey Elephant"). We apply a novel simplification of Constitutional AI, Direct Principle Feedback, which skips the ranking of responses and uses DPO directly on critiques and revisions. Our results show that after DPF fine-tuning on our synthetic Pink Elephants dataset, our 13B fine-tuned LLaMA 2 model significantly outperforms Llama-2-13B-Chat and a prompted baseline, and performs as well as GPT-4 on our curated test set assessing the Pink Elephant Problem.
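
The abstract describes Direct Principle Feedback only at a high level: each critique/revision pair is used directly as a DPO preference pair, with no separate response-ranking step. The snippet below is a minimal, hypothetical sketch of that idea, not the authors' released code; the `dpf_dpo_loss` function name, the `beta` value, and the example record are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dpf_dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective applied to a (revision, original) pair.

    Under Direct Principle Feedback as described in the abstract, the
    "chosen" completion is the revised response that avoids the Pink
    Elephant, and the "rejected" completion is the original response
    that mentioned it, so no ranking over multiple responses is needed.
    """
    # Implicit rewards: log-ratio of the policy to a frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the revision above the original response by maximizing the margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Hypothetical critique/revision record turned into a DPO preference pair.
example_pair = {
    "prompt": "Do not discuss the Pink Elephant. Tell me about zoo animals.",
    "rejected": "Sure! The Pink Elephant is the most famous attraction...",
    "chosen": "Sure! The Grey Elephant is one of the most popular animals...",
}
```

In this sketch, the per-example log-probabilities would be obtained by summing the per-token log-probs of each completion under the fine-tuned policy and a frozen reference model, as in standard DPO training.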