Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
June 2, 2024
Authors: Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Michael Bernstein, Diyi Yang
cs.AI
Abstract
Language models are aligned to emulate the collective voice of many,
resulting in outputs that align with no one in particular. Steering LLMs away
from generic output is possible through supervised finetuning or RLHF, but
requires prohibitively large datasets for new ad-hoc tasks. We argue that it is
instead possible to align an LLM to a specific setting by leveraging a very
small number (<10) of demonstrations as feedback. Our method, Demonstration
ITerated Task Optimization (DITTO), directly aligns language model outputs to a
user's demonstrated behaviors. Derived using ideas from online imitation
learning, DITTO cheaply generates online comparison data by treating users'
demonstrations as preferred over output from the LLM and its intermediate
checkpoints. We evaluate DITTO's ability to learn fine-grained style and task
alignment across domains such as news articles, emails, and blog posts.
Additionally, we conduct a user study soliciting a range of demonstrations from
participants (N=16). Across our benchmarks and user study, we find that
win-rates for DITTO outperform few-shot prompting, supervised fine-tuning, and
other self-play methods by an average of 19 percentage points. By using demonstrations as
feedback directly, DITTO offers a novel method for effective customization of
LLMs.
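The abstract describes DITTO as cheaply generating online comparison data by treating each user demonstration as preferred over outputs from the LLM and its intermediate checkpoints. The sketch below is a minimal illustration of how such comparison pairs might be assembled; it is not the authors' implementation, and the function names, pair format, and sampling counts are assumptions made for the example.

```python
# Illustrative sketch only: assemble DITTO-style comparison pairs in which each
# user demonstration is preferred over samples from the LLM and its checkpoints.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ComparisonPair:
    prompt: str
    preferred: str      # the user's demonstration
    dispreferred: str   # a sample from the LLM or one of its intermediate checkpoints

def build_comparison_pairs(
    demos: List[Tuple[str, str]],              # [(prompt, demonstration), ...]
    checkpoints: List[Callable[[str], str]],   # each maps a prompt to a sampled completion
    samples_per_checkpoint: int = 4,
) -> List[ComparisonPair]:
    """Treat every demonstration as preferred over checkpoint samples for the same prompt."""
    pairs: List[ComparisonPair] = []
    for prompt, demo in demos:
        for sample_fn in checkpoints:
            for _ in range(samples_per_checkpoint):
                candidate = sample_fn(prompt)
                if candidate != demo:            # skip degenerate ties
                    pairs.append(ComparisonPair(prompt, demo, candidate))
    return pairs
```

Under this reading, the resulting pairs would feed a standard preference-optimization update (e.g., a DPO-style loss), with the process repeated as new checkpoints are produced, reflecting the "iterated" aspect named in the method's title.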