Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
June 2, 2024
Authors: Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Michael Bernstein, Diyi Yang
cs.AI
Abstract
Language models are aligned to emulate the collective voice of many,
resulting in outputs that align with no one in particular. Steering LLMs away
from generic output is possible through supervised finetuning or RLHF, but
requires prohibitively large datasets for new ad-hoc tasks. We argue that it is
instead possible to align an LLM to a specific setting by leveraging a very
small number (<10) of demonstrations as feedback. Our method, Demonstration
ITerated Task Optimization (DITTO), directly aligns language model outputs to a
user's demonstrated behaviors. Derived using ideas from online imitation
learning, DITTO cheaply generates online comparison data by treating users'
demonstrations as preferred over output from the LLM and its intermediate
checkpoints. We evaluate DITTO's ability to learn fine-grained style and task
alignment across domains such as news articles, emails, and blog posts.
Additionally, we conduct a user study soliciting a range of demonstrations from
participants (N=16). Across our benchmarks and user study, we find that
win-rates for DITTO outperform few-shot prompting, supervised fine-tuning, and
other self-play methods by an average of 19 percentage points. By using demonstrations as
feedback directly, DITTO offers a novel method for effective customization of
LLMs.
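To make the abstract's description concrete, here is a minimal sketch (not the authors' implementation) of how DITTO-style comparison data could be assembled: each user demonstration is treated as the preferred response, and completions sampled from the current model and its intermediate checkpoints are treated as dispreferred. The names `ComparisonPair`, `build_ditto_pairs`, and `samplers` are hypothetical; the resulting pairs would then feed a standard preference-optimization update such as DPO.

```python
# Illustrative sketch only: constructing preference pairs in which a user's
# demonstration is preferred over outputs from the LLM and its checkpoints.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ComparisonPair:
    prompt: str
    chosen: str    # the user's demonstration
    rejected: str  # a completion sampled from the policy or a checkpoint


def build_ditto_pairs(
    prompts: List[str],
    demonstrations: List[str],
    samplers: List[Callable[[str], str]],
    samples_per_checkpoint: int = 4,
) -> List[ComparisonPair]:
    """Pair each demonstration (preferred) against completions drawn from
    each sampler, i.e. the current model and its intermediate checkpoints."""
    pairs: List[ComparisonPair] = []
    for prompt, demo in zip(prompts, demonstrations):
        for sample in samplers:
            for _ in range(samples_per_checkpoint):
                pairs.append(
                    ComparisonPair(prompt=prompt, chosen=demo, rejected=sample(prompt))
                )
    return pairs
```

Under this framing, only a handful of demonstrations is needed because each one is reused against many cheaply generated model outputs, producing a comparatively large set of comparison pairs for preference optimization.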