Tuna: Instruction Tuning using Feedback from Large Language Models
October 20, 2023
Authors: Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, Furu Wei
cs.AI
Abstract
Instruction tuning of open-source large language models (LLMs) like LLaMA,
using direct outputs from more powerful LLMs such as Instruct-GPT and GPT-4,
has proven to be a cost-effective way to align model behaviors with human
preferences. However, the instruction-tuned model has only seen one response
per instruction, lacking the knowledge of potentially better responses. In this
paper, we propose finetuning an instruction-tuned LLM using our novel
probabilistic ranking and contextual ranking approaches to
increase the likelihood of generating better responses. Probabilistic ranking
enables the instruction-tuned model to inherit the relative rankings of
high-quality and low-quality responses from the teacher LLM. On the other hand,
learning with contextual ranking allows the model to refine its own response
distribution using the contextual understanding ability of stronger LLMs.
Furthermore, we apply probabilistic ranking and contextual ranking sequentially
to the instruction-tuned LLM. The resulting model, which we call Tuna,
consistently improves the performance on Super Natural Instructions (119 test
tasks), LMentry (25 test tasks), Vicuna QA, and can even obtain better results
than several strong reinforcement learning baselines. Our code and data are
available at https://github.com/microsoft/LMOps.
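Below is a minimal, unofficial sketch of the kind of ranking-based fine-tuning objective the abstract describes: for each instruction, several candidate responses are ordered from best to worst by a teacher LLM, and the student is trained so that its length-normalized sequence log-likelihoods respect that ordering via a pairwise margin loss. The helper names (`sequence_logprob`, `ranking_loss`), the Hugging-Face-style `model(input_ids).logits` interface, and the margin value are assumptions for illustration only; the actual Tuna objectives are defined in the paper and in the code at https://github.com/microsoft/LMOps.

```python
# Sketch only, not the official Tuna implementation.
# Assumes `model` is a causal LM whose forward pass returns `.logits`
# (Hugging Face style), and that candidates are pre-sorted best -> worst
# by the teacher LLM (probabilistic or contextual ranking).
import torch
import torch.nn.functional as F


def sequence_logprob(model, input_ids, response_mask):
    """Length-normalized log-probability of the response tokens only."""
    logits = model(input_ids).logits[:, :-1, :]      # predict token t+1 from prefix
    targets = input_ids[:, 1:]
    logp = torch.gather(
        F.log_softmax(logits, dim=-1), 2, targets.unsqueeze(-1)
    ).squeeze(-1)
    mask = response_mask[:, 1:].float()               # 1 on response tokens, 0 on prompt
    return (logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)


def ranking_loss(model, cand_ids, cand_masks, margin=0.1):
    """cand_ids / cand_masks: [N, seq_len], candidates sorted best -> worst."""
    scores = sequence_logprob(model, cand_ids, cand_masks)  # [N]
    loss = scores.new_zeros(())
    # Every higher-ranked candidate should outscore every lower-ranked one,
    # with a margin that grows with the rank gap.
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            loss = loss + F.relu(margin * (j - i) - (scores[i] - scores[j]))
    return loss
```

In this sketch, the same loss form can be driven by either ranking signal: teacher log-probabilities over the candidates (probabilistic ranking) or the teacher's in-context judgment of the student's own samples (contextual ranking), which the abstract applies sequentially.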