

Tuna: Instruction Tuning using Feedback from Large Language Models

October 20, 2023
Authors: Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, Furu Wei
cs.AI

Abstract

Instruction tuning of open-source large language models (LLMs) like LLaMA, using direct outputs from more powerful LLMs such as Instruct-GPT and GPT-4, has proven to be a cost-effective way to align model behaviors with human preferences. However, the instruction-tuned model has only seen one response per instruction, lacking the knowledge of potentially better responses. In this paper, we propose finetuning an instruction-tuned LLM using our novel probabilistic ranking and contextual ranking approaches to increase the likelihood of generating better responses. Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM. On the other hand, learning with contextual ranking allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs. Furthermore, we apply probabilistic ranking and contextual ranking sequentially to the instruction-tuned LLM. The resulting model, which we call Tuna, consistently improves the performance on Super Natural Instructions (119 test tasks), LMentry (25 test tasks), Vicuna QA, and can even obtain better results than several strong reinforcement learning baselines. Our code and data are available at https://github.com/microsoft/LMOps.
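The abstract describes probabilistic ranking as teaching the student to reproduce the teacher's relative ordering of high- and low-quality responses. Below is a minimal sketch of one way such an objective can be implemented: a pairwise margin ranking loss over length-normalized log-likelihoods, in the style of BRIO-type losses. It assumes a Hugging Face-style causal LM interface; the function names and the exact loss form are illustrative assumptions, not taken from the paper's released code at https://github.com/microsoft/LMOps.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, response_mask):
    """Length-normalized log-likelihood of each response under the student model.

    Assumes a Hugging Face-style causal LM whose forward pass returns `.logits`.
    `response_mask` is 1 on response tokens and 0 on instruction/padding tokens.
    """
    logits = model(input_ids).logits[:, :-1]            # predict the next token
    targets = input_ids[:, 1:]
    token_logp = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)
    mask = response_mask[:, 1:].float()                  # score response tokens only
    return (token_logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)

def pairwise_ranking_loss(logps, margin=0.1):
    """Hinge loss pushing logp(better) above logp(worse) by a rank-scaled margin.

    `logps` holds the student's scores for candidate responses to one instruction,
    ordered best to worst according to the teacher LLM's ranking.
    """
    loss, n = 0.0, logps.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            loss = loss + F.relu(logps[j] - logps[i] + margin * (j - i))
    return loss / (n * (n - 1) / 2)
```

In this reading, the teacher's ranking only fixes the order of the candidates; the student's own likelihoods are then reshaped so that better-ranked responses become more probable to generate.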
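Contextual ranking, as described in the abstract, instead has a stronger LLM rank the student's own samples for each instruction. A rough sketch of that step is below, using the OpenAI Python client as a stand-in for the ranker; the prompt wording, model choice, and reply parsing are assumptions for illustration, not the paper's actual pipeline.

```python
from openai import OpenAI  # any sufficiently strong ranker LLM could be substituted

client = OpenAI()

def rank_with_teacher(instruction, responses, model="gpt-4"):
    """Ask a stronger LLM to order the student's own samples from best to worst.

    Returns the responses reordered by the teacher's judgment; the reordered list
    can then be fed to the same pairwise ranking loss sketched above.
    """
    numbered = "\n".join(f"[{i}] {r}" for i, r in enumerate(responses))
    prompt = (
        f"Instruction:\n{instruction}\n\nCandidate responses:\n{numbered}\n\n"
        "Rank the candidates from best to worst. Reply with the indices only, "
        "separated by commas."
    )
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    order = [int(tok) for tok in reply.replace(" ", "").split(",") if tok.isdigit()]
    return [responses[i] for i in order]
```

Applying probabilistic ranking first and this contextual-ranking step second matches the sequential recipe the abstract attributes to the resulting Tuna model.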