Tuna: Instruction Tuning using Feedback from Large Language Models

Abstract

Instruction tuning of open-source large language models (LLMs) like LLaMA, using direct outputs from more powerful LLMs such as Instruct-GPT and GPT-4, has proven to be a cost-effective way to align model behaviors with human preferences. However, the instruction-tuned model has seen only one response per instruction and therefore lacks knowledge of potentially better responses. In this paper, we propose finetuning an instruction-tuned LLM using our novel probabilistic ranking and contextual ranking approaches to increase the likelihood of generating better responses. Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM. Learning with contextual ranking, on the other hand, allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs. Furthermore, we apply probabilistic ranking and contextual ranking sequentially to the instruction-tuned LLM. The resulting model, which we call Tuna, consistently improves performance on Super Natural Instructions (119 test tasks), LMentry (25 test tasks), and Vicuna QA, and can even obtain better results than several strong reinforcement learning baselines. Our code and data are available at https://github.com/microsoft/LMOps.
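
The abstract does not spell out the training objective, so the following is only a minimal sketch of how learning from a ranked set of responses could look, assuming a pairwise margin ranking loss over length-normalized log-likelihoods. The function names, the `margin` parameter, and the choice of loss are illustrative assumptions, not the method released in the repository.

```python
from typing import Sequence


def length_normalized_score(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability the student assigns to one response.

    Assumption: responses are scored by mean token log-likelihood so that
    longer responses are not penalized merely for their length.
    """
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def pairwise_ranking_loss(scores: Sequence[float], margin: float = 0.1) -> float:
    """Hypothetical pairwise margin ranking loss.

    `scores` holds the student's scores for candidate responses, ordered from
    best to worst according to the teacher's ranking. A pair contributes to
    the loss whenever the lower-ranked response is not scored at least
    `margin` below the higher-ranked one.
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            # Response i is ranked above response j by the teacher.
            loss += max(0.0, scores[j] - scores[i] + margin)
    return loss


if __name__ == "__main__":
    # Three candidate responses, listed in the teacher's preferred order.
    candidates = [
        [-0.2, -0.4, -0.3],   # ranked best
        [-0.9, -1.1],         # ranked second
        [-0.5, -0.6, -0.4],   # ranked worst, but the student scores it above the second
    ]
    student_scores = [length_normalized_score(c) for c in candidates]
    print(pairwise_ranking_loss(student_scores))
```

Under these assumptions the loss is zero only when the student's scores already respect the teacher's ordering with the required margin, so minimizing it pushes probability mass toward the responses the teacher ranks higher.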
