Tuna: Instruction Tuning using Feedback from Large Language Models

Abstract

Instruction tuning of open-source large language models (LLMs) like LLaMA, using direct outputs from more powerful LLMs such as Instruct-GPT and GPT-4, has proven to be a cost-effective way to align model behaviors with human preferences. However, the instruction-tuned model has seen only one response per instruction, so it lacks knowledge of potentially better responses. In this paper, we propose finetuning an instruction-tuned LLM using our novel probabilistic ranking and contextual ranking approaches to increase the likelihood of generating better responses. Probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher LLM. Learning with contextual ranking, on the other hand, allows the model to refine its own response distribution using the contextual understanding ability of stronger LLMs. Furthermore, we apply probabilistic ranking and contextual ranking sequentially to the instruction-tuned LLM. The resulting model, which we call Tuna, consistently improves performance on Super Natural Instructions (119 test tasks), LMentry (25 test tasks), and Vicuna QA, and can even obtain better results than several strong reinforcement learning baselines. Our code and data are available at https://github.com/microsoft/LMOps.
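The abstract does not give the training objective, so the following is only a minimal sketch of one common way to make a student model inherit a teacher's ranking: a pairwise hinge loss over the student's scores for several candidate responses to the same instruction. The function name `pairwise_ranking_loss`, the `margin` parameter, and the use of length-normalized log-likelihoods as scores are all assumptions made for illustration and may differ from the loss actually used for Tuna.

```python
from typing import Sequence


def pairwise_ranking_loss(
    student_scores: Sequence[float],
    teacher_ranks: Sequence[int],
    margin: float = 0.0,
) -> float:
    """Hinge-style ranking loss over all response pairs for one instruction.

    student_scores[i]: the student's score for response i (assumed here to be
        a length-normalized log-likelihood; higher means more likely).
    teacher_ranks[i]: the teacher's rank for response i (0 is best).
    Hypothetical helper for illustration, not taken from the Tuna codebase.
    """
    if len(student_scores) != len(teacher_ranks):
        raise ValueError("scores and ranks must have the same length")

    loss = 0.0
    n = len(student_scores)
    for i in range(n):
        for j in range(n):
            # Penalize only pairs where the teacher prefers i over j
            # but the student does not score i above j by the margin.
            if teacher_ranks[i] < teacher_ranks[j]:
                loss += max(0.0, margin + student_scores[j] - student_scores[i])
    return loss


# Student ordering matches the teacher's ranking.
agree = pairwise_ranking_loss([-0.5, -1.0, -2.0], [0, 1, 2])
# Student ordering is the reverse of the teacher's ranking.
disagree = pairwise_ranking_loss([-2.0, -1.0, -0.5], [0, 1, 2])
print(agree, disagree)
```

In this sketch the loss is zero when the student already orders the responses the way the teacher does, and it grows with each pair the student orders differently, which is the sense in which a ranking objective pushes probability toward the responses the teacher rates higher.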
