How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

Abstract

In this work we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied by limited evaluation, making it difficult to compare models across the board and determine the utility of various resources. We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets ranging from manually curated (e.g., OpenAssistant) to synthetic and distilled (e.g., Alpaca). We systematically evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction-following abilities through a collection of automatic, model-based, and human-based metrics. We further introduce Tülu, our best-performing instruction-tuned model suite, finetuned on a combination of high-quality open resources. Our experiments show that different instruction-tuning datasets can uncover or enhance specific skills, while no single dataset (or combination) provides the best performance across all evaluations. Interestingly, we find that model and human preference-based evaluations fail to reflect differences in model capabilities exposed by benchmark-based evaluations, suggesting the need for the type of systematic evaluation performed in this work. Our evaluations show that the best model in any given evaluation reaches on average 83% of ChatGPT performance and 68% of GPT-4 performance, suggesting that further investment in building better base models and instruction-tuning data is required to close the gap. We release our instruction-tuned models, including a fully finetuned 65B Tülu, along with our code, data, and evaluation framework at https://github.com/allenai/open-instruct to facilitate future research.
