ChatPaper.aiChatPaper

INSTRUCTEVAL:朝向全面評估調校過的大型語言模型的方向

INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

June 7, 2023
作者: Yew Ken Chia, Pengfei Hong, Lidong Bing, Soujanya Poria
cs.AI

摘要

指令調校的大型語言模型已經在自然語言處理領域引起了革命,並展現了在對話代理等應用中的巨大潛力。這些模型,如GPT-4,不僅能掌握語言,還能解決數學、編碼、醫學和法律等領域的複雜任務。儘管它們具有令人印象深刻的能力,但由於許多模型的黑盒特性以及缺乏全面評估研究,對它們的全部潛力仍存在著理解上的不足。為應對這些挑戰,我們提出了INSTRUCTEVAL,這是一個專門為指令調校的大型語言模型設計的更全面的評估套件。與以往的研究不同,我們的評估包括對模型在解決問題、寫作能力和與人類價值觀的一致性方面進行嚴格評估。我們採取了一種全面的方法來分析影響模型性能的各種因素,包括預訓練基礎、指令調校數據和訓練方法。我們的研究結果顯示,指令數據的質量是影響模型性能的最關鍵因素。儘管開源模型展示了令人印象深刻的寫作能力,但在解決問題和一致性方面仍有很大的改進空間。我們對開源社區快速發展模型感到鼓舞,但也強調了對這些模型所做聲稱的嚴格評估的必要性。通過INSTRUCTEVAL,我們旨在促進對指令調校模型的更深入理解以及其能力的進步。INSTRUCTEVAL可以在https://github.com/declare-lab/instruct-eval 公開獲取。
English
Instruction-tuned large language models have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve complex tasks in areas like mathematics, coding, medicine, and law. Despite their impressive capabilities, there is still a lack of comprehensive understanding regarding their full potential, primarily due to the black-box nature of many models and the absence of holistic evaluation studies. To address these challenges, we present INSTRUCTEVAL, a more comprehensive evaluation suite designed specifically for instruction-tuned large language models. Unlike previous works, our evaluation involves a rigorous assessment of models based on problem-solving, writing ability, and alignment to human values. We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods. Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance. While open-source models demonstrate impressive writing abilities, there is substantial room for improvement in problem-solving and alignment. We are encouraged by the rapid development of models by the open-source community, but we also highlight the need for rigorous evaluation to support claims made about these models. Through INSTRUCTEVAL, we aim to foster a deeper understanding of instruction-tuned models and advancements in their capabilities. INSTRUCTEVAL is publicly available at https://github.com/declare-lab/instruct-eval.
PDF60December 15, 2024