INSTRUCTEVAL：朝向全面評估調校過的大型語言模型的方向

摘要

指令調校的大型語言模型已經在自然語言處理領域引起了革命，並展現了在對話代理等應用中的巨大潛力。這些模型，如GPT-4，不僅能掌握語言，還能解決數學、編碼、醫學和法律等領域的複雜任務。儘管它們具有令人印象深刻的能力，但由於許多模型的黑盒特性以及缺乏全面評估研究，對它們的全部潛力仍存在著理解上的不足。為應對這些挑戰，我們提出了INSTRUCTEVAL，這是一個專門為指令調校的大型語言模型設計的更全面的評估套件。與以往的研究不同，我們的評估包括對模型在解決問題、寫作能力和與人類價值觀的一致性方面進行嚴格評估。我們採取了一種全面的方法來分析影響模型性能的各種因素，包括預訓練基礎、指令調校數據和訓練方法。我們的研究結果顯示，指令數據的質量是影響模型性能的最關鍵因素。儘管開源模型展示了令人印象深刻的寫作能力，但在解決問題和一致性方面仍有很大的改進空間。我們對開源社區快速發展模型感到鼓舞，但也強調了對這些模型所做聲稱的嚴格評估的必要性。通過INSTRUCTEVAL，我們旨在促進對指令調校模型的更深入理解以及其能力的進步。INSTRUCTEVAL可以在https://github.com/declare-lab/instruct-eval 公開獲取。

English

Instruction-tuned large language models have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve complex tasks in areas like mathematics, coding, medicine, and law. Despite their impressive capabilities, there is still a lack of comprehensive understanding regarding their full potential, primarily due to the black-box nature of many models and the absence of holistic evaluation studies. To address these challenges, we present INSTRUCTEVAL, a more comprehensive evaluation suite designed specifically for instruction-tuned large language models. Unlike previous works, our evaluation involves a rigorous assessment of models based on problem-solving, writing ability, and alignment to human values. We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods. Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance. While open-source models demonstrate impressive writing abilities, there is substantial room for improvement in problem-solving and alignment. We are encouraged by the rapid development of models by the open-source community, but we also highlight the need for rigorous evaluation to support claims made about these models. Through INSTRUCTEVAL, we aim to foster a deeper understanding of instruction-tuned models and advancements in their capabilities. INSTRUCTEVAL is publicly available at https://github.com/declare-lab/instruct-eval.

INSTRUCTEVAL：朝向全面評估調校過的大型語言模型的方向

INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

摘要

Support