

INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

June 7, 2023
作者: Yew Ken Chia, Pengfei Hong, Lidong Bing, Soujanya Poria
cs.AI

Abstract

Instruction-tuned large language models have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve complex tasks in areas like mathematics, coding, medicine, and law. Despite their impressive capabilities, there is still a lack of comprehensive understanding regarding their full potential, primarily due to the black-box nature of many models and the absence of holistic evaluation studies. To address these challenges, we present INSTRUCTEVAL, a more comprehensive evaluation suite designed specifically for instruction-tuned large language models. Unlike previous works, our evaluation involves a rigorous assessment of models based on problem-solving, writing ability, and alignment to human values. We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods. Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance. While open-source models demonstrate impressive writing abilities, there is substantial room for improvement in problem-solving and alignment. We are encouraged by the rapid development of models by the open-source community, but we also highlight the need for rigorous evaluation to support claims made about these models. Through INSTRUCTEVAL, we aim to foster a deeper understanding of instruction-tuned models and advancements in their capabilities. INSTRUCTEVAL is publicly available at https://github.com/declare-lab/instruct-eval.