

Benchmarking Large Language Model Capabilities for Conditional Generation

June 29, 2023
Authors: Joshua Maynez, Priyanka Agrawal, Sebastian Gehrmann
cs.AI

Abstract

Pre-trained large language models (PLMs) underlie most new developments in natural language processing. They have shifted the field from application-specific model pipelines to a single model that is adapted to a wide range of tasks. Autoregressive PLMs like GPT-3 or PaLM, alongside techniques like few-shot learning, have additionally shifted the output modality to generation instead of classification or regression. Despite their ubiquitous use, the generation quality of language models is rarely evaluated when these models are introduced. Additionally, it is unclear how existing generation tasks--while they can be used to compare systems at a high level--relate to the real-world use cases for which people have been adopting them. In this work, we discuss how to adapt existing application-specific generation benchmarks to PLMs and provide an in-depth, empirical study of the limitations and capabilities of PLMs in natural language generation tasks along dimensions such as scale, architecture, and input and output language. Our results show that PLMs differ in their applicability to different data regimes and their generalization to multiple languages, and they inform which PLMs to use for a given generation task setup. We share best practices to be taken into consideration when benchmarking generation capabilities during the development of upcoming PLMs.
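To make the evaluation setup concrete, below is a minimal sketch of how a PLM might be benchmarked on a conditional generation task with few-shot prompting and an automatic overlap metric. This is an illustration only, not the paper's protocol: the `build_few_shot_prompt`, `model_generate`, and `unigram_f1` functions and the tiny dataset are hypothetical stand-ins for a real PLM API, a benchmark such as a summarization dataset, and a metric such as ROUGE.

```python
# Illustrative few-shot conditional-generation benchmark harness (hypothetical).
from collections import Counter

def build_few_shot_prompt(exemplars, test_input, k=2):
    """Concatenate k input/output exemplars ahead of the test input."""
    lines = [f"Input: {src}\nOutput: {tgt}" for src, tgt in exemplars[:k]]
    lines.append(f"Input: {test_input}\nOutput:")
    return "\n\n".join(lines)

def model_generate(prompt):
    """Stand-in for a PLM call; echoes a truncation of the last input."""
    last_input = prompt.rsplit("Input: ", 1)[-1].split("\n")[0]
    return " ".join(last_input.split()[:5])

def unigram_f1(prediction, reference):
    """Simple token-overlap F1, a rough proxy for metrics like ROUGE-1."""
    pred, ref = Counter(prediction.split()), Counter(reference.split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

exemplars = [
    ("The cat sat on the warm mat all afternoon.", "Cat rests on mat."),
    ("Rain fell steadily over the quiet harbor town.", "Rain over harbor town."),
]
test_set = [("A long report describing quarterly sales growth across regions.",
             "Quarterly sales grew across regions.")]

scores = []
for source, reference in test_set:
    prediction = model_generate(build_few_shot_prompt(exemplars, source))
    scores.append(unigram_f1(prediction, reference))

print(f"Mean unigram F1: {sum(scores) / len(scores):.3f}")
```

In a study like the one described above, the same harness would be run across model scales, architectures, and input/output languages, holding the prompt format and metric fixed so that scores remain comparable.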