
Benchmarking Large Language Model Capabilities for Conditional Generation

June 29, 2023
作者: Joshua Maynez, Priyanka Agrawal, Sebastian Gehrmann
cs.AI

Abstract

Pre-trained large language models (PLMs) underlie most new developments in natural language processing. They have shifted the field from application-specific model pipelines to a single model that is adapted to a wide range of tasks. Autoregressive PLMs like GPT-3 or PaLM, alongside techniques like few-shot learning, have additionally shifted the output modality to generation instead of classification or regression. Despite their ubiquitous use, the generation quality of language models is rarely evaluated when these models are introduced. Additionally, it is unclear how existing generation tasks, while useful for comparing systems at a high level, relate to the real-world use cases for which people have been adopting them. In this work, we discuss how to adapt existing application-specific generation benchmarks to PLMs and provide an in-depth, empirical study of the limitations and capabilities of PLMs in natural language generation tasks along dimensions such as scale, architecture, and input and output language. Our results show that PLMs differ in their applicability to different data regimes and in their generalization to multiple languages, and they inform which PLMs to use for a given generation task setup. We share best practices to be taken into consideration when benchmarking generation capabilities during the development of upcoming PLMs.