Benchmarking Large Language Model Capabilities for Conditional Generation

Abstract

Pre-trained large language models (PLMs) underlie most new developments in natural language processing. They have shifted the field from application-specific model pipelines to a single model that is adapted to a wide range of tasks. Autoregressive PLMs like GPT-3 or PaLM, alongside techniques like few-shot learning, have additionally shifted the output modality from classification or regression to generation. Despite their ubiquitous use, the generation quality of language models is rarely evaluated when these models are introduced. Additionally, it is unclear how existing generation tasks, although they can be used to compare systems at a high level, relate to the real-world use cases for which people have been adopting them. In this work, we discuss how to adapt existing application-specific generation benchmarks to PLMs and provide an in-depth empirical study of the limitations and capabilities of PLMs in natural language generation tasks along dimensions such as scale, architecture, and input and output language. Our results show that PLMs differ in their applicability to different data regimes and in their generalization to multiple languages, and they inform which PLMs to use for a given generation task setup. We share best practices to be taken into consideration when benchmarking generation capabilities during the development of upcoming PLMs.
