Benchmarking Large Language Model Capabilities for Conditional Generation

Abstract

Pre-trained large language models (PLMs) underlie most recent developments in natural language processing. They have shifted the field from application-specific model pipelines to a single model that is adapted to a wide range of tasks. Autoregressive PLMs such as GPT-3 and PaLM, together with techniques such as few-shot learning, have additionally shifted the output modality from classification or regression to generation. Despite their ubiquitous use, the generation quality of language models is rarely evaluated when these models are introduced. It is also unclear how existing generation tasks, while useful for comparing systems at a high level, relate to the real-world use cases for which people have been adopting these models. In this work, we discuss how to adapt existing application-specific generation benchmarks to PLMs and provide an in-depth empirical study of the limitations and capabilities of PLMs in natural language generation tasks along dimensions such as scale, architecture, and input and output language. Our results show that PLMs differ in their applicability to different data regimes and in their generalization to multiple languages, and they inform which PLMs to use for a given generation task setup. We share best practices to consider when benchmarking generation capabilities during the development of upcoming PLMs.
