Catwalk: A Unified Language Model Evaluation Framework for Many Datasets

Abstract

The success of large language models has shifted the evaluation paradigms in natural language processing (NLP). The community's interest has drifted towards comparing NLP models across many tasks, domains, and datasets, often at an extreme scale. This imposes new engineering challenges: efforts in constructing datasets and models have been fragmented, and their formats and interfaces are incompatible. As a result, it often takes extensive (re)implementation effort to make fair and controlled comparisons at scale. Catwalk aims to address these issues. It provides a unified interface to a broad range of existing NLP datasets and models, covering canonical supervised training and fine-tuning as well as more modern paradigms such as in-context learning. Its carefully designed abstractions allow for easy extension to many others. Catwalk substantially lowers the barriers to conducting controlled experiments at scale. For example, we fine-tuned and evaluated over 64 models on over 86 datasets with a single command, without writing any code. Maintained by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2), Catwalk is an ongoing open-source effort: https://github.com/allenai/catwalk.
