Catwalk: A Unified Language Model Evaluation Framework for Many Datasets

Abstract

The success of large language models has shifted the evaluation paradigm in natural language processing (NLP). The community's interest has moved toward comparing NLP models across many tasks, domains, and datasets, often at extreme scale. This poses new engineering challenges: efforts to construct datasets and models have been fragmented, and their formats and interfaces are incompatible. As a result, fair and controlled comparisons at scale often require extensive (re)implementation effort. Catwalk aims to address these issues. It provides a unified interface to a broad range of existing NLP datasets and models, covering canonical supervised training and fine-tuning as well as more modern paradigms such as in-context learning. Its carefully designed abstractions allow for easy extension to many others. Catwalk substantially lowers the barrier to conducting controlled experiments at scale. For example, we fine-tuned and evaluated over 64 models on over 86 datasets with a single command, without writing any code. Maintained by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2), Catwalk is an ongoing open-source effort: https://github.com/allenai/catwalk.
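
To make the idea of a unified interface concrete, here is a minimal sketch of how a shared abstraction lets every model be scored on every dataset with one loop. It is not Catwalk's actual API: the names `Instance`, `Task`, `Model`, `ListTask`, `KeywordModel`, `accuracy`, and `evaluate_all` are hypothetical, and the toy data is invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Sequence, Tuple


@dataclass(frozen=True)
class Instance:
    """One evaluation example: an input text and its gold label."""
    text: str
    label: str


class Task(Protocol):
    """Hypothetical dataset-side interface (an assumption, not Catwalk's)."""
    name: str

    def instances(self) -> Sequence[Instance]: ...


class Model(Protocol):
    """Hypothetical model-side interface (an assumption, not Catwalk's)."""
    name: str

    def predict(self, texts: Sequence[str]) -> List[str]: ...


@dataclass
class ListTask:
    """A toy task backed by an in-memory list of instances."""
    name: str
    data: List[Instance]

    def instances(self) -> Sequence[Instance]:
        return self.data


@dataclass
class KeywordModel:
    """A toy classifier: 'positive' if the keyword occurs, else 'negative'."""
    name: str
    keyword: str

    def predict(self, texts: Sequence[str]) -> List[str]:
        return ["positive" if self.keyword in t else "negative" for t in texts]


def accuracy(model: Model, task: Task) -> float:
    """Fraction of instances where the prediction equals the gold label."""
    examples = task.instances()
    if not examples:
        return 0.0
    predictions = model.predict([e.text for e in examples])
    correct = sum(p == e.label for p, e in zip(predictions, examples))
    return correct / len(examples)


def evaluate_all(
    models: Sequence[Model], tasks: Sequence[Task]
) -> Dict[Tuple[str, str], float]:
    """Score every model on every task through the same two interfaces."""
    return {(m.name, t.name): accuracy(m, t) for m in models for t in tasks}


# Invented toy data, used only to exercise the sketch.
tasks = [
    ListTask("reviews", [Instance("good movie", "positive"),
                         Instance("bad plot", "negative")]),
    ListTask("tweets", [Instance("good day", "positive"),
                        Instance("great food", "positive")]),
]
models = [KeywordModel("kw-good", "good"), KeywordModel("kw-great", "great")]
results = evaluate_all(models, tasks)
```

Because both sides implement a small common interface, adding a model or a dataset requires no change to the evaluation loop, which is the property the abstract attributes to Catwalk's abstractions.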

