
Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text

September 3, 2024
Authors: Michael Burnham, Kayla Kahn, Ryan Yank Wang, Rachel X. Peng
cs.AI

Abstract

Social scientists quickly adopted large language models due to their ability to annotate documents without supervised training, an ability known as zero-shot learning. However, due to their compute demands, cost, and often proprietary nature, these models are frequently at odds with replication and open science standards. This paper introduces the Political DEBATE (DeBERTa Algorithm for Textual Entailment) language models for zero-shot and few-shot classification of political documents. These models are not only as good as, or better than, state-of-the-art large language models at zero-shot and few-shot classification, but are orders of magnitude more efficient and completely open source. Trained on a simple random sample of only 10-25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents, as well as state-of-the-art generative models with complex, engineered prompts. Additionally, we release the PolNLI dataset used to train these models -- a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
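Because the DEBATE models are standard NLI (textual entailment) classifiers, they can be driven through the Hugging Face transformers zero-shot pipeline: each candidate label is wrapped in an entailment hypothesis, and the model scores whether the document entails it. The sketch below is a minimal illustration; the model identifier, document, and labels are assumptions for demonstration, not the paper's exact release.

```python
# Minimal sketch of NLI-style zero-shot classification with a DeBERTa
# entailment model via the Hugging Face `transformers` pipeline.
# NOTE: the checkpoint name and labels are illustrative assumptions;
# substitute the released Political DEBATE checkpoint and your own tasks.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="mlburnham/Political_DEBATE_large_v1.0",  # assumed identifier
)

doc = "The senator introduced a bill to expand rural broadband access."
labels = ["technology policy", "healthcare policy", "immigration policy"]

# Each label is inserted into the hypothesis template; the model scores
# whether the document entails the resulting statement.
result = classifier(
    doc,
    candidate_labels=labels,
    hypothesis_template="This text is about {}.",
)
print(result["labels"][0], round(result["scores"][0], 3))
```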
