
Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text

September 3, 2024
Authors: Michael Burnham, Kayla Kahn, Ryan Yank Wang, Rachel X. Peng
cs.AI

Abstract

Social scientists quickly adopted large language models due to their ability to annotate documents without supervised training, an ability known as zero-shot learning. However, due to their compute demands, cost, and often proprietary nature, these models are often at odds with replication and open science standards. This paper introduces the Political DEBATE (DeBERTa Algorithm for Textual Entailment) language models for zero-shot and few-shot classification of political documents. These models are not only as good as, or better than, state-of-the-art large language models at zero-shot and few-shot classification, but are orders of magnitude more efficient and completely open source. By training the models on a simple random sample of 10-25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents and state-of-the-art generative models with complex, engineered prompts. Additionally, we release the PolNLI dataset used to train these models -- a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
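The entailment approach named in the model's title reframes classification as a natural language inference task: the document is the premise, and each candidate label is turned into a hypothesis the model scores for entailment. A minimal sketch of that framing is below; the helper name and hypothesis template are illustrative assumptions, not taken from the paper:

```python
def build_nli_pairs(document, labels, template="This text is about {}."):
    """Turn a classification task into NLI: pair the document (premise)
    with one hypothesis per candidate label. An entailment model such as
    DeBERTa then scores each pair, and the label whose hypothesis is most
    entailed wins. The template string here is a hypothetical example."""
    return [(document, template.format(label)) for label in labels]


pairs = build_nli_pairs(
    "The senator introduced a bill to expand rural broadband.",
    ["technology policy", "immigration"],
)
for premise, hypothesis in pairs:
    print(premise, "=>", hypothesis)
```

Because any label can be phrased as a hypothesis, the same trained model handles new classification tasks without task-specific retraining, which is what makes the zero-shot setting possible.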
