Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text
September 3, 2024
Authors: Michael Burnham, Kayla Kahn, Ryan Yank Wang, Rachel X. Peng
cs.AI
Abstract
Social scientists quickly adopted large language models due to their ability
to annotate documents without supervised training, an ability known as
zero-shot learning. However, due to their compute demands, cost, and often
proprietary nature, these models are frequently at odds with replication and
open science standards. This paper introduces the Political DEBATE (DeBERTa
Algorithm for Textual Entailment) language models for zero-shot and few-shot
classification of political documents. These models are not only as good as,
or better than, state-of-the-art large language models at zero-shot and
few-shot classification, but are orders of magnitude more efficient and
completely open source. When trained on a simple random sample of just 10-25
documents, they can outperform supervised classifiers trained on hundreds or
thousands of documents, as well as state-of-the-art generative models with
complex, engineered prompts. Additionally, we release the PolNLI dataset used
to train these models -- a corpus of over 200,000 political documents with
highly accurate labels across over 800 classification tasks.
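The few-shot setting can likewise be sketched as light fine-tuning of an NLI checkpoint on a handful of hand-labeled document/hypothesis pairs. This is a minimal sketch under stated assumptions, not the authors' recipe: the base checkpoint, the two-class entailment framing, the hyperparameters, and the training examples are all placeholders.

```python
# Minimal few-shot fine-tuning sketch (placeholder model, data, and settings;
# not the authors' training recipe). Label 0 = entailment, 1 = not entailment.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/deberta-v3-base"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A simple random sample of ~10-25 labeled (document, hypothesis, label) pairs.
train = [
    ("The senator proposed new tariffs on imports.", "This text is about trade policy.", 0),
    ("The mayor attended a charity gala downtown.", "This text is about trade policy.", 1),
    # ... remaining hand-labeled pairs ...
]

# Encode documents and hypotheses as sentence pairs, as NLI models expect.
enc = tokenizer([d for d, h, _ in train], [h for _, h, _ in train],
                padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([y for _, _, y in train])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(5):  # a few epochs are enough for such a tiny sample
    optimizer.zero_grad()
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
```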