LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Abstract

With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for improving readers' access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher to build an IPTC Media Topic training dataset through automatic annotation of news articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits high zero-shot performance on all four languages, and its agreement with human annotators is comparable to the agreement between the human annotators themselves. To mitigate the computational limitations that arise when millions of texts must be processed daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve performance comparable to that of the teacher model. Furthermore, we explore the impact of training data size on the performance of the student models and investigate their monolingual, multilingual, and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances and that they demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.
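
The teacher-student flow described in the abstract can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration and not the authors' implementation: `teacher_annotate` is a hypothetical keyword rule standing in for the GPT teacher, `NaiveBayesStudent` is a tiny bag-of-words classifier standing in for the fine-tuned BERT-like student, the three labels are an illustrative subset and not the full set of top-level IPTC Media Topic categories, and the example sentences are invented.

```python
from collections import Counter, defaultdict
import math

# Illustrative label subset (assumption); the real schema has more top-level categories.
LABELS = ["sport", "politics", "health"]


def teacher_annotate(text):
    """Hypothetical stand-in for the GPT teacher: a keyword rule, not an LLM call."""
    keywords = {
        "sport": {"match", "goal", "team"},
        "politics": {"election", "parliament", "minister"},
        "health": {"hospital", "vaccine", "doctor"},
    }
    tokens = set(text.lower().split())
    # Pick the label whose keyword set overlaps most with the text.
    return max(LABELS, key=lambda label: len(tokens & keywords[label]))


class NaiveBayesStudent:
    """Tiny bag-of-words student; a stand-in for a fine-tuned BERT-like model."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()            # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocab.update(tokens)

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.label_counts.values())

        def log_prob(label):
            counts = self.word_counts[label]
            # Laplace (add-one) smoothing over the training vocabulary.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            return score

        return max(self.label_counts, key=log_prob)


# Step 1: the teacher labels unannotated articles (no manual annotation involved).
unlabeled = [
    "the team scored a late goal in the match",
    "the minister addressed parliament before the election",
    "the hospital doctor recommended the vaccine",
    "a goal decided the match for the home team",
    "parliament debated the election law",
    "the doctor visited the hospital ward",
]
silver_labels = [teacher_annotate(text) for text in unlabeled]

# Step 2: the cheaper student is trained only on the teacher's labels.
student = NaiveBayesStudent()
student.fit(unlabeled, silver_labels)

# Step 3: the student, not the teacher, serves predictions on new text.
prediction = student.predict("the team celebrated the goal")
```

The point of the sketch is the division of labour: the expensive annotator is called once per training article, while inference on new text runs on the small model alone. Swapping the stand-ins for an actual LLM call and an actual transformer fine-tuning loop would keep the same three steps.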
