WikiNER-fr-gold: A Gold-Standard NER Corpus

Abstract

In this article, we address the quality of the WikiNER corpus, a multilingual Named Entity Recognition corpus, and provide a consolidated version of it. The annotation of WikiNER was produced in a semi-supervised manner, i.e., no manual verification was carried out a posteriori. Such a corpus is called silver-standard. In this paper, we propose WikiNER-fr-gold, a revised version of the French portion of WikiNER. Our corpus consists of a random sample of 20% of the original French sub-corpus (26,818 sentences with 700k tokens). We start by summarizing the entity types included in each category in order to define an annotation guideline, and then we proceed to revise the corpus. Finally, we present an analysis of the errors and inconsistencies observed in the WikiNER-fr corpus, and we discuss potential directions for future work.