A New Pair of GloVes

Abstract

This report documents, describes, and evaluates new 2024 English GloVe (Global Vectors for Word Representation) models. While the original GloVe models built in 2014 have been widely used and found useful, languages and the world continue to evolve, and we thought that current usage could benefit from updated models. Moreover, the 2014 models were not carefully documented as to the exact data versions and preprocessing that were used, and we rectify this by documenting these new models. We trained two sets of word embeddings using Wikipedia, Gigaword, and a subset of Dolma. Evaluation through vocabulary comparison, direct testing, and NER tasks shows that the 2024 vectors incorporate new culturally and linguistically relevant words, perform comparably on structural tasks like analogy and similarity, and demonstrate improved performance on recent, temporally dependent NER datasets such as non-Western newswire data.
