A New Pair of GloVes
July 24, 2025
Authors: Riley Carlson, John Bauer, Christopher D. Manning
cs.AI
Abstract
This report documents, describes, and evaluates new 2024 English GloVe
(Global Vectors for Word Representation) models. While the original GloVe
models built in 2014 have been widely used and found useful, languages and the
world continue to evolve and we thought that current usage could benefit from
updated models. Moreover, the 2014 models were not carefully documented as to
the exact data versions and preprocessing that were used, and we rectify this
by documenting these new models. We trained two sets of word embeddings using
Wikipedia, Gigaword, and a subset of Dolma. Evaluation through vocabulary
comparison, direct testing, and NER tasks shows that the 2024 vectors
incorporate new culturally and linguistically relevant words, perform
comparably on structural tasks like analogy and similarity, and demonstrate
improved performance on recent, temporally dependent NER datasets such as
non-Western newswire data.