A New Pair of GloVes

July 24, 2025
Authors: Riley Carlson, John Bauer, Christopher D. Manning
cs.AI

Abstract

This report documents, describes, and evaluates new 2024 English GloVe (Global Vectors for Word Representation) models. While the original GloVe models built in 2014 have been widely used and found useful, languages and the world continue to evolve and we thought that current usage could benefit from updated models. Moreover, the 2014 models were not carefully documented as to the exact data versions and preprocessing that were used, and we rectify this by documenting these new models. We trained two sets of word embeddings using Wikipedia, Gigaword, and a subset of Dolma. Evaluation through vocabulary comparison, direct testing, and NER tasks shows that the 2024 vectors incorporate new culturally and linguistically relevant words, perform comparably on structural tasks like analogy and similarity, and demonstrate improved performance on recent, temporally dependent NER datasets such as non-Western newswire data.
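
To make the "analogy and similarity" evaluations mentioned above concrete, here is a minimal sketch of how such direct tests are commonly run on word vectors. It assumes the plain-text GloVe layout (one token per line followed by its space-separated float components) and uses the standard vector-offset analogy method. The helper names (`parse_glove_lines`, `cosine`, `analogy`) and the toy three-dimensional vectors are hypothetical, made up for illustration. They are not the authors' evaluation code or real GloVe values.

```python
import math


def parse_glove_lines(lines):
    """Parse lines in plain-text GloVe layout: a token, then its float components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Answer "a is to b as c is to ?" by maximizing cos(d, b - a + c).

    The three query words are excluded from the candidates, as is
    conventional for the vector-offset method.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors invented for this example; NOT real GloVe values.
TOY_LINES = [
    "man 1.0 0.0 0.1",
    "woman 1.0 1.0 0.1",
    "king 1.0 0.0 1.0",
    "queen 1.0 1.0 1.0",
    "apple 0.1 0.2 -1.0",
]

toy = parse_glove_lines(TOY_LINES)
print(analogy(toy, "man", "woman", "king"))
print(cosine(toy["king"], toy["queen"]))
```

With real vectors, the same functions would be fed the lines of a downloaded vector file instead of `TOY_LINES`. A full-vocabulary search would normally use a vectorized library rather than pure Python loops.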