DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling

February 12, 2026
Authors: Mariia Fedorova, Andrey Kutuzov, Khonzoda Umarova
cs.AI

Abstract

In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as an approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021, and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving other researchers free to select their own target words from the same datasets. DHPLT aims to fill the current lack of multilingual diachronic corpora for semantic change modelling beyond a dozen high-resource languages, and it opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at https://data.hplt-project.org/three/diachronic/, sorted by language.
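
As a rough illustration of the timestamp-based bucketing described in the abstract, the following minimal sketch maps a crawl timestamp to one of the three periods. It is not the authors' pipeline: the function name `assign_period`, the ISO 8601 timestamp format, and the choice to treat each period as whole calendar years are all assumptions made for this example.

```python
from datetime import datetime
from typing import Optional

# Period labels come from the abstract. Treating each period as a span of
# whole calendar years is an assumption of this sketch.
PERIODS = [
    ("2011-2015", 2011, 2015),
    ("2020-2021", 2020, 2021),
    ("2024-present", 2024, None),  # open-ended upper bound
]


def assign_period(crawl_timestamp: str) -> Optional[str]:
    """Map an ISO 8601 crawl timestamp to a period label.

    Returns None when the timestamp falls outside all three periods
    (for example, a document crawled in 2017).
    """
    year = datetime.fromisoformat(crawl_timestamp).year
    for label, start, end in PERIODS:
        if year >= start and (end is None or year <= end):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical timestamps, used only to exercise the function.
    for ts in ["2013-06-01T12:00:00", "2017-01-01T00:00:00", "2024-03-05T08:30:00"]:
        print(ts, "->", assign_period(ts))
```

Documents whose timestamps fall in the gaps between periods are simply left unassigned here; how the actual collection treats such documents is not stated in the abstract.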