DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain
February 25, 2026
Authors: Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu
cs.AI
Abstract
We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution.
We demonstrate the utility of DLT-Corpus by analyzing technology emergence patterns and market-innovation correlations. Our findings show that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grows independently of market fluctuations, tracking overall market expansion in a virtuous cycle where research precedes and enables economic growth that funds further innovation.
We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving a 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.