DLT-Corpus: A Large-Scale Text Dataset for the Distributed Ledger Technology Domain

Abstract

We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving the domain's specialized language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate the utility of DLT-Corpus by analyzing technology emergence patterns and market-innovation correlations. Our findings show that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. Social media sentiment remains overwhelmingly bullish even during crypto winters, whereas scientific and patent activity grows independently of short-term market fluctuations. Over the long run, this activity tracks overall market expansion, forming a virtuous cycle in which research precedes and enables economic growth, which in turn funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model that achieves a 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.