DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

Abstract

We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's roughly $3 trillion market capitalization and rapid technological evolution. We demonstrate the utility of DLT-Corpus by analyzing technology emergence patterns and market-innovation correlations. Our findings reveal that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. Social media sentiment remains overwhelmingly bullish even during crypto winters, whereas scientific and patent activity grows independently of market fluctuations and instead tracks overall market expansion. This forms a virtuous cycle in which research precedes and enables economic growth, which in turn funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving a 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.
