DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

Abstract

We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate the utility of DLT-Corpus by analyzing technology emergence patterns and market-innovation correlations. Our findings indicate that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. Social media sentiment remains overwhelmingly bullish even during crypto winters, whereas scientific and patent activity grows independently of market fluctuations and tracks overall market expansion. This suggests a virtuous cycle in which research precedes and enables economic growth, which in turn funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving a 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.
