DarkBERT: A Language Model for the Dark Side of the Internet
May 15, 2023
Authors: Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee, Seungwon Shin
cs.AI
Abstract
Recent research has suggested that there are clear differences in the
language used in the Dark Web compared to that of the Surface Web. As studies
on the Dark Web commonly require textual analysis of the domain, language
models specific to the Dark Web may provide valuable insights to researchers.
In this work, we introduce DarkBERT, a language model pretrained on Dark Web
data. We describe the steps taken to filter and compile the text data used to
train DarkBERT to combat the extreme lexical and structural diversity of the
Dark Web that may be detrimental to building a proper representation of the
domain. We evaluate DarkBERT and its vanilla counterpart along with other
widely used language models to validate the benefits that a Dark Web
domain-specific model offers in various use cases. Our evaluations show that
DarkBERT
outperforms current language models and may serve as a valuable resource for
future research on the Dark Web.
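
For readers who want to experiment with a domain model of this kind, the sketch below shows how a pretrained masked language model can be loaded and queried with the Hugging Face transformers library. The checkpoint identifier s2w-ai/DarkBERT is an assumption (access to the released weights may require approval); the example sentence is purely illustrative.

```python
# Minimal sketch: querying a masked language model pretrained on Dark Web
# text via Hugging Face transformers. The checkpoint name below is an
# assumption; substitute whatever identifier or local path you actually have.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

MODEL_ID = "s2w-ai/DarkBERT"  # hypothetical / gated checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# fill-mask returns the model's top candidates for the masked position;
# using tokenizer.mask_token avoids hard-coding a mask format.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
sentence = f"This onion site hosts a {tokenizer.mask_token} marketplace."
for prediction in fill_mask(sentence):
    print(f"{prediction['token_str']:>15}  {prediction['score']:.3f}")
```

Reading the mask token from the tokenizer rather than writing `<mask>` or `[MASK]` directly keeps the snippet valid whether the underlying encoder follows RoBERTa or BERT conventions.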