DarkBERT: A Language Model for the Dark Side of the Internet
May 15, 2023
Authors: Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee, Seungwon Shin
cs.AI
Abstract
Recent research has suggested that there are clear differences between the language used on the Dark Web and that of the Surface Web. Because studies of the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, we introduce DarkBERT, a language model pretrained on Dark Web data. We describe the steps taken to filter and compile the text data used to train DarkBERT, countering the extreme lexical and structural diversity of the Dark Web that may otherwise be detrimental to building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart, along with other widely used language models, to validate the benefits that a Dark Web domain-specific model offers across various use cases. Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web.
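
DarkBERT is built on the RoBERTa architecture, so the pretrained checkpoint can be probed with standard masked-language-model tooling. The sketch below uses Hugging Face transformers; the model identifier and the example prompt are assumptions for illustration (the released checkpoint is access-controlled), not part of the paper itself.

```python
# Minimal sketch: probing a RoBERTa-style masked LM such as DarkBERT with
# Hugging Face transformers. The model identifier below is an assumption
# (the public release is gated); substitute a local checkpoint path if needed.
from transformers import pipeline

fill = pipeline("fill-mask", model="s2w-ai/DarkBERT")  # assumed Hub identifier

# RoBERTa tokenizers use "<mask>" as the mask token.
for pred in fill("The forum thread advertises <mask> for sale."):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
```

A fill-mask probe like this serves as a quick sanity check of domain adaptation: a model pretrained on Dark Web text should assign higher probability to domain-typical completions than a Surface Web model such as vanilla BERT or RoBERTa would on the same prompt.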