To Bias or Not to Bias: Detecting bias in News with bias-detector
May 19, 2025
Authors: Himel Ghosh, Ahmed Mosharafa, Georg Groh
cs.AI
Abstract
Media bias detection is a critical task in ensuring fair and balanced
information dissemination, yet it remains challenging due to the subjectivity
of bias and the scarcity of high-quality annotated data. In this work, we
perform sentence-level bias classification by fine-tuning a RoBERTa-based model
on the expert-annotated BABE dataset. Using McNemar's test and the 5x2
cross-validation paired t-test, we show statistically significant improvements
in performance when comparing our model to a domain-adaptively pre-trained
DA-RoBERTa baseline. Furthermore, attention-based analysis shows that our model
avoids common pitfalls like oversensitivity to politically charged terms and
instead attends more meaningfully to contextually relevant tokens. For a
comprehensive examination of media bias, we present a pipeline that combines
our model with an existing bias-type classifier. Despite being constrained to
sentence-level analysis and limited by dataset size, owing to the lack of
larger and more advanced bias corpora, our method exhibits good generalization
and interpretability. We discuss context-aware modeling, bias neutralization,
and advanced bias-type classification as potential future directions. Our
findings contribute to building more robust, explainable, and socially
responsible NLP systems for media bias detection.
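The significance claim rests on McNemar's test over the two models' paired per-sentence predictions. As a minimal sketch of the exact (binomial) form of that test, here is a self-contained implementation; the function name and the disagreement counts below are illustrative, not the paper's actual figures:

```python
import math

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on paired predictions.

    b: sentences the baseline classifies correctly but our model misses.
    c: sentences the baseline misses but our model classifies correctly.
    Under H0 (no difference between models), the b+c disagreements
    split as Binomial(b+c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    # Two-sided p-value: double the smaller binomial tail, capped at 1.
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1))
    return min(2 * tail / 2 ** n, 1.0)

# Hypothetical disagreement counts: of 35 sentences on which the two
# models disagree, the fine-tuned model is correct on 30.
p = mcnemar_exact(5, 30)
print(f"p = {p:.2e}")
```

A p-value below the usual 0.05 threshold here would indicate that the fine-tuned model's advantage over the DA-RoBERTa baseline is unlikely to arise from chance disagreements alone.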