Hatevolution: What Static Benchmarks Don't Tell Us
June 13, 2025
Authors: Chiara Di Bonaventura, Barbara McGillivray, Yulan He, Albert Meroño-Peñuela
cs.AI
Abstract
Language changes over time, including in the hate speech domain, which
evolves quickly following social dynamics and cultural shifts. While NLP
research has investigated the impact of language evolution on model training
and has proposed several solutions for it, its impact on model benchmarking
remains under-explored. Yet, hate speech benchmarks play a crucial role in
ensuring model safety. In this paper, we empirically evaluate the robustness of
20 language models across two evolving hate speech experiments, and we show the
temporal misalignment between static and time-sensitive evaluations. Our
findings call for time-sensitive linguistic benchmarks in order to correctly
and reliably evaluate language models in the hate speech domain.