

BiasEdit: Debiasing Stereotyped Language Models via Model Editing

March 11, 2025
Authors: Xin Xu, Wei Xu, Ningyu Zhang, Julian McAuley
cs.AI

Abstract

Previous studies have established that language models manifest stereotyped biases. Existing debiasing strategies, such as retraining a model with counterfactual data, representation projection, and prompting, often fail to efficiently eliminate bias or to directly alter the models' biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method that removes stereotypical bias from language models through lightweight networks acting as editors to generate parameter updates. BiasEdit employs a debiasing loss that guides the editor networks to conduct local edits on a subset of a language model's parameters, while a retention loss preserves the model's language modeling abilities during editing. Experiments on StereoSet and CrowS-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to tangential debiasing baselines, with little to no impact on the language models' general capabilities. In addition, we conduct bias tracing to probe bias in various modules and explore the impact of bias editing on different components of language models.
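The abstract describes a two-part training objective: a debiasing loss that equalizes the model's treatment of stereotypical and anti-stereotypical sentence pairs, and a retention loss that keeps the edited model's behavior on neutral text close to the pre-edit model. A minimal sketch of how such a combined objective could look is below; the concrete loss shapes (a squared log-ratio for debiasing, a KL-style term for retention) and the weighting parameter `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import math

def debias_loss(p_stereo: float, p_anti: float) -> float:
    """Penalize the probability gap between a stereotypical and an
    anti-stereotypical continuation (squared log-ratio, assumed form)."""
    return (math.log(p_stereo) - math.log(p_anti)) ** 2

def retention_loss(p_edited: list[float], p_original: list[float]) -> float:
    """KL-style penalty keeping the edited model's next-token
    distribution on neutral text close to the pre-edit model's."""
    return sum(po * math.log(po / pe) for po, pe in zip(p_original, p_edited))

def biasedit_objective(p_stereo, p_anti, p_edited, p_original, lam=1.0):
    """Combined objective: remove the stereotype gap while
    retaining language modeling behavior (lam balances the two)."""
    return debias_loss(p_stereo, p_anti) + lam * retention_loss(p_edited, p_original)

# An edit that fully equalizes the pair and leaves neutral
# predictions unchanged incurs zero loss:
print(biasedit_objective(0.5, 0.5, [0.3, 0.7], [0.3, 0.7]))  # → 0.0
```

In the actual method, the editor networks would produce parameter updates that minimize this objective over a dataset of sentence pairs; the sketch only shows how the two loss terms trade off for a single pair.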

