推理模型中的隱性偏見模式
Implicit Bias-Like Patterns in Reasoning Models
March 14, 2025
作者: Messi H. J. Lee, Calvin K. Lai
cs.AI
摘要
隱性偏見指的是自動或自發的心理過程,這些過程塑造了感知、判斷和行為。以往研究大型語言模型(LLMs)中的「隱性偏見」時,通常與人類研究中的方法不同,主要關注模型輸出而非模型處理過程。為了探究模型處理過程,我們提出了一種名為推理模型隱性關聯測試(RM-IAT)的方法,用於研究推理模型中的隱性偏見樣式:這些LLMs通過逐步推理來解決複雜任務。運用此方法,我們發現推理模型在處理關聯不相容信息時,比處理關聯相容信息需要更多的標記。這些發現表明,AI系統在處理信息時存在與人類隱性偏見相似的樣式。我們探討了這些隱性偏見樣式在實際應用部署中的影響。
English
Implicit bias refers to automatic or spontaneous mental processes that shape
perceptions, judgments, and behaviors. Previous research examining `implicit
bias' in large language models (LLMs) has often approached the phenomenon
differently than how it is studied in humans by focusing primarily on model
outputs rather than on model processing. To examine model processing, we
present a method called the Reasoning Model Implicit Association Test (RM-IAT)
for studying implicit bias-like patterns in reasoning models: LLMs that employ
step-by-step reasoning to solve complex tasks. Using this method, we find that
reasoning models require more tokens when processing association-incompatible
information compared to association-compatible information. These findings
suggest AI systems harbor patterns in processing information that are analogous
to human implicit bias. We consider the implications of these implicit
bias-like patterns for their deployment in real-world applications.Summary
AI-Generated Summary