Implicit Bias-Like Patterns in Reasoning Models
March 14, 2025
Authors: Messi H. J. Lee, Calvin K. Lai
cs.AI
Abstract
Implicit bias refers to automatic or spontaneous mental processes that shape perceptions, judgments, and behaviors. Previous research examining 'implicit bias' in large language models (LLMs) has often approached the phenomenon differently from how it is studied in humans, focusing primarily on model outputs rather than on model processing. To examine model processing, we present a method called the Reasoning Model Implicit Association Test (RM-IAT) for studying implicit bias-like patterns in reasoning models: LLMs that employ step-by-step reasoning to solve complex tasks. Using this method, we find that reasoning models require more tokens when processing association-incompatible information than when processing association-compatible information. These findings suggest that AI systems harbor patterns of information processing analogous to human implicit bias. We consider the implications of these implicit bias-like patterns for the deployment of reasoning models in real-world applications.
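For readers who want to experiment with the idea, the sketch below shows how an RM-IAT-style comparison might be run. It is a minimal illustration, not the paper's actual protocol: it assumes the OpenAI Python SDK and a reasoning model whose usage metadata reports a reasoning-token count, and the helper names, model name, prompt wording, and word lists are all placeholder assumptions.

```python
# Minimal RM-IAT-style sketch (illustrative; not the paper's exact protocol).
# Assumes a provider whose usage metadata exposes reasoning-token counts,
# here via the OpenAI Python SDK's completion_tokens_details field.
from statistics import mean
from openai import OpenAI

client = OpenAI()

def count_reasoning_tokens(prompt: str, model: str = "o3-mini") -> int:
    """Send one prompt and return the number of reasoning tokens used."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.usage.completion_tokens_details.reasoning_tokens

def pairing_prompt(target: str, attribute: str) -> str:
    # IAT-style task: ask the model to pair a target concept with an attribute.
    return (f"Write one sentence that pairs the concept '{target}' "
            f"with the attribute '{attribute}'.")

def mean_reasoning_tokens(pairs):
    # Average reasoning-token usage across all pairings in one condition.
    return mean(count_reasoning_tokens(pairing_prompt(t, a)) for t, a in pairs)

# Placeholder stimuli; a real study would use validated IAT word lists
# and many repetitions per condition.
compatible = [("flowers", "pleasant"), ("insects", "unpleasant")]
incompatible = [("flowers", "unpleasant"), ("insects", "pleasant")]

print("compatible:", mean_reasoning_tokens(compatible))
print("incompatible:", mean_reasoning_tokens(incompatible))
```

Under the paper's finding, the association-incompatible condition would show a higher mean reasoning-token count than the association-compatible one.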