推理模型中的隱性偏見模式

摘要

隱性偏見指的是自動或自發的心理過程，這些過程塑造了感知、判斷和行為。以往研究大型語言模型（LLMs）中的「隱性偏見」時，通常與人類研究中的方法不同，主要關注模型輸出而非模型處理過程。為了探究模型處理過程，我們提出了一種名為推理模型隱性關聯測試（RM-IAT）的方法，用於研究推理模型中的隱性偏見樣式：這些LLMs通過逐步推理來解決複雜任務。運用此方法，我們發現推理模型在處理關聯不相容信息時，比處理關聯相容信息需要更多的標記。這些發現表明，AI系統在處理信息時存在與人類隱性偏見相似的樣式。我們探討了這些隱性偏見樣式在實際應用部署中的影響。

English

Implicit bias refers to automatic or spontaneous mental processes that shape perceptions, judgments, and behaviors. Previous research examining `implicit bias' in large language models (LLMs) has often approached the phenomenon differently than how it is studied in humans by focusing primarily on model outputs rather than on model processing. To examine model processing, we present a method called the Reasoning Model Implicit Association Test (RM-IAT) for studying implicit bias-like patterns in reasoning models: LLMs that employ step-by-step reasoning to solve complex tasks. Using this method, we find that reasoning models require more tokens when processing association-incompatible information compared to association-compatible information. These findings suggest AI systems harbor patterns in processing information that are analogous to human implicit bias. We consider the implications of these implicit bias-like patterns for their deployment in real-world applications.

推理模型中的隱性偏見模式

Implicit Bias-Like Patterns in Reasoning Models

摘要

Support