Alignment Makes Language Models Normative, Not Descriptive
March 17, 2026
Authors: Eilam Shapira, Moshe Tennenholtz, Roi Reichart
cs.AI
Abstract
Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, a result that holds robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on all 12 types of one-shot textbook games tested, on non-strategic lottery choices, and even within the multi-round games themselves at round one, before any interaction history has developed. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
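The comparison the abstract describes, checking whether a base model or its aligned counterpart better predicts observed human choices, can be illustrated with a short sketch. This is not the authors' code: the model names, prompt format, and likelihood-based scoring rule below are illustrative assumptions only.

```python
# Minimal sketch (assumptions, not the paper's method): score each candidate
# action by the log-likelihood a model assigns to it, and check whether the
# observed human choice is ranked first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def choice_logprob(model, tokenizer, prompt: str, choice: str) -> float:
    """Sum of token log-probabilities the model assigns to `choice` after `prompt`.

    Simplification: assumes tokenizing `prompt + choice` yields the tokens of
    `prompt` as an exact prefix, which can fail at word boundaries for some
    tokenizers.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probabilities for predicting the token at position t+1 from position t
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        log_probs[pos - 1, full_ids[0, pos]].item()
        for pos in range(prompt_len, full_ids.shape[1])
    )


def predicted_choice(model, tokenizer, prompt: str, options: list[str]) -> str:
    """Return the option the model assigns the highest log-likelihood."""
    return max(options, key=lambda o: choice_logprob(model, tokenizer, prompt, o))


# Hypothetical base/aligned checkpoint pair; any such pair would do.
base_name = "meta-llama/Llama-3.1-8B"
aligned_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(base_name)  # assumes a shared tokenizer
base = AutoModelForCausalLM.from_pretrained(base_name)
aligned = AutoModelForCausalLM.from_pretrained(aligned_name)

# `decisions` stands in for the human-decision data: each record holds the
# game transcript so far, the legal actions, and the action the human took.
decisions = [
    {"prompt": "Round 3 of an ultimatum game. History: ...\nThe responder will: ",
     "options": ["accept", "reject"], "human": "reject"},
]

base_hits = sum(predicted_choice(base, tok, d["prompt"], d["options"]) == d["human"]
                for d in decisions)
aligned_hits = sum(predicted_choice(aligned, tok, d["prompt"], d["options"]) == d["human"]
                   for d in decisions)
print(f"base accuracy: {base_hits / len(decisions):.2f}, "
      f"aligned accuracy: {aligned_hits / len(decisions):.2f}")
```

Comparing paired checkpoints this way isolates the effect of the alignment stage, since each pair shares its pretraining data and architecture; only the post-training differs.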