Alignment Makes Language Models Normative, Not Descriptive
March 17, 2026
Authors: Eilam Shapira, Moshe Tennenholtz, Roi Reichart
cs.AI
Abstract
Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, a result that holds robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on all 12 types of one-shot textbook games tested, on non-strategic lottery choices, and even within the multi-round games themselves at round one, before any interaction history has developed. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
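The comparison the abstract describes, checking whether a base model or its aligned counterpart better predicts observed human choices, can be illustrated with a short sketch. This is not the authors' code: the model names, prompt format, and likelihood-based scoring rule below are illustrative assumptions only.

```python
# Minimal sketch (assumptions, not the paper's method): score each candidate
# action by the log-likelihood a model assigns to it, and check whether the
# observed human choice is ranked first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def choice_logprob(model, tokenizer, prompt: str, choice: str) -> float:
    """Sum of token log-probabilities the model assigns to `choice` after `prompt`.

    Simplification: assumes tokenizing `prompt + choice` yields the tokens of
    `prompt` as an exact prefix, which can fail at word boundaries for some
    tokenizers.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probabilities for predicting the token at position t+1 from position t
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        log_probs[pos - 1, full_ids[0, pos]].item()
        for pos in range(prompt_len, full_ids.shape[1])
    )


def predicted_choice(model, tokenizer, prompt: str, options: list[str]) -> str:
    """Return the option the model assigns the highest log-likelihood."""
    return max(options, key=lambda o: choice_logprob(model, tokenizer, prompt, o))


# Hypothetical base/aligned checkpoint pair; any such pair would do.
base_name = "meta-llama/Llama-3.1-8B"
aligned_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(base_name)  # assumes a shared tokenizer
base = AutoModelForCausalLM.from_pretrained(base_name)
aligned = AutoModelForCausalLM.from_pretrained(aligned_name)

# `decisions` stands in for the human-decision data: each record holds the
# game transcript so far, the legal actions, and the action the human took.
decisions = [
    {"prompt": "Round 3 of an ultimatum game. History: ...\nThe responder will: ",
     "options": ["accept", "reject"], "human": "reject"},
]

base_hits = sum(predicted_choice(base, tok, d["prompt"], d["options"]) == d["human"]
                for d in decisions)
aligned_hits = sum(predicted_choice(aligned, tok, d["prompt"], d["options"]) == d["human"]
                   for d in decisions)
print(f"base accuracy: {base_hits / len(decisions):.2f}, "
      f"aligned accuracy: {aligned_hits / len(decisions):.2f}")
```

Comparing paired checkpoints this way isolates the effect of the alignment stage, since each pair shares its pretraining data and architecture; only the post-training differs.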