Alignment Makes Language Models Normative, Not Descriptive
March 17, 2026
Authors: Eilam Shapira, Moshe Tennenholtz, Roi Reichart
cs.AI
Abstract
Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, a result that is robust across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested, on non-strategic lottery choices, and even within the multi-round games themselves at round one, before an interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
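To make the comparison concrete, here is a minimal sketch of how a base/aligned pair could be scored against observed human decisions. The abstract does not specify the paper's scoring procedure; this sketch assumes each human choice is scored by the log-likelihood the model assigns to it given the game transcript so far, and the model names, prompt format, and `decisions` structure are hypothetical placeholders.

```python
# Hypothetical sketch (not the paper's code): score each observed human choice
# under a base model and its aligned counterpart, then count which model
# assigns the higher likelihood. The log-likelihood metric is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def choice_logprob(model, tokenizer, context: str, choice: str) -> float:
    """Sum of token log-probabilities the model assigns to `choice` given `context`.

    Simplification: assumes the tokenization of `context` is a prefix of the
    tokenization of `context + choice`.
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # The logits at position t predict token t+1; score only the choice tokens.
    for t in range(ctx_ids.shape[1] - 1, full_ids.shape[1] - 1):
        total += log_probs[0, t, full_ids[0, t + 1]].item()
    return total

def compare_pair(base_name: str, aligned_name: str, decisions) -> dict:
    """Count how often each model better predicts the observed human choice.

    `decisions` is an iterable of (game_context, human_choice) strings, e.g. a
    multi-round bargaining transcript up to the decision point and the action
    the human actually took.
    """
    tok = AutoTokenizer.from_pretrained(base_name)  # base/aligned pairs typically share a tokenizer
    base = AutoModelForCausalLM.from_pretrained(base_name)
    aligned = AutoModelForCausalLM.from_pretrained(aligned_name)
    wins = {"base": 0, "aligned": 0}
    for context, choice in decisions:
        lp_base = choice_logprob(base, tok, context, choice)
        lp_aligned = choice_logprob(aligned, tok, context, choice)
        wins["base" if lp_base > lp_aligned else "aligned"] += 1
    return wins
```

Under the abstract's headline result, running such a comparison over the multi-round game decisions would favor the base model in nearly 10 of every 11 pairs, while restricting `decisions` to one-shot games or round-one choices would favor the aligned model.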