Alignment Makes Language Models Normative, Not Descriptive

Abstract

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 pairs of base and aligned models on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by a ratio of nearly 10:1, and the result is robust across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions. Aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices. They also dominate within the multi-round games themselves at round one, before any interaction history has developed. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
