Alignment Makes Language Models Normative, Not Descriptive

Abstract

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 pairs of base and aligned models on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, and the result is robust across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions. Aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices. They also dominate within the multi-round games themselves at round one, before any interaction history has developed. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
