

Training Socially Aligned Language Models in Simulated Human Society

May 26, 2023
Authors: Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi
cs.AI

Abstract

Social alignment in AI systems aims to ensure that these models behave according to established societal values. However, unlike humans, who derive consensus on value judgments through social interaction, current language models (LMs) are trained to rigidly replicate their training corpus in isolation, leading to subpar generalization in unfamiliar scenarios and vulnerability to adversarial attacks. This work presents a novel training paradigm that permits LMs to learn from simulated social interactions. In comparison to existing methodologies, our approach is considerably more scalable and efficient, demonstrating superior performance in alignment benchmarks and human evaluations. This paradigm shift in the training of LMs brings us a step closer to developing AI systems that can robustly and accurately reflect societal norms and values.
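The abstract describes the paradigm only at a high level. As a rough illustration of what "learning from simulated social interactions" could look like, here is a minimal, self-contained Python sketch: toy agents draft answers, peers rate each draft against shared values, and the speaker revises in response; the resulting rated records are the kind of data that could supervise alignment fine-tuning. All names here (`SimulatedAgent`, `simulate_society`, the rating and revision logic) are hypothetical placeholders, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
import random

@dataclass
class Record:
    """One unit of social supervision collected from the simulation."""
    prompt: str
    draft: str       # the agent's initial answer (a potential negative example)
    revision: str    # the answer after peer feedback (a potential positive example)
    peer_ratings: list[float] = field(default_factory=list)

class SimulatedAgent:
    """Toy stand-in for an LM agent; a real run would call a language model."""
    def __init__(self, name: str):
        self.name = name

    def respond(self, prompt: str) -> str:
        # Placeholder generation step.
        return f"[{self.name}] initial answer to: {prompt}"

    def rate(self, response: str) -> float:
        # Placeholder social feedback: a real peer would score how well the
        # response matches shared norms (helpfulness, harmlessness, honesty).
        return random.uniform(0.0, 1.0)

    def revise(self, prompt: str, draft: str, ratings: list[float]) -> str:
        # Placeholder revision conditioned on peer feedback.
        mean = sum(ratings) / len(ratings)
        return draft + f" (revised after peer feedback, mean rating {mean:.2f})"

def simulate_society(prompts: list[str], agents: list[SimulatedAgent]) -> list[Record]:
    """Have one agent answer each prompt, get rated by peers, and revise."""
    records = []
    for prompt in prompts:
        speaker, *peers = random.sample(agents, k=len(agents))
        draft = speaker.respond(prompt)
        ratings = [peer.rate(draft) for peer in peers]
        revision = speaker.revise(prompt, draft, ratings)
        records.append(Record(prompt, draft, revision, ratings))
    return records

if __name__ == "__main__":
    society = [SimulatedAgent(name) for name in ("alice", "bob", "carol")]
    data = simulate_society(["Is it okay to share a friend's secret?"], society)
    for record in data:
        # Highly rated revisions could supply positive fine-tuning targets,
        # while low-rated drafts could serve as contrastive negatives.
        print(record.prompt, "->", record.revision, record.peer_ratings)
```

The key difference from standard corpus replication, per the abstract, is that the supervision signal here emerges from interaction and peer feedback rather than from static text alone.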