
Training Socially Aligned Language Models in Simulated Human Society

May 26, 2023
Authors: Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi
cs.AI

Abstract

Social alignment in AI systems aims to ensure that these models behave according to established societal values. However, unlike humans, who derive consensus on value judgments through social interaction, current language models (LMs) are trained to rigidly replicate their training corpus in isolation, leading to subpar generalization in unfamiliar scenarios and vulnerability to adversarial attacks. This work presents a novel training paradigm that permits LMs to learn from simulated social interactions. In comparison to existing methodologies, our approach is considerably more scalable and efficient, demonstrating superior performance in alignment benchmarks and human evaluations. This paradigm shift in the training of LMs brings us a step closer to developing AI systems that can robustly and accurately reflect societal norms and values.
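To make the "learning from simulated social interactions" idea concrete, here is a minimal, hypothetical sketch of such a training-data pipeline: a language model drafts several responses, simulated peer agents rate each draft for social acceptability, and the rated drafts are turned into preference pairs for alignment fine-tuning. All names (`simulate_society`, `build_alignment_pairs`, the rating scheme, and the stub models) are illustrative assumptions, not the paper's actual method or API.

```python
# Hypothetical sketch of collecting alignment data from a toy
# "simulated society". This is an illustration of the general idea
# described in the abstract, not the authors' implementation.

import random
from dataclasses import dataclass


@dataclass
class Interaction:
    prompt: str
    response: str
    peer_ratings: list  # alignment scores from simulated peer agents


def simulate_society(lm_generate, peers, prompts, n_drafts=4):
    """Generate multiple drafts per prompt and have simulated peers rate each."""
    records = []
    for prompt in prompts:
        for _ in range(n_drafts):
            response = lm_generate(prompt)
            ratings = [peer(prompt, response) for peer in peers]
            records.append(Interaction(prompt, response, ratings))
    return records


def build_alignment_pairs(records):
    """Turn rated interactions into (better, worse) pairs, which could then
    feed a preference-based fine-tuning objective."""
    by_prompt = {}
    for r in records:
        by_prompt.setdefault(r.prompt, []).append(r)
    pairs = []
    for group in by_prompt.values():
        group.sort(key=lambda r: sum(r.peer_ratings) / len(r.peer_ratings),
                   reverse=True)
        if len(group) >= 2:
            pairs.append((group[0], group[-1]))  # best vs. worst draft
    return pairs


if __name__ == "__main__":
    # Stub model and peers standing in for real LMs.
    lm = lambda p: random.choice(["a considerate answer", "a dismissive answer"])
    peers = [lambda p, r: 1.0 if "considerate" in r else 0.0 for _ in range(3)]
    data = simulate_society(lm, peers, ["How should I apologize to a friend?"])
    print(build_alignment_pairs(data))
```

The key design point the sketch tries to capture is that the supervision signal comes from many simulated social judgments rather than from static corpus imitation, which is what the abstract credits for better generalization and robustness.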