Training Socially Aligned Language Models in Simulated Human Society

Abstract

Social alignment in AI systems aims to ensure that these models behave according to established societal values. However, unlike humans, who derive consensus on value judgments through social interaction, current language models (LMs) are trained to rigidly replicate their training corpus in isolation, leading to subpar generalization in unfamiliar scenarios and vulnerability to adversarial attacks. This work presents a novel training paradigm that permits LMs to learn from simulated social interactions. In comparison to existing methodologies, our approach is considerably more scalable and efficient, demonstrating superior performance in alignment benchmarks and human evaluations. This paradigm shift in the training of LMs brings us a step closer to developing AI systems that can robustly and accurately reflect societal norms and values.
