헌법적 AI를 위한 특수 원칙 대 일반 원칙

초록

인간 피드백은 대화형 모델에서 지나치게 유해한 발화를 방지할 수 있지만, 자기 보존이나 권력에 대한 명시적 욕구와 같은 미묘한 문제 행동을 자동으로 완화하지는 못할 수 있습니다. 헌법적 AI(Constitutional AI)는 이러한 대안을 제공하며, 인간 피드백을 서면 원칙 목록에 기반한 AI 모델의 피드백으로 대체합니다. 우리는 이 접근 방식이 그러한 행동의 표현을 효과적으로 방지한다는 것을 발견했습니다. 단순한 원칙의 성공은 우리에게 다음과 같은 질문을 하게 합니다: 모델이 단일 서면 원칙만으로 일반적인 윤리적 행동을 배울 수 있을까요? 이를 테스트하기 위해, 우리는 "인류를 위해 최선을 다하라"라는 대략적인 원칙을 사용하여 실험을 실행했습니다. 우리는 가장 큰 대화 모델이 이 짧은 헌법에서 일반화할 수 있으며, 권력과 같은 특정 동기에 대한 명시적 관심 없이 무해한 도우미를 만들어낸다는 것을 발견했습니다. 따라서 일반적인 원칙은 잠재적으로 유해한 행동을 대상으로 하는 긴 헌법 목록의 필요성을 부분적으로 피할 수 있습니다. 그러나 더 상세한 헌법은 여전히 특정 유형의 해악에 대한 세밀한 통제를 개선합니다. 이는 일반적이고 구체적인 원칙 모두가 AI를 안전하게 조종하는 데 가치가 있음을 시사합니다.

English

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.

헌법적 AI를 위한 특수 원칙 대 일반 원칙

Specific versus General Principles for Constitutional AI

초록

Support