Specific versus General Principles for Constitutional AI

Abstract

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.
