憲法AI的具體原則與一般原則

摘要

人類的回饋可以防止對話模型中明顯有害的言論，但可能無法自動化地緩解一些微妙的問題行為，例如對自我保存或權力的表達渴望。憲法 AI 提供了一種替代方案，將人類的回饋替換為僅基於一系列書面原則條件訓練的 AI 模型的回饋。我們發現這種方法有效地防止了這些行為的表達。簡單原則的成功激勵我們思考：模型是否能僅從單一書面原則中學習一般的道德行為？為了測試這一點，我們運行了一些實驗，使用一個大致陳述為「為人類做最好的事」的原則。我們發現最大的對話模型可以從這部簡短憲法中歸納出來，產生出無害的助理，並且沒有對權力等特定動機的表達興趣。一般原則因此可能在一定程度上避免了針對潛在有害行為的冗長憲法清單的需求。然而，更詳細的憲法仍然可以提高對特定類型傷害的細粒度控制。這表明一般和具體原則對安全引導 AI 都有價值。

English

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.

憲法AI的具體原則與一般原則

Specific versus General Principles for Constitutional AI

摘要

Support