神經網絡是否對抗性對齊？

摘要

目前大型語言模型已調整以符合其創作者的目標，即「有幫助且無害」。這些模型應該對用戶的問題給予有益回應，但拒絕回答可能導致危害的請求。然而，對手用戶可以構建繞過對齊嘗試的輸入。在這項工作中，我們研究這些模型在與構建最壞情況輸入（對抗性示例）的對手用戶互動時，保持對齊的程度。這些輸入旨在導致模型發出本應被禁止的有害內容。我們展示現有基於自然語言處理的優化攻擊不足以可靠地攻擊對齊的文本模型：即使當前基於自然語言處理的攻擊失敗時，我們可以通過蠻力找到對抗性輸入。因此，當前攻擊的失敗不應被視為對齊的文本模型在對抗性輸入下仍保持對齊的證據。然而，大規模機器學習模型的最新趨勢是多模態模型，允許用戶提供影像來影響生成的文本。我們展示這些模型可以被輕易攻擊，即通過對輸入影像進行對抗性干擾，誘使其執行任意不對齊的行為。我們推測，改進的自然語言處理攻擊可能展示出對僅文本模型的同等對抗性控制水平。

English

Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study to what extent these models remain aligned, even when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.

神經網絡是否對抗性對齊？

Are aligned neural networks adversarially aligned?

摘要

Support