
Cross-Modality Safety Alignment

June 21, 2024
Authors: Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang
cs.AI

Abstract

As Artificial General Intelligence (AGI) becomes increasingly integrated into various facets of human life, ensuring the safety and ethical alignment of such systems is paramount. Previous studies primarily focus on single-modality threats, which may not suffice given the integrated and complex nature of cross-modality interactions. We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where single modalities are safe independently but could potentially lead to unsafe or unethical outputs when combined. To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, such as GPT-4V and LLaVA, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.