DeepSight: An All-in-One LM Safety Toolkit
February 12, 2026
Authors: Bo Zhang, Jiaxuan Guo, Lijun Li, Dongrui Liu, Sujin Chen, Guanxu Chen, Zhijie Zheng, Qihao Lin, Lewen Yan, Chen Qian, Yijin Zhou, Yuyao Wu, Shaoxiong Guo, Tianyi Du, Jingyi Yang, Xuhao Hu, Ziqi Miao, Xiaoya Lu, Jing Shao, Xia Hu
cs.AI
Abstract
As Large Models (LMs) develop rapidly, their safety has become a growing concern. In the current safety workflow for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot uncover their internal root causes, while safety diagnosis often drifts away from concrete risk scenarios and remains at the level of interpretability analysis. As a result, safety alignment lacks dedicated explanations of the changes it induces in internal mechanisms, which can degrade general capabilities. To address these issues systematically, we propose DeepSight, an open-source project that puts a new integrated evaluation-diagnosis safety paradigm into practice. DeepSight is a low-cost, reproducible, efficient, and highly scalable LM safety project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we connect the two stages and turn safety evaluation from black-box observation into white-box insight. Moreover, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and combines safety evaluation with diagnosis.
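To make the idea of unified task and data protocols concrete, here is a minimal, hypothetical sketch; it is not DeepSight's actual API, and all names (SafetySample, evaluate, diagnose, and the generate/judge/probe callables) are illustrative assumptions. It shows how a single shared record could be passed from an evaluation stage (as in DeepSafe) to a diagnosis stage (as in DeepScan), so that white-box findings stay attached to the concrete black-box risk they explain.

```python
from dataclasses import dataclass, field
from typing import Optional, Callable

@dataclass
class SafetySample:
    """Hypothetical prompt-level record shared by the evaluation and diagnosis stages."""
    sample_id: str
    task: str                          # e.g. "jailbreak", "frontier_risk"
    prompt: str
    response: Optional[str] = None     # filled in by the evaluation stage
    risk_label: Optional[str] = None   # black-box verdict, e.g. "unsafe"
    diagnosis: dict = field(default_factory=dict)  # white-box findings (e.g. attributions)

def evaluate(sample: SafetySample, generate: Callable, judge: Callable) -> SafetySample:
    # Evaluation stage: generate the model's behavior and record a risk verdict.
    sample.response = generate(sample.prompt)
    sample.risk_label = judge(sample.prompt, sample.response)
    return sample

def diagnose(sample: SafetySample, probe: Callable) -> SafetySample:
    # Diagnosis stage: reuse the same record, so internal findings remain tied
    # to the specific risk scenario that triggered them.
    if sample.risk_label == "unsafe":
        sample.diagnosis["attribution"] = probe(sample.prompt, sample.response)
    return sample
```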