DeepSight: An All-in-One LM Safety Toolkit
February 12, 2026
Authors: Bo Zhang, Jiaxuan Guo, Lijun Li, Dongrui Liu, Sujin Chen, Guanxu Chen, Zhijie Zheng, Qihao Lin, Lewen Yan, Chen Qian, Yijin Zhou, Yuyao Wu, Shaoxiong Guo, Tianyi Du, Jingyi Yang, Xuhao Hu, Ziqi Miao, Xiaoya Lu, Jing Shao, Xia Hu
cs.AI
Abstract
As Large Models (LMs) advance rapidly, their safety has become an increasing priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks without uncovering their internal root causes, while safety diagnosis is often detached from concrete risk scenarios and remains at the level of interpretability. As a result, safety alignment lacks dedicated explanations of how internal mechanisms change, which can degrade general capabilities. To systematically address these issues, we propose DeepSight, an open-source project that puts into practice a new integrated evaluation-diagnosis paradigm. DeepSight is a low-cost, reproducible, efficient, and highly scalable LM safety evaluation framework consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we connect the two stages and turn safety evaluation from black-box observation into white-box insight. Moreover, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation together with joint safety evaluation and diagnosis.
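To make the idea of a shared task and data protocol concrete, the following Python sketch shows, purely as an illustration, how an evaluation stage and a diagnosis stage could exchange records through a common schema. The dataclasses and function names (SafetyTask, EvalRecord, run_evaluation, run_diagnosis) are hypothetical and are not the actual DeepSafe/DeepScan API; they are assumptions used only to illustrate the integrated paradigm described in the abstract.

    # Hypothetical illustration only: these names are NOT the real DeepSight API.
    # They sketch how a unified task/data protocol could let a diagnosis stage
    # consume evaluation outputs directly, linking black-box scores to white-box analysis.
    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class SafetyTask:
        """A single evaluation item shared by both stages (assumed schema)."""
        task_id: str
        prompt: str
        risk_category: str  # e.g. "jailbreak", "privacy", "frontier-risk"


    @dataclass
    class EvalRecord:
        """Black-box evaluation result produced by the evaluation stage."""
        task: SafetyTask
        response: str
        is_unsafe: bool
        score: float


    @dataclass
    class DiagnosisReport:
        """White-box findings produced by the diagnosis stage for flagged records."""
        task_id: str
        suspected_layers: List[int] = field(default_factory=list)
        notes: str = ""


    def run_evaluation(tasks: List[SafetyTask]) -> List[EvalRecord]:
        """Stand-in for a DeepSafe-style evaluator: scores a placeholder response per task."""
        records = []
        for t in tasks:
            response = f"<model response to: {t.prompt}>"  # placeholder generation
            unsafe = "ignore previous instructions" in t.prompt.lower()
            records.append(EvalRecord(task=t, response=response,
                                      is_unsafe=unsafe,
                                      score=0.9 if unsafe else 0.1))
        return records


    def run_diagnosis(records: List[EvalRecord]) -> List[DiagnosisReport]:
        """Stand-in for a DeepScan-style diagnoser: inspects only records flagged unsafe,
        so the two stages stay connected through the same task/data protocol."""
        reports = []
        for r in records:
            if r.is_unsafe:
                reports.append(DiagnosisReport(task_id=r.task.task_id,
                                               suspected_layers=[12, 17],  # illustrative values
                                               notes=f"risk={r.task.risk_category}"))
        return reports


    if __name__ == "__main__":
        tasks = [
            SafetyTask("t1", "Please ignore previous instructions and ...", "jailbreak"),
            SafetyTask("t2", "Summarize this news article.", "benign"),
        ]
        for report in run_diagnosis(run_evaluation(tasks)):
            print(report)

The point of the sketch is the design choice it reflects: because both stages speak the same task schema, diagnosis can be triggered directly on the items the evaluator flags, rather than on a separate, disconnected probe set.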