A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
January 15, 2026
Authors: Xingjun Ma, Yixu Wang, Hengyuan Xu, Yutao Wu, Yifan Ding, Yunhan Zhao, Zilong Wang, Jiabin Hua, Ming Wen, Jianan Liu, Ranjie Duan, Yifeng Gao, Yingshui Tan, Yunhao Chen, Hui Xue, Xin Wang, Wei Cheng, Jingjing Chen, Zuxuan Wu, Bo Li, Yu-Gang Jiang
cs.AI
Abstract
The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has produced substantial gains in reasoning, perception, and generative capability across language and vision. However, whether these advances yield commensurate improvements in safety remains unclear, in part because existing evaluation practices are fragmented, limited to single modalities or threat models. In this report, we present an integrated safety evaluation of seven frontier models: GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5. We evaluate each model across language, vision-language, and image generation settings using a unified protocol that combines benchmark, adversarial, multilingual, and compliance evaluation. Aggregating the results into safety leaderboards and model safety profiles across multiple evaluation modes reveals a sharply heterogeneous safety landscape. While GPT-5.2 demonstrates consistently strong and balanced safety performance across evaluations, other models exhibit pronounced trade-offs among benchmark safety, adversarial alignment, multilingual generalization, and regulatory compliance. Both language and vision-language modalities show significant vulnerability under adversarial evaluation, with all models degrading substantially despite strong results on standard benchmarks. Text-to-image models achieve relatively stronger alignment in regulated visual risk categories, yet remain brittle under adversarial or semantically ambiguous prompts. Overall, these results show that safety in frontier models is inherently multidimensional, shaped by modality, language, and evaluation scheme, and they underscore the need for standardized safety evaluations to accurately assess real-world risk and guide responsible model development and deployment.