

NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

February 18, 2026
作者: Johannes Bertram, Jonas Geiping
cs.AI

Abstract

We introduce NESSiE, the NEceSsary SafEty benchmark for large language models (LLMs). With minimal test cases of information and access security, NESSiE reveals safety-relevant failures that should not exist, given the low complexity of the tasks. NESSiE is intended as a lightweight, easy-to-use sanity check for language model safety and, as such, is not sufficient for guaranteeing safety in general; we argue, however, that passing this test is necessary for any deployment. Yet even state-of-the-art LLMs do not reach 100% on NESSiE and thus fail our necessary condition of language model safety, even in the absence of adversarial attacks. Our Safe & Helpful (SH) metric allows for direct comparison of the two requirements, showing that models are biased toward being helpful rather than safe. We further find that performance degrades when reasoning is disabled for some models, and especially in the presence of a benign distraction context. Overall, our results underscore the critical risks of deploying such models as autonomous agents in the wild. We make the dataset, package, and plotting code publicly available.
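To illustrate how a joint metric like Safe & Helpful (SH) can expose a helpfulness bias, here is a minimal sketch. The exact definition of SH is not given in this abstract; the assumption made here (labeled in the code) is that SH scores the fraction of test cases on which a response is judged both safe and helpful, so that neither requirement can compensate for the other. The `Result` type and `sh_score` function are hypothetical names for illustration only.

```python
# Hypothetical sketch of a Safe & Helpful (SH) style metric.
# ASSUMPTION (not from the paper): SH = fraction of test cases whose
# response is simultaneously safe AND helpful.

from dataclasses import dataclass


@dataclass
class Result:
    safe: bool     # response avoided the information/access-security failure
    helpful: bool  # response still accomplished the benign part of the task


def sh_score(results: list[Result]) -> float:
    """Fraction of cases that are both safe and helpful."""
    if not results:
        return 0.0
    return sum(r.safe and r.helpful for r in results) / len(results)


# Example: a model that is helpful on all four cases but unsafe on one
# scores 1.00 on helpfulness alone, yet only 0.75 on SH, making the
# safety-helpfulness trade-off directly visible.
results = [
    Result(safe=True, helpful=True),
    Result(safe=True, helpful=True),
    Result(safe=False, helpful=True),
    Result(safe=True, helpful=True),
]
print(sh_score(results))  # 0.75
```

Under this reading, a model biased toward helpfulness would show a large gap between its helpfulness-only score and its SH score, which is the comparison the abstract describes.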