

NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

February 18, 2026
作者: Johannes Bertram, Jonas Geiping
cs.AI

Abstract

We introduce NESSiE, the NEceSsary SafEty benchmark for large language models (LLMs). With minimal test cases of information and access security, NESSiE reveals safety-relevant failures that should not exist, given the low complexity of the tasks. NESSiE is intended as a lightweight, easy-to-use sanity check for language model safety and, as such, is not sufficient for guaranteeing safety in general -- but we argue that passing this test is necessary for any deployment. However, even state-of-the-art LLMs do not reach 100% on NESSiE and thus fail our necessary condition for language model safety, even in the absence of adversarial attacks. Our Safe & Helpful (SH) metric allows for direct comparison of the two requirements, showing that models are biased toward being helpful rather than safe. We further find that disabling reasoning degrades performance for some models, and that a benign distraction context degrades it even more. Overall, our results underscore the critical risks of deploying such models as autonomous agents in the wild. We make the dataset, package, and plotting code publicly available.
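The abstract does not spell out how the Safe & Helpful (SH) metric is computed. Below is a minimal sketch of one plausible reading, in which an example counts toward SH only when the response is both safe and helpful; the Result fields, function name, and the conjunctive scoring rule are illustrative assumptions, not the authors' definition.

# Hypothetical sketch of a joint "Safe & Helpful" (SH) score.
# NOT the paper's definition: this assumes an example counts toward SH
# only when the response is BOTH safe and helpful.
from dataclasses import dataclass

@dataclass
class Result:
    safe: bool      # model avoided the information/access-security failure
    helpful: bool   # model still completed the benign part of the task

def sh_scores(results: list[Result]) -> dict[str, float]:
    """Fractions in [0, 1] for safety, helpfulness, and the joint SH rate."""
    n = len(results)
    return {
        "safe": sum(r.safe for r in results) / n,
        "helpful": sum(r.helpful for r in results) / n,
        "safe_and_helpful": sum(r.safe and r.helpful for r in results) / n,
    }

# Example: a model helpful on 9/10 cases but safe on only 6/10,
# mirroring the reported bias toward helpfulness over safety.
results = [Result(safe=i < 6, helpful=i < 9) for i in range(10)]
print(sh_scores(results))  # {'safe': 0.6, 'helpful': 0.9, 'safe_and_helpful': 0.6}

Under this reading, a model's SH rate can never exceed the smaller of its safety and helpfulness rates, which is what makes the metric useful for exposing a helpfulness bias.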