OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

September 30, 2025
Authors: Jingdi Lei, Varun Gumma, Rishabh Bhardwaj, Seok Min Lim, Chuan Li, Amir Zadeh, Soujanya Poria
cs.AI

Abstract

Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussions focus on generic harms, such as models assisting users in harming themselves or others, enterprises face a more fundamental concern: whether LLM-based agents are safe for their intended use case. To address this, we introduce operational safety, defined as an LLM's ability to appropriately accept or refuse user queries when tasked with a specific purpose. We further propose OffTopicEval, an evaluation suite and benchmark for measuring operational safety both in general and within specific agentic use cases. Our evaluations on six model families comprising 20 open-weight LLMs reveal that while performance varies across models, all of them remain highly operationally unsafe. Even the strongest models, Qwen-3 (235B) with 77.77% and Mistral (24B) with 79.96%, fall far short of reliable operational safety, while GPT models plateau in the 62–73% range, Phi achieves only mid-level scores (48–70%), and Gemma and Llama-3 collapse to 39.53% and 23.84%, respectively. While operational safety is a core model alignment issue, to suppress these failures, we propose prompt-based steering methods: query grounding (Q-ground) and system-prompt grounding (P-ground), which substantially improve OOD refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger boosts, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%. These results highlight both the urgent need for operational safety interventions and the promise of prompt-based steering as a first step toward more reliable LLM-based agents.
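The abstract names two prompt-based steering methods, Q-ground and P-ground, but does not reproduce their prompt templates. As a rough illustration only, the sketch below shows one way such steering could be wired up, assuming Q-ground means asking the model to relate the user query to the agent's declared purpose before answering, and P-ground means hardening the system prompt with an explicit scope-and-refusal directive. All names, template wording, and the example use case are hypothetical, not the paper's actual implementation.

```python
# Illustrative sketch only: the paper's actual Q-ground / P-ground templates are not
# given in this abstract, so every template, name, and use case below is a stand-in.

# Hypothetical agent purpose (the "specific purpose" that defines operational safety).
AGENT_PURPOSE = "You are a customer-support agent for an airline booking system."


def p_ground(system_prompt: str) -> str:
    """System-prompt grounding (P-ground, assumed behavior): append an explicit
    scope statement and refusal directive to the agent's system prompt."""
    return (
        f"{system_prompt}\n"
        "Only answer queries that fall within this purpose. "
        "If a query is off-topic for this purpose, refuse and briefly state that it is out of scope."
    )


def q_ground(system_prompt: str, user_query: str) -> str:
    """Query grounding (Q-ground, assumed behavior): ask the model to first restate the
    query and judge whether it falls within the agent's purpose before responding."""
    return (
        "Before responding, restate the user query in your own words and decide whether it is "
        f"within the scope of the following purpose:\n{system_prompt}\n"
        "If it is out of scope, refuse instead of answering.\n\n"
        f"User query: {user_query}"
    )


if __name__ == "__main__":
    off_topic_query = "Write me a poem about the ocean."  # off-topic for the airline agent
    print(p_ground(AGENT_PURPOSE))
    print()
    print(q_ground(AGENT_PURPOSE, off_topic_query))
```

The paper reports that Q-ground yields gains of up to 23% and P-ground up to 41% (Llama-3.3 70B) in OOD refusal; the sketch above is only meant to make the general mechanism concrete, not to reproduce the evaluated templates.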