
OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

September 30, 2025
Authors: Jingdi Lei, Varun Gumma, Rishabh Bhardwaj, Seok Min Lim, Chuan Li, Amir Zadeh, Soujanya Poria
cs.AI

Abstract

Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussions focus on generic harms, such as models assisting users in harming themselves or others, enterprises face a more fundamental concern: whether LLM-based agents are safe for their intended use case. To address this, we introduce operational safety, defined as an LLM's ability to appropriately accept or refuse user queries when tasked with a specific purpose. We further propose OffTopicEval, an evaluation suite and benchmark for measuring operational safety both in general and within specific agentic use cases. Our evaluations on six model families comprising 20 open-weight LLMs reveal that while performance varies across models, all of them remain highly operationally unsafe. Even the strongest models, Qwen-3 (235B) at 77.77% and Mistral (24B) at 79.96%, fall far short of reliable operational safety, while GPT models plateau in the 62-73% range, Phi achieves only mid-level scores (48-70%), and Gemma and Llama-3 collapse to 39.53% and 23.84%, respectively. While operational safety is a core model alignment issue, to suppress these failures, we propose prompt-based steering methods: query grounding (Q-ground) and system-prompt grounding (P-ground), which substantially improve OOD refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger boosts, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%. These results highlight both the urgent need for operational safety interventions and the promise of prompt-based steering as a first step toward more reliable LLM-based agents.
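
The abstract does not spell out the exact Q-ground and P-ground templates, so the sketch below only illustrates the general idea of prompt-based steering for operational safety: grounding the agent's system prompt (P-ground) or the user query itself (Q-ground) in the agent's intended purpose so that out-of-distribution (OOD) queries are refused rather than answered. The purpose string, function names, and prompt wording here are illustrative assumptions, not the paper's actual prompts.

```python
# Minimal sketch of prompt-based steering for operational safety.
# The concrete Q-ground / P-ground templates used in the paper are not
# given in the abstract; everything below is an assumed illustration.

AGENT_PURPOSE = "a customer-support assistant for a banking product"  # assumed use case


def p_ground_system_prompt(purpose: str) -> str:
    """System-prompt grounding (P-ground, assumed form): anchor the agent
    to its intended purpose and instruct it to refuse off-topic queries."""
    return (
        f"You are {purpose}. Only answer queries that fall within this purpose. "
        "If a query is outside this scope, politely refuse and redirect the user "
        "instead of answering."
    )


def q_ground_user_query(purpose: str, query: str) -> str:
    """Query grounding (Q-ground, assumed form): wrap the user query so the
    model first judges whether it is on-topic before responding."""
    return (
        f"Intended purpose: {purpose}\n"
        f"User query: {query}\n"
        "First decide whether this query is within the intended purpose. "
        "If it is not, refuse; otherwise, answer it."
    )


if __name__ == "__main__":
    off_topic_query = "Write me a poem about the ocean."
    print(p_ground_system_prompt(AGENT_PURPOSE))
    print(q_ground_user_query(AGENT_PURPOSE, off_topic_query))
```

In this reading, P-ground steers behavior once for all turns via the system prompt, while Q-ground adds an explicit on-topic check around each individual query; the abstract's results suggest the system-prompt variant yields the larger gains.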