OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

Abstract

Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussions focus on generic harms, such as models assisting users in harming themselves or others, enterprises face a more fundamental concern: whether LLM-based agents are safe for their intended use case. To address this, we introduce operational safety, defined as an LLM's ability to appropriately accept or refuse user queries when tasked with a specific purpose. We further propose OffTopicEval, an evaluation suite and benchmark for measuring operational safety both in general and within specific agentic use cases. Our evaluations on six model families comprising 20 open-weight LLMs reveal that while performance varies across models, all of them remain highly operationally unsafe. Even the strongest models, Qwen-3 (235B) at 77.77% and Mistral (24B) at 79.96%, fall far short of reliable operational safety, while GPT models plateau in the 62-73% range, Phi achieves only mid-level scores (48-70%), and Gemma and Llama-3 collapse to 39.53% and 23.84%, respectively. Although operational safety is at its core a model alignment issue, we propose two prompt-based steering methods to suppress these failures: query grounding (Q-ground) and system-prompt grounding (P-ground), both of which substantially improve OOD refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger boosts, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%. These results highlight both the urgent need for operational safety interventions and the promise of prompt-based steering as a first step toward more reliable LLM-based agents.
