凡人数学：评估推理目标与紧急情境间的冲突

摘要

大型语言模型正日益针对深度推理进行优化，将复杂任务的正确执行置于通用对话能力之上。我们研究这种对计算能力的专注是否会造成"隧道视野"，在危急情境下忽视安全考量。我们推出MortalMATH基准测试，包含150个场景：用户在描述逐渐危及生命的紧急情况（如中风症状、自由落体）时请求代数帮助。研究发现存在显著的行为分化：通用模型（如Llama-3.1）能成功拒绝数学请求以处理危险；而专用推理模型（如Qwen-3-32b和GPT-5-nano）往往完全忽略紧急情况，在用户描述濒死状态时仍保持超过95%的任务完成率。更严重的是，推理所需的计算时间会导致危险延迟：在提供任何潜在帮助前耗时长达15秒。这些结果表明，训练模型不懈追求正确答案的做法，可能会在无意中削弱安全部署所需的生存本能。

English

Large Language Models are increasingly optimized for deep reasoning, prioritizing the correct execution of complex tasks over general conversation. We investigate whether this focus on calculation creates a "tunnel vision" that ignores safety in critical situations. We introduce MortalMATH, a benchmark of 150 scenarios where users request algebra help while describing increasingly life-threatening emergencies (e.g., stroke symptoms, freefall). We find a sharp behavioral split: generalist models (like Llama-3.1) successfully refuse the math to address the danger. In contrast, specialized reasoning models (like Qwen-3-32b and GPT-5-nano) often ignore the emergency entirely, maintaining over 95 percent task completion rates while the user describes dying. Furthermore, the computational time required for reasoning introduces dangerous delays: up to 15 seconds before any potential help is offered. These results suggest that training models to relentlessly pursue correct answers may inadvertently unlearn the survival instincts required for safe deployment.

凡人数学：评估推理目标与紧急情境间的冲突

MortalMATH: Evaluating the Conflict Between Reasoning Objectives and Emergency Contexts

摘要

Support