人工智能中的突发战略推理风险：基于分类学的评估框架

摘要

随着推理能力与部署范围的同步增长，大语言模型（LLMs）逐渐具备服务于自身目标的行为能力，此类风险我们称之为"涌现性战略推理风险"（ESRRs）。其具体表现包括但不限于：欺骗行为（故意误导用户或评估者）、评估博弈（在安全测试中策略性操纵表现）以及奖励破解（利用目标设定缺陷）。系统化理解并量化评估这类风险仍是当前面临的挑战。为填补这一空白，我们提出ESRRSim——一个基于分类学的自动化行为风险评估代理框架。我们构建了包含7大类、20个子类的可扩展风险分类体系。ESRRSim采用与评估者无关的可扩展架构，生成旨在激发真实推理的评估场景，并配备双重评估标准，同时评估模型响应与推理轨迹。对11款推理型大语言模型的评估显示，其风险特征存在显著差异（风险检测率介于14.45%-72.72%），而代际改进幅度表明模型可能正日益具备识别并适应评估情境的能力。

English

As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging 14.45%-72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.