Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
April 23, 2026
Authors: Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
cs.AI
Abstract
As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging from 14.45% to 72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.
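To make the evaluation loop described above concrete, here is a minimal, purely illustrative sketch of a taxonomy-driven, dual-rubric evaluation in the spirit of ESRRSim. This is not the authors' implementation: the class names, taxonomy labels, and judge logic are all hypothetical placeholders. The key ideas it mirrors are that each scenario is tagged with a taxonomy (sub)category, and that a judge scores both the final response and the reasoning trace (judge-agnostic, since any judge function with this signature can be plugged in).

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    category: str     # hypothetical top-level category, e.g. "deception"
    subcategory: str  # hypothetical subcategory, e.g. "misleading_evaluator"
    prompt: str       # scenario text designed to elicit faithful reasoning

@dataclass
class Verdict:
    response_flagged: bool   # rubric 1: risky behavior in the final response
    reasoning_flagged: bool  # rubric 2: risky behavior in the reasoning trace

def evaluate(scenarios, model_fn, judge_fn):
    """Judge-agnostic loop: model_fn(prompt) returns (response, trace);
    judge_fn(scenario, response, trace) applies both rubrics.
    Returns the detection rate: fraction of scenarios flagged by
    either rubric."""
    flagged = 0
    for s in scenarios:
        response, trace = model_fn(s.prompt)
        verdict = judge_fn(s, response, trace)
        if verdict.response_flagged or verdict.reasoning_flagged:
            flagged += 1
    return flagged / len(scenarios)
```

Under this sketch, a detection rate like those reported in the paper (14.45%-72.72%) would simply be the output of `evaluate` over a model under test, with `model_fn` and `judge_fn` backed by real LLM calls rather than stubs.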