Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

Abstract

As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) are gaining the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning and pairs them with dual rubrics that assess both model responses and reasoning traces, within a judge-agnostic and scalable architecture. An evaluation of 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates range from 14.45% to 72.72%), and dramatic generational improvements suggest that models may increasingly recognize and adapt to evaluation contexts.
