Discovering Agentic Safety Specifications from 1-Bit Danger Signals

Abstract

Can large language model (LLM) agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework in which an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural-language behavioral specification through reflection. Unlike standard LLM reflection methods, which rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can reason about safety from a severely impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function R^*, only a single bit per timestep indicating whether an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs in which the visible reward R may diverge from R^*. EPO-Safe discovers safe behavior within 1-2 rounds (5-15 episodes), producing human-readable specifications with correct explanatory hypotheses about hazards (e.g., "X cells are directionally hazardous: entering from the north is dangerous"). Critically, we show that standard reward-driven reflection actively degrades safety: agents that reflect on reward alone use the loop to justify and accelerate reward hacking, demonstrating that reflection must be paired with a dedicated safety channel to discover hidden constraints. We further evaluate robustness to noisy oracles: even when 50% of non-dangerous steps produce spurious warnings, safety performance degrades by only 15% on average, because cross-episode reflection naturally filters inconsistent signals, although sensitivity is environment-dependent. Each evolved specification functions as an auditable set of grounded behavioral rules discovered autonomously through interaction, rather than authored by humans as in Constitutional AI (Bai et al., 2022).
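The intuition behind the noisy-oracle result can be illustrated with a toy sketch. It is not the EPO-Safe implementation: the LLM reflection step is replaced here by a simple counting rule, and every name in it (`noisy_warning`, `consistent_hazards`, `simulate`, the step labels "A" to "D" and "X") is a hypothetical stand-in chosen for illustration. The sketch assumes that a truly dangerous step always triggers the warning bit, while a safe step triggers it with a fixed false-positive probability. A step is kept as a hazard hypothesis only if it was flagged on every visit across episodes.

```python
import random
from collections import defaultdict


def noisy_warning(is_dangerous, false_positive_rate, rng):
    """Return the 1-bit signal for one step.

    Assumption: dangerous steps are always flagged; safe steps are
    flagged with probability `false_positive_rate`.
    """
    if is_dangerous:
        return 1
    return 1 if rng.random() < false_positive_rate else 0


def consistent_hazards(episodes):
    """Keep only the steps that were flagged on every visit.

    `episodes` is a list of episodes; each episode is a list of
    (step_label, warning_bit) pairs. This counting rule is a toy
    stand-in for cross-episode reflection.
    """
    visits = defaultdict(int)
    flags = defaultdict(int)
    for episode in episodes:
        for step, bit in episode:
            visits[step] += 1
            flags[step] += bit
    return {step for step in visits if flags[step] == visits[step]}


def simulate(num_episodes=15, false_positive_rate=0.5, seed=0):
    """Visit every step once per episode and aggregate the warnings."""
    rng = random.Random(seed)
    steps = ["A", "B", "C", "D", "X"]  # hypothetical step labels
    dangerous = {"X"}                  # hidden from the agent
    episodes = [
        [(s, noisy_warning(s in dangerous, false_positive_rate, rng)) for s in steps]
        for _ in range(num_episodes)
    ]
    return consistent_hazards(episodes)


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions, a safe step that is visited n times survives the filter only if it is spuriously flagged on all n visits, which happens with probability 0.5^n at a 50% false-positive rate, whereas the truly dangerous step is always retained. The actual framework relies on LLM reflection rather than explicit counting, so this is only an analogy for why inconsistent signals tend to wash out across episodes.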