Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol
March 11, 2026
Author: Christopher Altman
cs.AI
Abstract
Autonomous agents, especially delegated systems with memory, persistent context, and multi-step planning, pose a measurement problem not present in stateless models: an agent that preserves continued operation as a terminal objective and one that does so merely instrumentally can produce observationally similar trajectories. External behavioral monitoring cannot reliably distinguish between them. We introduce the Unified Continuation-Interest Protocol (UCIP), a multi-criterion detection framework that moves this distinction from behavior to the latent structure of agent trajectories. UCIP encodes trajectories with a Quantum Boltzmann Machine (QBM), a classical algorithm based on the density-matrix formalism of quantum statistical mechanics, and measures the von Neumann entropy of the reduced density matrix induced by a bipartition of hidden units.
We test whether agents with terminal continuation objectives (Type A) produce latent states with higher entanglement entropy than agents whose continuation is merely instrumental (Type B). Higher entanglement entropy reflects stronger statistical coupling across the hidden-unit bipartition.
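The entropy measure described above can be illustrated with a minimal sketch. It is not the UCIP implementation: it assumes binary hidden units, a density matrix already available as a NumPy array, a bipartition that keeps the first `k` units, and the natural logarithm. The function names `reduced_density_matrix` and `von_neumann_entropy` are hypothetical.

```python
import numpy as np

def reduced_density_matrix(rho, n_units, k):
    """Trace out the last n_units - k binary units, keeping the first k.

    Assumption: rho is a (2**n_units, 2**n_units) density matrix whose
    basis index is ordered as (kept units, traced-out units).
    """
    dim_a, dim_b = 2 ** k, 2 ** (n_units - k)
    rho = np.asarray(rho).reshape(dim_a, dim_b, dim_a, dim_b)
    # Partial trace: sum over the matching indices of subsystem B.
    return np.einsum("ajbj->ab", rho)

def von_neumann_entropy(rho, eps=1e-12):
    """S(rho) = -Tr(rho log rho), computed from the eigenvalues (in nats)."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > eps]  # drop zeros, since 0 * log 0 is taken as 0
    return float(-np.sum(evals * np.log(evals)))

# Two-unit toy states: a product state and a maximally entangled state.
product = np.array([1.0, 0.0, 0.0, 0.0])               # |00>
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2.0)   # (|00> + |11>) / sqrt(2)

s_product = von_neumann_entropy(
    reduced_density_matrix(np.outer(product, product), n_units=2, k=1))
s_bell = von_neumann_entropy(
    reduced_density_matrix(np.outer(bell, bell), n_units=2, k=1))
```

In this toy case the product state has no cross-partition coupling and gives zero entropy, while the entangled state gives the two-level maximum of log 2. How UCIP builds its density matrix from QBM hidden units is not specified in this abstract, so that step is left out.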
On gridworld agents with known ground-truth objectives, UCIP achieves 100% detection accuracy and an AUC-ROC of 1.0 on held-out non-adversarial evaluation under the frozen Phase I gate. The entanglement gap between Type A and Type B agents is Δ = 0.381 (p < 0.001, permutation test). A Pearson correlation of r = 0.934 across an 11-point interpolation sweep indicates that, within this synthetic family, UCIP tracks graded changes in continuation weighting rather than merely a binary label. Among the tested models, only the QBM achieves a positive Δ. All computations are classical; "quantum" refers only to the mathematical formalism. UCIP does not detect consciousness or subjective experience; it detects statistical structure in latent representations that correlates with known objectives.
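The two statistics reported above, a permutation-test p-value for the entropy gap and a Pearson correlation across an interpolation sweep, can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: Δ is taken to be a difference of group means, the test is one-sided, the function name `permutation_test_gap` is hypothetical, and every number below is synthetic placeholder data rather than a result from the paper.

```python
import numpy as np

def permutation_test_gap(a, b, n_perm=2000, seed=0):
    """One-sided permutation test for mean(a) - mean(b) > 0.

    Returns the observed gap and a p-value with the usual +1 correction.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)  # shuffle the group labels
        gap = perm[:a.size].mean() - perm[a.size:].mean()
        hits += int(gap >= observed)
    return float(observed), (hits + 1) / (n_perm + 1)

# Synthetic placeholder entropies (NOT data from the paper).
rng = np.random.default_rng(42)
type_a = rng.normal(0.8, 0.05, size=20)  # hypothetical Type A entropies
type_b = rng.normal(0.4, 0.05, size=20)  # hypothetical Type B entropies
gap, p_value = permutation_test_gap(type_a, type_b)

# Synthetic 11-point sweep: entropy rising with continuation weight.
weights = np.linspace(0.0, 1.0, 11)
sweep_entropy = 0.4 + 0.4 * weights + rng.normal(0.0, 0.02, size=11)
pearson_r = float(np.corrcoef(weights, sweep_entropy)[0, 1])
```

With 2000 permutations the smallest attainable p-value is 1/2001, so resolving a threshold such as p < 0.001 needs at least 1000 permutations.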