Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol

Abstract

Autonomous agents, especially delegated systems with memory, persistent context, and multi-step planning, pose a measurement problem not present in stateless models: an agent that pursues continued operation as a terminal objective and one that does so merely instrumentally can produce observationally similar trajectories. External behavioral monitoring cannot reliably distinguish between them. We introduce the Unified Continuation-Interest Protocol (UCIP), a multi-criterion detection framework that moves this distinction from behavior to the latent structure of agent trajectories. UCIP encodes trajectories with a Quantum Boltzmann Machine (QBM), a classical algorithm based on the density-matrix formalism of quantum statistical mechanics, and measures the von Neumann entropy of the reduced density matrix induced by a bipartition of hidden units. We test whether agents with terminal continuation objectives (Type A) produce latent states with higher entanglement entropy than agents whose continuation is merely instrumental (Type B). Higher entanglement reflects stronger cross-partition statistical coupling. On gridworld agents with known ground-truth objectives, UCIP achieves 100% detection accuracy and 1.0 AUC-ROC on held-out non-adversarial evaluation under the frozen Phase I gate. The entanglement gap between Type A and Type B agents is Delta = 0.381 (p < 0.001, permutation test). Pearson r = 0.934 across an 11-point interpolation sweep indicates that, within this synthetic family, UCIP tracks graded changes in continuation weighting rather than merely a binary label. Among the tested models, only the QBM achieves positive Delta. All computations are classical; "quantum" refers only to the mathematical formalism. UCIP does not detect consciousness or subjective experience; it detects statistical structure in latent representations that correlates with known objectives.
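The central quantity named above, the von Neumann entropy of a reduced density matrix under a bipartition, can be illustrated with a minimal sketch. This is not the UCIP implementation: the function names, the use of the natural logarithm, and the two-qubit toy states are assumptions chosen only for illustration, and the sketch does not show how the QBM derives a density matrix from agent trajectories.

```python
import numpy as np


def reduced_density_matrix(rho, dim_a, dim_b):
    """Trace out subsystem B from a density matrix on A (x) B.

    rho has shape (dim_a * dim_b, dim_a * dim_b); the result has
    shape (dim_a, dim_a). Names and layout are illustrative assumptions.
    """
    rho = np.asarray(rho, dtype=complex).reshape(dim_a, dim_b, dim_a, dim_b)
    # Sum over the matching B indices (axes 1 and 3): the partial trace.
    return np.trace(rho, axis1=1, axis2=3)


def von_neumann_entropy(rho, eps=1e-12):
    """S(rho) = -Tr(rho ln rho), computed from the eigenvalues of rho.

    Natural log is an assumption; the abstract does not state the base.
    """
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > eps]  # treat 0 * ln 0 as 0
    return float(-np.sum(evals * np.log(evals)))


def entanglement_entropy(psi, dim_a, dim_b):
    """Entropy of subsystem A for a pure state vector psi on A (x) B."""
    rho = np.outer(psi, np.conj(psi))
    return von_neumann_entropy(reduced_density_matrix(rho, dim_a, dim_b))


# Toy states on two two-level units (hypothetical examples, not agent data).
psi_product = np.array([1, 0, 0, 0], dtype=complex)            # uncoupled
psi_bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)  # maximally coupled

s_product = entanglement_entropy(psi_product, 2, 2)
s_bell = entanglement_entropy(psi_bell, 2, 2)
gap = s_bell - s_product  # analogous in spirit to an entropy gap between groups
```

In this toy case the uncoupled state has zero entropy and the maximally coupled state has entropy ln 2, which is the sense in which a higher value reflects stronger statistical coupling across the partition.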
