Utility-Learning Tension in Self-Modifying Agents
October 5, 2025
Authors: Charles L. Wang, Keir Dorchen, Peter Jin
cs.AI
Abstract
As systems trend toward superintelligence, a natural modeling premise is that
agents can self-improve along every facet of their own design. We formalize
this with a five-axis decomposition and a decision layer, separating incentives
from learning behavior and analyzing each axis in isolation. Our central
result identifies a sharp utility-learning tension: a structural conflict in
self-modifying systems whereby utility-driven changes that improve immediate
or expected performance can also erode the statistical preconditions for
reliable learning and generalization. We show that distribution-free
guarantees are preserved if and only if the policy-reachable model family is
uniformly capacity-bounded; when capacity can grow without limit,
utility-rational self-modifications can render learnable tasks unlearnable.
Under assumptions standard in practice, these axes reduce to the same capacity
criterion, yielding a single boundary for safe self-modification. Numerical
experiments across several axes validate the theory by comparing destructive
utility policies against our proposed learnability-preserving two-gate
policies.
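
As an illustrative formalization of the capacity criterion (the shorthand
H_pi and Reach(pi) below are ours, not notation from the paper), the
distribution-free guarantee can be read as a classical uniform convergence
requirement over the policy-reachable family:

    % Illustrative only: \mathcal{H}_{\pi} and \mathrm{Reach}(\pi) are assumed shorthand.
    \[
      \mathcal{H}_{\pi} \;=\; \bigcup_{m \in \mathrm{Reach}(\pi)} \mathcal{H}_{m},
      \qquad
      \sup_{h \in \mathcal{H}_{\pi}} \bigl|\, R(h) - \widehat{R}_{n}(h) \,\bigr|
      \;\le\; c \,\sqrt{\frac{\mathrm{VCdim}(\mathcal{H}_{\pi}) + \log(1/\delta)}{n}}
      \quad \text{with probability at least } 1 - \delta .
    \]

The bound is nonvacuous exactly when VCdim(H_pi) is finite, which matches the
abstract's claim: if utility-rational self-modifications can grow the
reachable family's capacity without limit, the right-hand side diverges and
distribution-free learnability is lost.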
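The two-gate policy is described here only at a high level. As a minimal
sketch in Python, assuming one gate checks expected utility and the other
checks a capacity proxy on the post-modification model family, an acceptance
rule might look like the following; all names and thresholds are hypothetical,
not taken from the paper:

    from dataclasses import dataclass

    @dataclass
    class Modification:
        """A proposed self-modification (hypothetical structure)."""
        expected_utility_gain: float  # estimated change in task utility
        resulting_capacity: float     # capacity proxy (e.g., a VC-dimension
                                      # estimate) of the post-edit family

    def two_gate_accept(mod: Modification,
                        utility_floor: float = 0.0,
                        capacity_cap: float = 1e4) -> bool:
        """Accept a self-modification only if it passes both gates:
        (1) it does not reduce expected utility, and
        (2) the reachable model family stays uniformly capacity-bounded."""
        passes_utility = mod.expected_utility_gain >= utility_floor
        passes_capacity = mod.resulting_capacity <= capacity_cap
        return passes_utility and passes_capacity

    # A utility-improving edit that blows up capacity is rejected by gate (2);
    # a purely utility-driven rule would have accepted it.
    print(two_gate_accept(Modification(expected_utility_gain=0.3,
                                       resulting_capacity=5e6)))  # False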