A Hamilton--Jacobi Theory of Deep Learning

Abstract

This paper identifies the training of a neural network, exactly, with a search over Hamilton--Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton--Jacobi equation whose Hopf--Cole propagator best fits the observations. At inference, the input is the spatial point at which that solution is evaluated, and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton--Jacobi equations, differing only in Hamiltonian and viscosity. A single deformation parameter ε unifies four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax-optimal generalization rate O(n^{-1/(d+2)}) for fixed t; adversarial robustness controlled by ε; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents, obtained via PDE quadrature, that are consistent with the intrinsic dimension of the data; and a closed-form O(N) influence function (softmax attribution weights π_j) whose entropy landscape undergoes fold bifurcations as ε increases, each merging attribution basins.
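The role of the deformation parameter for a single log-sum-exp layer can be illustrated with a small numerical sketch. It assumes the common scaling ε·log Σ_j exp(a_j/ε); the abstract does not state the paper's own normalization, so this scaling is an assumption. The function names and the `scores` values are hypothetical and chosen only for illustration. The sketch relies on three standard facts: the smoothed value lies between max_j a_j and max_j a_j + ε·log N, its gradient with respect to a_j is the softmax weight π_j, and the entropy of those weights shrinks as ε decreases toward the tropical (max-plus) limit.

```python
import math

def smooth_max(a, eps):
    """eps * log(sum_j exp(a_j / eps)), shifted by max(a) for numerical stability."""
    m = max(a)
    return m + eps * math.log(sum(math.exp((x - m) / eps) for x in a))

def attribution_weights(a, eps):
    """Softmax weights pi_j = d smooth_max / d a_j; computed in O(N) and summing to 1."""
    m = max(a)
    w = [math.exp((x - m) / eps) for x in a]
    z = sum(w)
    return [x / z for x in w]

def entropy(p):
    """Shannon entropy of a probability vector (terms with p_j = 0 contribute 0)."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

# Hypothetical outputs a_j of N = 3 affine units at one fixed input.
scores = [1.0, 0.5, -2.0]

for eps in (1.0, 0.1, 0.01):
    pi = attribution_weights(scores, eps)
    print(f"eps={eps}: value={smooth_max(scores, eps):.6f}, "
          f"pi={[round(p, 4) for p in pi]}, entropy={entropy(pi):.4f}")
```

As ε shrinks, the printed value approaches max(scores) and the weights concentrate on the largest score, so the attribution entropy falls toward zero. The sketch does not reproduce the fold bifurcations mentioned in the abstract, which concern how this entropy varies across inputs.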