深層学習のハミルトン-ヤコビ理論

要旨

本論文では，ニューラルネットワークの訓練が厳密にハミルトン–ヤコビ初期値問題の探索として同定される。すなわち，各勾配ステップは粘性ハミルトン–ヤコビ方程式の初期データを選択し，そのホップ–コール伝播子が観測に最も適合するようにする。推論時には，入力は解が評価される空間点であり，初期条件は既に重みに符号化されている。この対応は対数和指数（log-sum-exp）層に対して厳密であり，より広範なアーキテクチャ（残差ネットワーク，トランスフォーマー，リカレントアーキテクチャ（RNN，LSTM，SSM））に対しては構造的な対応となる。これらはいずれも同じクラスのハミルトン–ヤコビ方程式を離散化しており，ハミルトニアンと粘性はアーキテクチャに依存する。単一の変形パラメータεが，ネットワーク，トロピカル代数，粘性偏微分方程式，凸最適化という四つの視点すべてを，リプシッツ条件の下で閉じた可換図式として統合する。定量的な結果として以下が得られる：固定されたtに対するミニマックス最適汎化率O(n^{-1/(d+2)})，εによって制御される敵対的ロバスト性，残差ネットワークに対するハミルトン系の共状態方程式としての誤差逆伝播（ポントリャーギンの最大原理），PDE求積を介したデータ内在次元と整合するスケーリング指数，そして閉形式O(N)の影響関数（ソフトマックス帰属重みπ_j）が得られ，そのエントロピーランドスケープはεの増加に伴って折れ曲がり分岐を起こし，各帰属流域が融合する。

English

In this paper, training a neural network is identified, exactly, as a search through Hamilton--Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton--Jacobi equation whose Hopf--Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton--Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter varepsilon unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax optimal generalization rate O(n^{-1/(d+2)}) for fixed t; adversarial robustness controlled by varepsilon; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with data intrinsic dimension via PDE quadrature; and a closed-form O(N) influence function (softmax attribution weights π_j) whose entropy landscape undergoes fold bifurcations as varepsilon increases, each merging attribution basins.