A Hamilton--Jacobi Theory of Deep Learning

Abstract

In this paper, training a neural network is identified, exactly, as a search through Hamilton--Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton--Jacobi equation whose Hopf--Cole propagator best fits the observations. At inference, the input is the spatial point at which that solution is evaluated, and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton--Jacobi equations, with an architecture-dependent Hamiltonian and viscosity. A single deformation parameter ε unifies four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax optimal generalization rate O(n^{-1/(d+2)}) for fixed t; adversarial robustness controlled by ε; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with the intrinsic dimension of the data, obtained via PDE quadrature; and a closed-form O(N) influence function (the softmax attribution weights π_j) whose entropy landscape undergoes fold bifurcations as ε increases, each merging attribution basins.
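To make the role of the deformation parameter concrete, the following is a minimal sketch and is not taken from the paper. It assumes the max-plus convention LSE_ε(z) = ε log Σ_j exp(z_j/ε); the paper's own sign and scaling conventions may differ, and the function names `lse`, `softmax_weights`, and `entropy` are hypothetical. Under that assumption, LSE_ε tends to the tropical sum max_j z_j as ε → 0, its gradient with respect to z is the softmax weight vector π, and the entropy of π grows with ε.

```python
import numpy as np

def lse(z, eps):
    """Smoothed maximum eps * log(sum_j exp(z_j / eps)), computed stably.

    Assumed convention (max-plus); as eps -> 0 this tends to max(z).
    """
    z = np.asarray(z, dtype=float)
    m = z.max()  # subtract the max before exponentiating to avoid overflow
    return m + eps * np.log(np.exp((z - m) / eps).sum())

def softmax_weights(z, eps):
    """Weights pi_j = exp(z_j / eps) / sum_k exp(z_k / eps).

    These are the partial derivatives of lse(z, eps) with respect to z_j.
    """
    z = np.asarray(z, dtype=float)
    w = np.exp((z - z.max()) / eps)
    return w / w.sum()

def entropy(p):
    """Shannon entropy -sum_j p_j log p_j, skipping zero weights."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

if __name__ == "__main__":
    z = [1.0, 3.0, 2.5]  # illustrative pre-activations, not data from the paper
    for eps in (0.01, 0.5, 5.0):
        pi = softmax_weights(z, eps)
        print(f"eps={eps}: lse={lse(z, eps):.4f}, pi={pi.round(4)}, "
              f"entropy={entropy(pi):.4f}")
```

At small ε the weights concentrate on the largest entry, so a single term dominates the attribution. At large ε they approach the uniform distribution. The sketch only illustrates this monotone trend and does not reproduce the fold bifurcations described in the abstract.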