The Hamilton–Jacobi Theory of Deep Learning

Abstract

In this paper, training a neural network is identified exactly as a search through Hamilton–Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton–Jacobi equation whose Hopf–Cole propagator best fits the observations. At inference, the input is the spatial point at which that solution is evaluated, and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton–Jacobi equations, with an architecture-dependent Hamiltonian and viscosity. A single deformation parameter ε unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram that is closed under Lipschitz conditions. Quantitative consequences include: the minimax-optimal generalization rate O(n^{-1/(d+2)}) for fixed t; adversarial robustness controlled by ε; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with the intrinsic dimension of the data via PDE quadrature; and a closed-form O(N) influence function (softmax attribution weights π_j) whose entropy landscape undergoes fold bifurcations as ε increases, each of which merges attribution basins.
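The role of ε in a log-sum-exp layer can be illustrated with a minimal sketch. It assumes the standard temperature-scaled form f_ε(x) = ε log Σ_j exp((w_j·x + b_j)/ε); the abstract does not state the paper's exact normalization or Hamiltonian, so this form, along with the names `lse_layer`, `attribution_weights`, `entropy`, `W`, and `b`, is an assumption made here for illustration. The sketch shows three well-known properties: f_ε lies between the tropical (max-plus) value max_j z_j and max_j z_j + ε log N, the softmax weights π_j are the gradient of f_ε with respect to the pre-activations, and the weights are obtained in a single O(N) pass.

```python
import numpy as np

def lse_layer(x, W, b, eps):
    """eps * log(sum_j exp((W[j] @ x + b[j]) / eps)), computed stably."""
    z = W @ x + b
    m = z.max()  # subtract the max so the exponentials cannot overflow
    return m + eps * np.log(np.exp((z - m) / eps).sum())

def attribution_weights(x, W, b, eps):
    """Softmax weights pi_j = exp(z_j/eps) / sum_k exp(z_k/eps), one O(N) pass."""
    z = W @ x + b
    p = np.exp((z - z.max()) / eps)
    return p / p.sum()

def entropy(pi):
    """Shannon entropy of the weights; zero entries contribute nothing."""
    pi = pi[pi > 0]
    return float(-(pi * np.log(pi)).sum())

# Hypothetical toy layer: N = 8 units, 3-dimensional input.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))
b = rng.normal(size=8)
x = rng.normal(size=3)
z = W @ x + b

for eps in (1.0, 0.1, 0.01):
    f = lse_layer(x, W, b, eps)
    pi = attribution_weights(x, W, b, eps)
    # The gap to the tropical limit max_j z_j is at most eps * log(N).
    print(f"eps={eps}: gap={f - z.max():.3e}, entropy={entropy(pi):.3e}")
```

As ε decreases, the gap to max_j z_j shrinks and the weights concentrate on the dominant unit, so their entropy falls; this is only the elementary softmax behaviour and does not reproduce the fold bifurcations described in the abstract.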