
Thermodynamic Natural Gradient Descent

May 22, 2024
Authors: Kaelan Donatella, Samuel Duffield, Maxwell Aifer, Denis Melanson, Gavin Crooks, Patrick J. Coles
cs.AI

Abstract

Second-order training methods have better convergence properties than gradient descent but are rarely used in practice for large-scale training due to their computational overhead. This can be viewed as a hardware limitation (imposed by digital computers). Here we show that natural gradient descent (NGD), a second-order method, can have a similar computational complexity per iteration to a first-order method, when employing appropriate hardware. We present a new hybrid digital-analog algorithm for training neural networks that is equivalent to NGD in a certain parameter regime but avoids prohibitively costly linear system solves. Our algorithm exploits the thermodynamic properties of an analog system at equilibrium, and hence requires an analog thermodynamic computer. The training occurs in a hybrid digital-analog loop, where the gradient and Fisher information matrix (or any other positive semi-definite curvature matrix) are calculated at given time intervals while the analog dynamics take place. We numerically demonstrate the superiority of this approach over state-of-the-art digital first- and second-order training methods on classification tasks and language model fine-tuning tasks.
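To make the core idea concrete, here is a minimal sketch (not the authors' implementation, and purely digital) contrasting a standard NGD step, which solves the linear system F⁻¹g explicitly, with an emulated "thermodynamic" step in which the solve is replaced by relaxing noisy linear (Ornstein-Uhlenbeck) dynamics whose equilibrium mean is the solution. All names, step sizes, and the toy curvature matrix are illustrative assumptions, not details from the paper.

```python
# Sketch: NGD with an explicit linear solve vs. an emulated analog relaxation.
# The analog hardware is modeled here by Euler-Maruyama integration of
#   dx = -(F x - g) dt + sqrt(2*T) dW,
# whose stationary mean is x* = F^{-1} g, so no explicit solve is needed.
import numpy as np

rng = np.random.default_rng(0)

def ngd_step(theta, grad, fisher, lr=0.1, damping=1e-3):
    """Standard natural gradient step: theta <- theta - lr * F^{-1} g (digital solve)."""
    d = theta.size
    update = np.linalg.solve(fisher + damping * np.eye(d), grad)
    return theta - lr * update

def thermodynamic_ngd_step(theta, grad, fisher, lr=0.1, damping=1e-3,
                           dt=0.01, n_steps=50_000, temperature=1e-4):
    """Emulated analog step: relax noisy linear dynamics toward equilibrium
    instead of inverting the curvature matrix."""
    d = theta.size
    A = fisher + damping * np.eye(d)
    x = np.zeros(d)
    for _ in range(n_steps):
        drift = -(A @ x - grad)
        noise = np.sqrt(2.0 * temperature * dt) * rng.standard_normal(d)
        x = x + drift * dt + noise
    # At equilibrium, x fluctuates around F^{-1} g; use it as the update.
    return theta - lr * x

# Toy usage with a well-conditioned PSD curvature matrix (illustrative only).
d = 8
M = rng.standard_normal((d, d))
fisher = M @ M.T / d + 0.5 * np.eye(d)   # positive semi-definite curvature estimate
grad = rng.standard_normal(d)            # gradient at the current parameters
theta = np.zeros(d)

print(ngd_step(theta, grad, fisher))
print(thermodynamic_ngd_step(theta, grad, fisher))  # approximately matches the solve
```

On hardware, the relaxation above would run in physical time on an analog thermodynamic computer while the digital side supplies the gradient and curvature at intervals, which is what lets the per-iteration cost approach that of a first-order method.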