Thermodynamic Natural Gradient Descent
May 22, 2024
Authors: Kaelan Donatella, Samuel Duffield, Maxwell Aifer, Denis Melanson, Gavin Crooks, Patrick J. Coles
cs.AI
Abstract
Second-order training methods have better convergence properties than
gradient descent but are rarely used in practice for large-scale training due
to their computational overhead. This can be viewed as a hardware limitation
(imposed by digital computers). Here we show that natural gradient descent
(NGD), a second-order method, can have a similar computational complexity per
iteration to a first-order method, when employing appropriate hardware. We
present a new hybrid digital-analog algorithm for training neural networks that
is equivalent to NGD in a certain parameter regime but avoids prohibitively
costly linear system solves. Our algorithm exploits the thermodynamic
properties of an analog system at equilibrium, and hence requires an analog
thermodynamic computer. The training occurs in a hybrid digital-analog loop,
where the gradient and Fisher information matrix (or any other positive
semi-definite curvature matrix) are calculated at given time intervals while
the analog dynamics take place. We numerically demonstrate the superiority of
this approach over state-of-the-art digital first- and second-order training
methods on classification tasks and language model fine-tuning tasks.
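For reference, the "prohibitively costly linear system solves" mentioned above refer to the per-step solve in a conventional digital NGD update. The sketch below illustrates that baseline update, not the paper's hybrid digital-analog algorithm; the callables loss_grad and fisher_matrix, the learning rate, and the damping term are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def ngd_step(theta, loss_grad, fisher_matrix, lr=1e-2, damping=1e-3):
    """One conventional (digital) natural gradient descent step.

    theta         : current parameter vector, shape (d,)
    loss_grad     : callable returning the loss gradient at theta, shape (d,)
    fisher_matrix : callable returning a positive semi-definite curvature
                    matrix (e.g. the Fisher information) at theta, shape (d, d)
    """
    g = loss_grad(theta)      # first-order gradient
    F = fisher_matrix(theta)  # PSD curvature matrix
    # The expensive step on digital hardware: solving the d x d linear
    # system (F + damping * I) delta = g, roughly O(d^3) for a dense solve.
    delta = np.linalg.solve(F + damping * np.eye(theta.size), g)
    return theta - lr * delta
```

As the abstract describes, the proposed method offloads this linear solve to the equilibrium dynamics of an analog thermodynamic computer, with the digital side supplying the gradient and the curvature matrix at given time intervals while the analog dynamics run.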