Thermodynamic Natural Gradient Descent
May 22, 2024
Authors: Kaelan Donatella, Samuel Duffield, Maxwell Aifer, Denis Melanson, Gavin Crooks, Patrick J. Coles
cs.AI
Abstract
Second-order training methods have better convergence properties than
gradient descent but are rarely used in practice for large-scale training due
to their computational overhead. This can be viewed as a hardware limitation
(imposed by digital computers). Here we show that natural gradient descent
(NGD), a second-order method, can have a similar computational complexity per
iteration to a first-order method, when employing appropriate hardware. We
present a new hybrid digital-analog algorithm for training neural networks that
is equivalent to NGD in a certain parameter regime but avoids prohibitively
costly linear system solves. Our algorithm exploits the thermodynamic
properties of an analog system at equilibrium, and hence requires an analog
thermodynamic computer. The training occurs in a hybrid digital-analog loop,
where the gradient and Fisher information matrix (or any other positive
semi-definite curvature matrix) are calculated at given time intervals while
the analog dynamics take place. We numerically demonstrate the superiority of
this approach over state-of-the-art digital first- and second-order training
methods on classification tasks and language model fine-tuning tasks.
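For reference, the "prohibitively costly linear system solves" mentioned above refer to the per-step solve in a conventional digital NGD update. The sketch below illustrates that baseline update, not the paper's hybrid digital-analog algorithm; the callables loss_grad and fisher_matrix, the learning rate, and the damping term are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def ngd_step(theta, loss_grad, fisher_matrix, lr=1e-2, damping=1e-3):
    """One conventional (digital) natural gradient descent step.

    theta         : current parameter vector, shape (d,)
    loss_grad     : callable returning the loss gradient at theta, shape (d,)
    fisher_matrix : callable returning a positive semi-definite curvature
                    matrix (e.g. the Fisher information) at theta, shape (d, d)
    """
    g = loss_grad(theta)      # first-order gradient
    F = fisher_matrix(theta)  # PSD curvature matrix
    # The expensive step on digital hardware: solving the d x d linear
    # system (F + damping * I) delta = g, roughly O(d^3) for a dense solve.
    delta = np.linalg.solve(F + damping * np.eye(theta.size), g)
    return theta - lr * delta
```

As the abstract describes, the proposed method offloads this linear solve to the equilibrium dynamics of an analog thermodynamic computer, with the digital side supplying the gradient and the curvature matrix at given time intervals while the analog dynamics run.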