Muon Outperforms Adam in Tail-End Associative Memory Learning

Abstract

The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper explains that mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we show that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and the Feed-Forward Networks (FFNs), are the primary contributors to Muon's superiority. Motivated by this associative memory view, we then explain Muon's superiority on real-world corpora, which are intrinsically heavy-tailed: some classes (tail classes) appear far less frequently than others. Two key properties account for this superiority: (i) Muon's update rule consistently yields a more isotropic singular spectrum than Adam's; and, as a result, (ii) on heavy-tailed data, Muon optimizes tail classes more effectively than Adam. Beyond empirical evidence, we confirm these findings theoretically by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on the properties of the embeddings. In summary, our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.
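
The link between an isotropic singular spectrum and balanced learning of tail classes can be illustrated with a toy example. The sketch below is not the paper's experimental setup: it assumes orthonormal class embeddings, Zipf-like class frequencies, an exact SVD in place of the approximate orthogonalization Muon performs in practice, and an elementwise sign as a crude momentum-free stand-in for an Adam-style step. All names (`orthogonalize`, `per_class_step`, `d`, `K`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 8  # embedding dimension and number of classes (illustrative values)

# Heavy-tailed class frequencies (Zipf-like, an assumption of this sketch).
p = 1.0 / np.arange(1, K + 1)
p /= p.sum()

# Orthonormal input and output embeddings, one column per class (assumption).
U, _ = np.linalg.qr(rng.standard_normal((d, K)))
V, _ = np.linalg.qr(rng.standard_normal((d, K)))

# Frequency-weighted sum of outer products: G = sum_k p_k * v_k u_k^T.
# With orthonormal embeddings, the singular values of G are exactly p.
G = (V * p) @ U.T


def orthogonalize(M):
    """Return the polar factor of M, i.e. M with every singular value set to 1."""
    left, _, right_t = np.linalg.svd(M, full_matrices=False)
    return left @ right_t


def per_class_step(update):
    """Component of the update along each class direction: v_k^T (update) u_k."""
    return np.einsum("ik,ij,jk->k", V, update, U)


raw_step = per_class_step(G)                   # proportional to class frequency
orth_step = per_class_step(orthogonalize(G))   # identical for every class
sign_step = per_class_step(np.sign(G))         # sign-based stand-in, for comparison

print("class frequencies    :", np.round(p, 3))
print("raw gradient step    :", np.round(raw_step, 3))
print("orthogonalized step  :", np.round(orth_step, 3))
print("sign-based step      :", np.round(sign_step, 3))
```

Under these assumptions the raw gradient moves each class in proportion to its frequency, while the orthogonalized update moves head and tail classes by the same amount, which is the balanced behavior the abstract attributes to Muon. The sign-based row is printed only for comparison; how it behaves depends on the embeddings drawn.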
