A Local Perturbation Theory of Cross-Domain Interference and Recovery in Multi-Domain Reinforcement Learning

Abstract

Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering (QA), and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among the top-changed neurons, while different domains still share substantial active computation routes, and the update directions on those routes determine whether the domains act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after sequential Code → Math → QA → CW training recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
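As a rough illustration of the second-order damage and training-free rollback described above, consider a standard Taylor expansion: if a later-domain update δ is nearly orthogonal to the earlier domain's gradient g_A, the first-order change g_A·δ is close to zero, and the change in the earlier domain's loss is dominated by ½ δᵀ H_A δ. The sketch below is a toy under stated assumptions, not the paper's procedure: it assumes a diagonal curvature estimate for the earlier domain, and the function names, the per-coordinate proxy score h_i δ_i², and all numbers are hypothetical.

```python
def second_order_damage(delta, curvature):
    """Diagonal quadratic estimate of damage: 0.5 * sum_i h_i * delta_i^2."""
    return 0.5 * sum(h * d * d for h, d in zip(curvature, delta))


def conflict_coordinates(delta, curvature, k):
    """Indices of the k coordinates with the largest per-coordinate damage score."""
    scores = [h * d * d for h, d in zip(curvature, delta)]
    return sorted(range(len(delta)), key=lambda i: scores[i], reverse=True)[:k]


def rollback(theta_before, theta_after, coords):
    """Restore the selected coordinates to their earlier values; keep the rest."""
    restored = list(theta_after)
    for i in coords:
        restored[i] = theta_before[i]
    return restored


# Toy numbers (hypothetical): a sparse, small-magnitude later-domain update.
theta_before = [1.0, -0.5, 0.3, 0.8, -1.2, 0.0]
delta = [0.0, 0.02, 0.0, -0.05, 0.0, 0.01]
# Assumed diagonal curvature of the earlier domain's loss (a stand-in proxy).
curvature = [0.1, 4.0, 0.2, 10.0, 0.1, 0.5]
theta_after = [t + d for t, d in zip(theta_before, delta)]

# Pick a single proxy conflict coordinate and roll it back without any training.
coords = conflict_coordinates(delta, curvature, k=1)
theta_rolled = rollback(theta_before, theta_after, coords)

# Residual perturbation that the earlier domain still sees after rollback.
residual = [a - b for a, b in zip(theta_rolled, theta_before)]
damage_before = second_order_damage(delta, curvature)
damage_after = second_order_damage(residual, curvature)
print(coords, damage_after < damage_before)
```

In this toy, reverting one high-curvature coordinate removes most of the estimated damage while leaving the later domain's other edits in place, which mirrors the abstract's claim that damage is localized. Whether a diagonal proxy is adequate for a real model is an open assumption here.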