A Local Perturbation Theory of Cross-Domain Interference and Recovery in Multi-Domain RL

Abstract

Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code → Math → QA → CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
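
The two mechanisms named in the abstract, second-order damage and training-free rollback on a sparse coordinate set, can be illustrated with a toy quadratic model. The sketch below is not the paper's procedure: the 6-parameter setup, the diagonal Hessian, the update vector `delta`, and the rule of picking the top-2 coordinates by per-coordinate damage are all hypothetical choices made for illustration. It shows a later-domain update whose first-order effect on the earlier domain is zero, yet which still raises the earlier domain's loss through the quadratic term, and a rollback of only the conflicting coordinates that removes that damage while keeping the rest of the update.

```python
import numpy as np


def earlier_domain_loss(theta, theta_star, hessian):
    """Quadratic (second-order) model of the earlier domain's loss near its optimum."""
    d = theta - theta_star
    return 0.5 * d @ hessian @ d


def rollback(theta_new, theta_old, coords):
    """Reset only the selected coordinates to their pre-update values."""
    restored = theta_new.copy()
    restored[coords] = theta_old[coords]
    return restored


# Hypothetical toy setup: curvature for the earlier domain lives on
# two coordinates only (a stand-in for a low-dimensional conflict subspace).
theta_star = np.zeros(6)                      # earlier-domain optimum
hessian = np.diag([4.0, 3.0, 0.0, 0.0, 0.0, 0.0])
theta_old = theta_star.copy()                 # model after earlier-domain training
delta = np.array([0.5, -0.4, 0.3, 0.2, -0.1, 0.6])  # later-domain update
theta_new = theta_old + delta

# First-order term: the earlier-domain gradient at theta_old is zero,
# so the update causes no first-order change in the earlier-domain loss.
grad_old = hessian @ (theta_old - theta_star)
first_order = grad_old @ delta

# Second-order term: the damage comes entirely from 0.5 * delta^T H delta.
loss_after = earlier_domain_loss(theta_new, theta_star, hessian)

# Per-coordinate damage; this decomposition is exact only because the
# toy Hessian is diagonal. Select the top-2 coordinates as the conflict set.
per_coord_damage = 0.5 * np.diag(hessian) * delta**2
conflict_coords = np.sort(np.argsort(-per_coord_damage)[:2])

# Training-free rollback on the conflict set only.
theta_rolled_back = rollback(theta_new, theta_old, conflict_coords)
loss_rolled_back = earlier_domain_loss(theta_rolled_back, theta_star, hessian)

print("first-order term:", first_order)
print("loss after later-domain update:", loss_after)
print("conflict coordinates:", conflict_coords.tolist())
print("loss after rollback:", loss_rolled_back)
```

In this toy the rollback removes the damage completely because the curvature is confined to the selected coordinates; with a real model the Hessian is not diagonal and the coordinate set is only a proxy, which is consistent with the partial recovery reported above.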