A Local Perturbation Theory of Cross-Domain Interference and Recovery in Multi-Domain Reinforcement Learning

Abstract

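Reinforcement learning (RL) post-training can improve large language model (LLM) performance in individual domains such as mathematical reasoning, code generation, question answering (QA), and creative writing (CW), but training on one domain often degrades performance in the others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: significant interference can occur even when full-model gradients are approximately orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits, and that the most-changed neurons overlap very little across domains; nevertheless, different domains still share many active computation routes, and the update directions on these routes determine whether the domains cooperate or conflict. Building on this observation, we prove under a local perturbation model of multi-domain RL that training on a later domain harms earlier domains mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a brief domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, after sequential training on Code → Math → QA → CW, a brief refresh on the Math domain recovers its score from 57.66 to 66.04 while largely preserving performance on the other domains, for an average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math performance, directly providing proxy-level evidence of localized damage. These results offer a localized, mechanistic account of interference and recovery in multi-domain RL.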

English

Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code → Math → QA → CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
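The second-order damage term and the training-free rollback can be illustrated with a toy sketch. This is not the paper's procedure: it assumes the earlier domain sits at a stationary point of its own loss, so a later update delta raises that loss by roughly 0.5 * delta^T H delta, and it further assumes a diagonal curvature H. The function names, the curvature and update values, and the choice of ranking coordinates by damage contribution are all hypothetical.

```python
# Toy illustration only; every name and number here is an assumption.

def quadratic_damage(delta, curvature):
    """Second-order damage 0.5 * sum_i h_i * delta_i^2 for a diagonal Hessian."""
    return 0.5 * sum(h * d * d for h, d in zip(curvature, delta))

def top_conflict_coords(delta, curvature, k):
    """Indices of the k coordinates that contribute most to the damage."""
    contrib = [0.5 * h * d * d for h, d in zip(curvature, delta)]
    return sorted(range(len(delta)), key=lambda i: contrib[i], reverse=True)[:k]

def rollback(delta, coords):
    """Undo the later-domain update on the chosen coordinates, without training."""
    chosen = set(coords)
    return [0.0 if i in chosen else d for i, d in enumerate(delta)]

# Hypothetical earlier-domain curvature (zero means the domain ignores that weight).
curvature = [4.0, 0.1, 3.0, 0.0, 0.2, 0.0]
# Hypothetical update produced by training on a later domain.
delta = [0.5, 0.4, -0.6, 0.9, 0.0, -0.3]

before = quadratic_damage(delta, curvature)
coords = top_conflict_coords(delta, curvature, k=2)
after = quadratic_damage(rollback(delta, coords), curvature)

print(f"damage before rollback: {before:.3f}")
print(f"rolled-back coordinates: {coords}")
print(f"damage after rollback: {after:.3f}")
```

In this toy setting almost all of the damage sits on two coordinates, so rolling them back removes nearly all of it while leaving the rest of the later-domain update in place. The abstract reports only partial recovery from its proxy rollback, which is what one would expect when the coordinate set is estimated rather than known exactly.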