A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

Abstract

Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering (QA), and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes, and the update directions on those routes determine whether the domains act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code → Math → QA → CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
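The mechanism described in the abstract can be illustrated with a toy quadratic model. The sketch below is not the paper's method or experimental setup: it assumes the earlier domain sits near a stationary point (so the first-order term vanishes), that its local curvature is a diagonal matrix supported on a few coordinates, and that every number and function name (`damage`, `refresh`, `h_a`, `drift`) is hypothetical. Under those assumptions, the damage from later-domain drift is confined to the coordinates both domains touch, a few gradient steps on the earlier domain shrink the drift only there, and zeroing the drift on those coordinates removes the damage without any training.

```python
# Toy illustration only: a diagonal quadratic stand-in for the earlier domain's
# loss near its trained optimum. All values and names are assumptions.

def damage(h, delta):
    """Second-order damage 0.5 * delta^T H delta for a diagonal Hessian H."""
    return 0.5 * sum(hi * di * di for hi, di in zip(h, delta))

def refresh(h, delta, lr, steps):
    """Run `steps` gradient-descent steps on the earlier domain's quadratic loss."""
    for _ in range(steps):
        delta = [di - lr * hi * di for hi, di in zip(h, delta)]
    return delta

# Curvature of the earlier domain: nonzero only on the coordinates it relies on.
h_a = [4.0, 2.0, 0.0, 0.0, 0.0, 0.0]
# Parameter drift left by later-domain training: it overlaps the earlier
# domain on coordinate 1 and otherwise moves coordinates the earlier domain ignores.
drift = [0.0, 0.5, 0.3, -0.4, 0.0, 0.2]

# Shared conflict coordinates: curvature and drift are both nonzero.
shared = [i for i, (hi, di) in enumerate(zip(h_a, drift)) if hi != 0.0 and di != 0.0]

before = damage(h_a, drift)
refreshed = refresh(h_a, drift, lr=0.25, steps=3)
after = damage(h_a, refreshed)

# Training-free rollback: undo the drift on the shared coordinates only.
rolled_back = [0.0 if i in shared else di for i, di in enumerate(drift)]

print("shared coordinates:", shared)
print("damage before refresh:", before)
print("damage after refresh:", after)
print("damage after rollback:", damage(h_a, rolled_back))
```

In this toy setting the refresh leaves the later domain's private coordinates (indices 2 to 5) unchanged, which mirrors the limited collateral damage the abstract reports. Real models have non-diagonal curvature, so the shared subspace would need to be estimated rather than read off directly.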