Overcoming Catastrophic Forgetting in Visual Continual Learning via Reinforcement Fine-Tuning

Abstract

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open question. Through a pilot study, we find that although RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts that achieve identical task rewards, the KL divergence from the preceding-task policy varies substantially, and this variation correlates strongly with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. RaPO comprises two core components: (1) a Retention Reward, which converts trajectory-level distribution drift into a continuous reward signal and thereby preferentially reinforces knowledge-preserving rollouts within each group; and (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize optimization during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work is the first systematic exploration of RFT in visual continual learning, and we hope its insights will inspire future research.
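
To make the two components more concrete, the following Python sketch illustrates one way a Retention Reward and CTAN could fit together. It is not the authors' implementation: the abstract gives no formulas, so the drift estimator (mean token log-ratio against the previous-task policy), the mapping exp(-beta * drift), the mixing weight lam, the EMA momentum, and all function and class names are hypothetical choices made only for illustration.

```python
import math
from dataclasses import dataclass


def trajectory_drift(logp_current, logp_previous):
    """Per-trajectory drift: mean token log-ratio between the current policy
    and the frozen previous-task policy (a sample-based KL estimate).
    ASSUMPTION: the abstract does not specify the estimator."""
    diffs = [c - p for c, p in zip(logp_current, logp_previous)]
    return sum(diffs) / len(diffs)


def retention_reward(drift, beta=1.0):
    """Map drift to a continuous reward in (0, 1]; lower drift, higher reward.
    ASSUMPTION: exp(-beta * drift) is a hypothetical mapping."""
    return math.exp(-beta * max(drift, 0.0))


def shaped_rewards(task_rewards, drifts, lam=0.5, beta=1.0):
    """Add the retention term to each rollout's task reward.
    ASSUMPTION: additive shaping with weight lam."""
    return [r + lam * retention_reward(d, beta)
            for r, d in zip(task_rewards, drifts)]


@dataclass
class RunningRewardStats:
    """EMA of reward mean and variance, kept across task boundaries
    (never reset when a new task starts)."""
    momentum: float = 0.9
    mean: float = 0.0
    var: float = 1.0
    initialized: bool = False

    def update(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if not self.initialized:
            # First batch seeds the statistics directly.
            self.mean, self.var, self.initialized = batch_mean, batch_var, True
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * batch_mean
            self.var = m * self.var + (1 - m) * batch_var

    def advantages(self, rewards, eps=1e-6):
        """Normalize rewards with the persistent statistics instead of
        per-group statistics alone."""
        return [(r - self.mean) / math.sqrt(self.var + eps) for r in rewards]


# Two rollouts with identical task reward but different drift.
rewards = shaped_rewards(task_rewards=[1.0, 1.0], drifts=[0.1, 2.0])
stats = RunningRewardStats()
stats.update(rewards)
advantages = stats.advantages(rewards)
```

In this sketch, the two rollouts are indistinguishable under the task reward alone, yet the low-drift rollout receives the larger shaped reward and therefore the larger advantage, which is the behavior the Retention Reward is described as encouraging. Reusing the same `RunningRewardStats` object across tasks stands in for the persistent statistics that CTAN is described as maintaining.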