Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Abstract

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT methods such as GRPO can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that although RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts that achieve identical task rewards, the KL divergence from the preceding-task policy varies substantially, and this variation correlates strongly with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) a Retention Reward, which converts trajectory-level distribution drift into a continuous reward signal and preferentially reinforces knowledge-preserving rollouts within each group; and (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize optimization during continual learning. Leveraging the free-form textual generalization of multimodal large language models (MLLMs), we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work is the first systematic exploration of RFT in visual continual learning, and we hope its insights will inform future research.
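
The two components described above can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: the per-trajectory drift is estimated as the mean log-probability gap between the current and preceding-task policies, the drift is mapped to a reward with an exponential, and the names `trajectory_drift`, `retention_reward`, `shaped_rewards`, and `EmaNormalizer`, along with the parameters `beta`, `weight`, and `momentum`, are hypothetical. It should not be read as RaPO's implementation.

```python
import math
from dataclasses import dataclass


def trajectory_drift(logp_current, logp_previous):
    """Estimate a rollout's drift from the preceding-task policy.

    Assumption: both inputs are per-token log-probabilities of the same
    sampled tokens, so the mean gap is a sample estimate of the KL term.
    """
    gaps = [c - p for c, p in zip(logp_current, logp_previous)]
    return sum(gaps) / len(gaps)


def retention_reward(drift, beta=1.0):
    """Map drift to a continuous reward in (0, 1]; lower drift scores higher.

    The exponential form and `beta` are illustrative choices.
    """
    return math.exp(-beta * max(drift, 0.0))


def shaped_rewards(task_rewards, drifts, weight=0.5, beta=1.0):
    """Add the retention term to each rollout's task reward (assumed additive)."""
    return [r + weight * retention_reward(d, beta)
            for r, d in zip(task_rewards, drifts)]


@dataclass
class EmaNormalizer:
    """Running reward statistics that are NOT reset at task boundaries."""
    momentum: float = 0.9
    mean: float = 0.0
    var: float = 1.0
    initialized: bool = False

    def update(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if not self.initialized:
            # First group seeds the statistics directly.
            self.mean, self.var, self.initialized = batch_mean, batch_var, True
        else:
            m = self.momentum
            self.mean = m * self.mean + (1.0 - m) * batch_mean
            self.var = m * self.var + (1.0 - m) * batch_var

    def advantages(self, rewards, eps=1e-6):
        std = math.sqrt(self.var)
        return [(r - self.mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Three rollouts with identical task reward but different drift.
    rewards = shaped_rewards([1.0, 1.0, 1.0], drifts=[0.05, 0.5, 2.0])
    normalizer = EmaNormalizer(momentum=0.9)
    normalizer.update(rewards)
    print(normalizer.advantages(rewards))
```

In this sketch, rollouts with identical task rewards receive different advantages depending on their drift, which is the distinction that plain group-relative normalization on task reward alone would miss. Reusing one `EmaNormalizer` instance across tasks is what carries the reward statistics over a task boundary.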