Overcoming Catastrophic Forgetting in Visual Continual Learning via Reinforcement Fine-Tuning

Abstract

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that although RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts that achieve identical task rewards, the KL divergence from the preceding-task policy varies substantially, and this variation correlates strongly with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) a Retention Reward, which converts trajectory-level distribution drift into a continuous reward signal and preferentially reinforces knowledge-preserving rollouts within each group; and (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize optimization during continual learning. Leveraging the free-form textual generalization of multimodal large language models (MLLMs), we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work is the first systematic exploration of RFT in visual continual learning, and we hope its insights will inspire future research.
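
The two components named above can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything below is an assumption made for illustration rather than the authors' implementation: the drift of a rollout is estimated as the mean per-token log-probability gap between the current policy and the preceding-task policy, the mapping exp(-drift / tau) and the additive weight `weight` are hypothetical shaping choices, and the names `trajectory_drift`, `retention_reward`, `shaped_rewards`, and `EmaAdvantageNormalizer` are invented here.

```python
import math
from statistics import fmean, pvariance


def trajectory_drift(logp_current, logp_previous):
    """Sampled estimate of KL(current || previous) for one rollout.

    Both arguments hold per-token log-probabilities of the tokens that the
    current policy sampled (assumed estimator, not taken from the source).
    """
    return fmean(c - p for c, p in zip(logp_current, logp_previous))


def retention_reward(drift, tau=1.0):
    """Map drift into (0, 1]; exp(-drift / tau) is an assumed shaping function."""
    return math.exp(-max(drift, 0.0) / tau)


def shaped_rewards(task_rewards, drifts, weight=0.5, tau=1.0):
    """Add the weighted retention term to each rollout's task reward."""
    return [t + weight * retention_reward(d, tau)
            for t, d in zip(task_rewards, drifts)]


class EmaAdvantageNormalizer:
    """Running reward statistics that are not reset at task boundaries."""

    def __init__(self, momentum=0.9, eps=1e-6):
        self.momentum = momentum
        self.eps = eps
        self.mean = None
        self.var = None

    def update(self, rewards):
        batch_mean, batch_var = fmean(rewards), pvariance(rewards)
        if self.mean is None:
            # First group ever seen: initialise the statistics from it.
            self.mean, self.var = batch_mean, batch_var
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * batch_mean
            self.var = m * self.var + (1 - m) * batch_var

    def advantages(self, rewards):
        std = math.sqrt(self.var + self.eps)
        return [(r - self.mean) / std for r in rewards]


# Two rollouts with the same task reward but different drift.
low = trajectory_drift([-0.5, -0.4], [-0.6, -0.5])
high = trajectory_drift([-0.2, -0.1], [-1.2, -1.1])
rewards = shaped_rewards([1.0, 1.0], [low, high])

normalizer = EmaAdvantageNormalizer()
normalizer.update(rewards)
adv = normalizer.advantages(rewards)
```

In this toy group the task reward alone cannot separate the two rollouts, while the shaped reward ranks the low-drift rollout higher and gives it the positive advantage. Keeping one `normalizer` instance alive across tasks is what makes the statistics persistent in this sketch; a per-group normalization, as commonly used in GRPO, would recompute them from scratch for every group.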