Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Abstract

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open question. Through a pilot study, we confirm that although RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts that achieve identical task rewards, the KL divergence from the preceding-task policy varies substantially, and this drift correlates strongly with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) a Retention Reward, which converts trajectory-level distribution drift into a continuous reward signal and preferentially reinforces knowledge-preserving rollouts within each group; and (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize optimization during continual learning. Leveraging the free-form text generation capability of multimodal large language models (MLLMs), we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, and we hope its insights will inspire future research.
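
The two components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so every concrete choice here is an assumption made for illustration. Specifically, drift is estimated with the non-negative per-token estimator `exp(d) - 1 - d` (where `d` is the previous-task log-probability minus the current one), drift is mapped to a reward with `exp(-drift / temperature)`, the retention reward is added to the task reward with a fixed `weight`, and the normalizer tracks an exponential moving average of the reward mean and variance. All function, class, and parameter names are hypothetical.

```python
import math
from dataclasses import dataclass


def trajectory_drift(logp_current, logp_previous):
    """Mean per-token KL estimate between current and previous-task policy.

    Assumption: uses the non-negative estimator exp(d) - 1 - d with
    d = logp_previous - logp_current, evaluated on the sampled tokens.
    """
    total = 0.0
    for lp_cur, lp_prev in zip(logp_current, logp_previous):
        log_ratio = lp_prev - lp_cur
        total += math.exp(log_ratio) - 1.0 - log_ratio  # always >= 0
    return total / len(logp_current)


def retention_reward(drift, temperature=1.0):
    """Map drift in [0, inf) to a reward in (0, 1]; zero drift gives 1.0.

    Assumption: the exponential mapping is an illustrative choice.
    """
    return math.exp(-drift / temperature)


def shaped_rewards(task_rewards, logps_current, logps_previous, weight=0.5):
    """Add a weighted retention reward to each rollout's task reward."""
    return [
        r + weight * retention_reward(trajectory_drift(cur, prev))
        for r, cur, prev in zip(task_rewards, logps_current, logps_previous)
    ]


@dataclass
class EmaAdvantageNormalizer:
    """Hypothetical normalizer whose statistics persist across tasks."""

    momentum: float = 0.9
    eps: float = 1e-6
    mean: float = 0.0
    var: float = 1.0
    initialized: bool = False

    def update(self, rewards):
        """Blend this group's reward mean and variance into the running EMA."""
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if not self.initialized:
            # The first group seeds the statistics directly.
            self.mean, self.var, self.initialized = batch_mean, batch_var, True
        else:
            m = self.momentum
            self.mean = m * self.mean + (1.0 - m) * batch_mean
            self.var = m * self.var + (1.0 - m) * batch_var

    def advantages(self, rewards):
        """Normalize rewards with the running (not per-group) statistics."""
        std = math.sqrt(self.var) + self.eps
        return [(r - self.mean) / std for r in rewards]
```

In this sketch, two rollouts with the same task reward receive different shaped rewards when one of them has drifted further from the previous-task policy, which is the behavior the Retention Reward is described as targeting. Because the normalizer is not reset when a new task begins, advantages for the new task are scaled against statistics accumulated on earlier tasks.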