Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning
December 7, 2025
Authors: Tingyu Li, Zheng Sun, Jingxuan Wei, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
cs.AI
Abstract
Recent vision-language models (VLMs) achieve remarkable reasoning ability through reinforcement learning (RL), offering a feasible path toward continuously self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, which is especially scarce in specialized domains such as chemistry, earth sciences, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distributions and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to learn from context before solving problems, refocusing on the problem-context scenarios that synthetic-data methods overlook. First, by decoupling the learning process into two components (a Thinker and a Solver), we quantify the reward signals of this process and propose a two-stage RL post-training approach that progresses from freely exploring context to actually solving tasks. Second, to increase training-data diversity, DoGe constructs an evolving curriculum-learning pipeline: an expanded native-domain knowledge corpus and an iteratively evolving seed-problem pool. Experiments show that our method consistently outperforms baselines across a range of benchmarks, providing a scalable pathway toward self-evolving LVLMs.
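To make the decoupled reward idea concrete, here is a minimal sketch of how staged reward signals for a Thinker/Solver split might be quantified. This is an illustrative assumption, not the paper's actual implementation: the `Sample` fields, the keyword-coverage proxy for the Thinker's context reward, and the hard two-stage switch are all hypothetical stand-ins (a real system would use learned or verifier-based scores).

```python
from dataclasses import dataclass

@dataclass
class Sample:
    context_trace: str  # Thinker's free-form exploration of the problem context
    answer: str         # Solver's final answer
    gold: str           # reference answer

def thinker_reward(sample: Sample, keywords: list[str]) -> float:
    """Stage 1: context-exploration reward. Here a simple proxy: the
    fraction of domain keywords mentioned in the Thinker's trace."""
    trace = sample.context_trace.lower()
    hits = sum(1 for k in keywords if k.lower() in trace)
    return hits / max(len(keywords), 1)

def solver_reward(sample: Sample) -> float:
    """Stage 2: verifiable outcome reward — 1.0 iff the answer matches."""
    return 1.0 if sample.answer.strip() == sample.gold.strip() else 0.0

def staged_reward(sample: Sample, keywords: list[str], stage: int) -> float:
    """Two-stage schedule: reward free context exploration first (stage 1),
    then reward actual task solving (stage 2)."""
    return thinker_reward(sample, keywords) if stage == 1 else solver_reward(sample)
```

In an RL post-training loop, `staged_reward` would score each rollout, with `stage` flipped from 1 to 2 once context exploration stabilizes.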