
Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

December 7, 2025
Authors: Tingyu Li, Zheng Sun, Jingxuan Wei, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
cs.AI

Abstract

Recent vision-language models (VLMs) achieve remarkable reasoning capabilities through reinforcement learning (RL), offering a feasible path toward continuously self-evolving large vision-language models (LVLMs) in the era of experience. However, RL for VLMs requires abundant high-quality multimodal data, which is especially hard to obtain in specialized domains such as chemistry, earth science, and multimodal mathematics. Existing strategies such as synthetic data and self-rewarding mechanisms suffer from limited distribution coverage and alignment difficulties, ultimately causing reward hacking: models exploit high-reward patterns, collapsing policy entropy and destabilizing training. We propose DoGe (Decouple to Generalize), a dual-decoupling framework that guides models to learn first from context rather than from problem solving, refocusing on the problem-context scenarios that synthetic-data methods overlook. First, by decoupling the learning process into two components (a Thinker and a Solver), DoGe quantifies reward signals for each and applies a two-stage RL post-training procedure that moves from free context exploration to practical task solving. Second, to increase the diversity of training data, DoGe builds an evolving curriculum-learning pipeline: an expanded native domain knowledge corpus and an iteratively evolving pool of seed problems. Experiments show that our method consistently outperforms the baseline across a range of benchmarks, providing a scalable pathway toward self-evolving LVLMs.
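
To make the dual decoupling concrete, here is a minimal Python sketch of a Thinker/Solver rollout with a two-stage reward. All names (Rollout, context_reward, answer_reward, two_stage_reward) and the token-overlap reward are illustrative assumptions; the abstract does not specify the authors' actual reward design.

```python
# Minimal sketch of DoGe-style dual decoupling with a two-stage reward.
# All names and the overlap-based reward are hypothetical, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class Rollout:
    context_trace: str  # Thinker output: free-form exploration of the problem context
    answer: str         # Solver output: the final task answer


def context_reward(trace: str, reference_context: str) -> float:
    """Stage-1 signal: token-overlap proxy for how well the Thinker's trace
    covers the reference context (a crude stand-in for the paper's metric)."""
    trace_tokens = set(trace.lower().split())
    ref_tokens = set(reference_context.lower().split())
    return len(trace_tokens & ref_tokens) / max(len(ref_tokens), 1)


def answer_reward(answer: str, gold: str) -> float:
    """Stage-2 signal: exact-match correctness of the Solver's answer."""
    return 1.0 if answer.strip() == gold.strip() else 0.0


def two_stage_reward(rollout: Rollout, reference_context: str, gold: str, stage: int) -> float:
    """Stage 1 rewards only context exploration; stage 2 shifts the objective
    to actual task solving, mirroring the two-stage RL post-training."""
    if stage == 1:
        return context_reward(rollout.context_trace, reference_context)
    return answer_reward(rollout.answer, gold)
```

Likewise, the evolving curriculum can be pictured as a loop that grafts domain-corpus knowledge onto a seed problem pool and re-ranks the result. The function below is a sketch under that assumption; mutate and difficulty are caller-supplied stand-ins for the paper's unspecified problem-evolution and scoring steps.

```python
import random


def evolve_curriculum(seed_pool, corpus, mutate, difficulty, pool_size=128, rng=None):
    """One hypothetical curriculum iteration: graft a corpus snippet onto each
    seed problem, then keep the pool_size variants ranked by a difficulty score."""
    rng = rng or random.Random(0)
    candidates = list(seed_pool)
    for problem in seed_pool:
        snippet = rng.choice(corpus)          # draw domain knowledge from the corpus
        candidates.append(mutate(problem, snippet))  # evolve the seed problem with it
    candidates.sort(key=difficulty, reverse=True)    # rank by the supplied score
    return candidates[:pool_size]
```

Iterating this loop alongside the two-stage RL training would approximate the self-evolving pipeline the abstract describes.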