PyVision-RL: Forging Open Agentic Vision Models via RL

February 24, 2026
Authors: Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei
cs.AI

Abstract

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
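The abstract's core training idea, oversampling-filtering-ranking rollouts combined with an accumulative tool reward, can be sketched as below. This is a minimal illustration under assumed details: the function names, the degenerate-rollout filter, and the per-call bonus with a cap are all hypothetical choices, not the paper's actual implementation.

```python
def accumulative_tool_reward(trajectory, per_call_bonus=0.1, cap=0.5):
    """Reward accumulates with each tool call (up to a cap), so the
    policy is not pushed toward zero-interaction rollouts.
    Weights here are illustrative assumptions."""
    correctness = 1.0 if trajectory["answer_correct"] else 0.0
    tool_bonus = min(per_call_bonus * trajectory["tool_calls"], cap)
    return correctness + tool_bonus

def select_rollouts(candidates, group_size=4):
    """Oversample-filter-rank: generate more rollouts than needed,
    drop degenerate ones (no tool use and a wrong answer), then keep
    the top-scoring group for the policy update."""
    filtered = [t for t in candidates
                if t["tool_calls"] > 0 or t["answer_correct"]]
    if not filtered:  # fall back rather than return an empty group
        filtered = candidates
    ranked = sorted(filtered, key=accumulative_tool_reward, reverse=True)
    return ranked[:group_size]
```

The filter discards rollouts that neither used tools nor answered correctly, and the ranking then favors trajectories that are both correct and interactive, which is one plausible way to counteract interaction collapse.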