PyVision-RL: Forging Open Agentic Vision Models via RL

February 24, 2026
Authors: Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei
cs.AI

Abstract

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
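The abstract's core training idea, oversampling-filtering-ranking rollouts combined with an accumulative tool reward, can be sketched as below. This is a minimal illustration under assumed details: the function names, the degenerate-rollout filter, and the per-call bonus with a cap are all hypothetical choices, not the paper's actual implementation.

```python
def accumulative_tool_reward(trajectory, per_call_bonus=0.1, cap=0.5):
    """Reward accumulates with each tool call (up to a cap), so the
    policy is not pushed toward zero-interaction rollouts.
    Weights here are illustrative assumptions."""
    correctness = 1.0 if trajectory["answer_correct"] else 0.0
    tool_bonus = min(per_call_bonus * trajectory["tool_calls"], cap)
    return correctness + tool_bonus

def select_rollouts(candidates, group_size=4):
    """Oversample-filter-rank: generate more rollouts than needed,
    drop degenerate ones (no tool use and a wrong answer), then keep
    the top-scoring group for the policy update."""
    filtered = [t for t in candidates
                if t["tool_calls"] > 0 or t["answer_correct"]]
    if not filtered:  # fall back rather than return an empty group
        filtered = candidates
    ranked = sorted(filtered, key=accumulative_tool_reward, reverse=True)
    return ranked[:group_size]
```

The filter discards rollouts that neither used tools nor answered correctly, and the ranking then favors trajectories that are both correct and interactive, which is one plausible way to counteract interaction collapse.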