PyVision-RL: Forging Open Agentic Vision Models via RL
February 24, 2026
Authors: Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei
cs.AI
Abstract
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
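The abstract names two mechanisms without showing their mechanics: an oversampling-filtering-ranking rollout strategy and an accumulative tool reward. The sketch below illustrates one plausible reading of how they could fit together; all function names, the reward constants, and the rollout fields (`tool_calls`, `correct`) are assumptions for illustration, not the paper's actual implementation.

```python
def accumulative_tool_reward(tool_calls, correct, base=1.0, bonus=0.1, cap=0.5):
    """Hypothetical accumulative tool reward: the binary task reward plus a
    small per-tool-call bonus, capped so that tool use is encouraged without
    being gameable by spamming calls."""
    task_reward = base if correct else 0.0
    return task_reward + min(bonus * tool_calls, cap)

def select_rollouts(rollouts, k):
    """Oversample-filter-rank sketch: from an oversampled pool of rollouts,
    filter out degenerate trajectories (zero tool calls, i.e. interaction
    collapse), rank the survivors by reward, and keep the top-k for the
    policy update."""
    scored = [(accumulative_tool_reward(r["tool_calls"], r["correct"]), r)
              for r in rollouts]
    filtered = [(s, r) for s, r in scored if r["tool_calls"] > 0]
    filtered.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in filtered[:k]]
```

Under this reading, filtering removes the collapsed (tool-free) rollouts before they can dominate the gradient signal, while the capped bonus keeps multi-turn tool use attractive relative to single-shot answers.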