PyVision-RL: Forging Open Agentic Vision Models via RL
February 24, 2026
Authors: Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, Chen Wei
cs.AI
Abstract
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
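The abstract names two mechanisms without showing their mechanics: an oversampling-filtering-ranking rollout strategy and an accumulative tool reward. The sketch below illustrates one plausible reading of how they could fit together; all function names, the reward constants, and the rollout fields (`tool_calls`, `correct`) are assumptions for illustration, not the paper's actual implementation.

```python
def accumulative_tool_reward(tool_calls, correct, base=1.0, bonus=0.1, cap=0.5):
    """Hypothetical accumulative tool reward: the binary task reward plus a
    small per-tool-call bonus, capped so that tool use is encouraged without
    being gameable by spamming calls."""
    task_reward = base if correct else 0.0
    return task_reward + min(bonus * tool_calls, cap)

def select_rollouts(rollouts, k):
    """Oversample-filter-rank sketch: from an oversampled pool of rollouts,
    filter out degenerate trajectories (zero tool calls, i.e. interaction
    collapse), rank the survivors by reward, and keep the top-k for the
    policy update."""
    scored = [(accumulative_tool_reward(r["tool_calls"], r["correct"]), r)
              for r in rollouts]
    filtered = [(s, r) for s, r in scored if r["tool_calls"] > 0]
    filtered.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in filtered[:k]]
```

Under this reading, filtering removes the collapsed (tool-free) rollouts before they can dominate the gradient signal, while the capped bonus keeps multi-turn tool use attractive relative to single-shot answers.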