One RL to See Them All: Visual Triple Unified Reinforcement Learning
May 23, 2025
作者: Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan
cs.AI
Abstract
Reinforcement learning (RL) has significantly advanced the reasoning
capabilities of vision-language models (VLMs). However, the use of RL beyond
reasoning tasks remains largely unexplored, especially for perception-intensive
tasks like object detection and grounding. We propose V-Triune, a Visual Triple
Unified Reinforcement Learning system that enables VLMs to jointly learn visual
reasoning and perception tasks within a single training pipeline. V-Triune
comprises three complementary components: Sample-Level Data Formatting (to
unify diverse task inputs), Verifier-Level Reward Computation (to deliver
custom rewards via specialized verifiers), and Source-Level Metric Monitoring
(to diagnose problems at the data-source level). We further introduce a novel
Dynamic IoU reward, which provides adaptive, progressive, and definite feedback
for perception tasks handled by V-Triune. Our approach is instantiated within
an off-the-shelf RL training framework using open-source 7B and 32B backbone
models. The resulting model, dubbed Orsta (One RL to See Them All),
demonstrates consistent improvements across both reasoning and perception
tasks. This broad capability is significantly shaped by its training on a
diverse dataset, constructed around four representative visual reasoning tasks
(Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding,
Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains
on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1
across its various 7B and 32B model variants, with performance benefits
extending to a wide range of downstream tasks. These results highlight the
effectiveness and scalability of our unified RL approach for VLMs. The V-Triune
system, along with the Orsta models, is publicly available at
https://github.com/MiniMax-AI.
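To make the Dynamic IoU reward described in the abstract concrete, below is a minimal Python sketch of one way such a reward could be computed for detection or grounding outputs: the IoU is credited only once it clears a threshold that tightens as training progresses. The threshold schedule (0.5 to 0.95), function names, and box format are illustrative assumptions, not the exact scheme used by V-Triune.

```python
# Minimal sketch of a dynamic-IoU-style reward for box-prediction tasks.
# The schedule and helper names are assumptions for clarity, not the
# authors' exact implementation.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dynamic_iou_reward(pred_box, gt_box, step, total_steps,
                       start_thresh=0.5, end_thresh=0.95):
    """Return the IoU as reward only if it clears a threshold that tightens
    linearly over training, so early feedback is easy to earn and later
    feedback demands precise localization; below the threshold the reward
    is a definite zero."""
    progress = min(step / max(total_steps, 1), 1.0)
    thresh = start_thresh + progress * (end_thresh - start_thresh)
    score = iou(pred_box, gt_box)
    return score if score >= thresh else 0.0


# Example: a moderately accurate box earns reward early in training,
# but the same prediction may be rejected near the end of training.
print(dynamic_iou_reward((10, 10, 50, 50), (12, 8, 48, 52), step=100, total_steps=1000))
print(dynamic_iou_reward((10, 10, 50, 50), (12, 8, 48, 52), step=950, total_steps=1000))
```

The progressive threshold is one plausible reading of the "adaptive, progressive, and definite" feedback the abstract describes; consult the released V-Triune code for the actual reward configuration.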