
One RL to See Them All: Visual Triple Unified Reinforcement Learning

May 23, 2025
作者: Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan
cs.AI

Abstract

Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, and these performance benefits extend to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
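
The abstract does not specify how the Dynamic IoU reward is scheduled, so the following Python sketch only illustrates the general idea: a box-matching reward whose IoU threshold tightens as training progresses. The function names, the linear schedule, and the start/end thresholds (0.5 and 0.95) are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a dynamic IoU-style reward for box-grounding tasks.
# The threshold schedule below is an illustrative assumption, not the
# schedule used by V-Triune.

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dynamic_iou_reward(pred_box, gt_box, step, total_steps,
                       start_thresh=0.5, end_thresh=0.95):
    """Binary reward whose IoU threshold tightens linearly over training.

    Early on, loose matches are rewarded (progressive feedback); later
    only near-exact boxes score, giving a definite 0/1 signal.
    """
    progress = min(1.0, step / max(1, total_steps))
    thresh = start_thresh + (end_thresh - start_thresh) * progress
    return 1.0 if iou(pred_box, gt_box) >= thresh else 0.0


# Example: the same prediction (IoU ~ 0.87) passes the loose early
# threshold but fails the strict late one.
pred, gt = (10, 10, 90, 90), (12, 8, 95, 92)
print(dynamic_iou_reward(pred, gt, step=100, total_steps=10_000))    # 1.0
print(dynamic_iou_reward(pred, gt, step=9_900, total_steps=10_000))  # 0.0
```

In this reading, the binary 0/1 outcome is what keeps the perception feedback "definite", while the moving threshold supplies the "adaptive, progressive" behavior the abstract describes.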
