One RL to See Them All: Visual Triple Unified Reinforcement Learning

Abstract

Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks such as object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for the perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is shaped in large part by training on a diverse dataset built around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to +14.1 across its 7B and 32B model variants, and these benefits extend to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
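
To make the idea of a Dynamic IoU reward more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the reward is the IoU between a predicted and a reference box, zeroed out when it falls below a threshold that tightens as training progresses. The function names, the schedule format, and the threshold values are all hypothetical placeholders chosen for illustration; the abstract does not specify them.

```python
from typing import Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (fraction of training completed, IoU threshold).
# These numbers are illustrative assumptions, not values from the abstract.
SCHEDULE: Tuple[Tuple[float, float], ...] = ((0.10, 0.85), (0.25, 0.95), (1.00, 0.99))


def dynamic_threshold(progress: float) -> float:
    """Return the IoU threshold for a training progress value in [0, 1]."""
    for end, threshold in SCHEDULE:
        if progress <= end:
            return threshold
    return SCHEDULE[-1][1]


def dynamic_iou_reward(pred: Box, target: Box, progress: float) -> float:
    """Assumed reward form: the IoU itself if it clears the current threshold, else 0."""
    score = iou(pred, target)
    return score if score >= dynamic_threshold(progress) else 0.0
```

Under these assumptions, a prediction with an IoU of 0.9 would be rewarded early in training, when the threshold is loose, and receive no reward later, once the threshold has tightened. That is one way a reward can be both adaptive and progressive.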
