One RL to See Them All: Visual Triple Unified Reinforcement Learning

Abstract

Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks such as object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for the perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is largely shaped by training on a diverse dataset constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to +14.1 across its 7B and 32B model variants, and the gains extend to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
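
The abstract does not spell out how the Dynamic IoU reward is computed. As a rough illustration of what "adaptive, progressive" feedback for a box-prediction task could look like, the sketch below assumes a reward that equals the IoU when it clears a threshold and is zero otherwise, with the threshold tightened as training progresses. The function names, the `(x1, y1, x2, y2)` box format, and the schedule values are all hypothetical choices made for this example; they are not taken from the text above and should not be read as the V-Triune implementation.

```python
from typing import Sequence, Tuple

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (fraction of training completed, IoU threshold).
# These numbers are placeholders for illustration only.
SCHEDULE: Tuple[Tuple[float, float], ...] = ((0.10, 0.85), (0.25, 0.95), (1.00, 0.99))


def dynamic_threshold(progress: float) -> float:
    """Return the IoU threshold for a training progress value in [0, 1]."""
    for end, threshold in SCHEDULE:
        if progress <= end:
            return threshold
    return SCHEDULE[-1][1]


def dynamic_iou_reward(pred: Box, target: Box, progress: float) -> float:
    """Assumed reward: the IoU itself if it clears the current threshold, else 0."""
    value = iou(pred, target)
    return value if value >= dynamic_threshold(progress) else 0.0
```

Under these assumptions, a prediction with IoU 0.9 earns a reward early in training, when the threshold is loose, and earns nothing later, once the threshold has tightened. This is one way to give lenient feedback at first and stricter feedback as the model improves.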
