Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Abstract

Recent advances in large multimodal models have combined image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning (spanning tens of steps) and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents over-turn responses (those that hit the maximum number of turns) from being penalized during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite being trained with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
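The over-turn masking idea can be illustrated with a minimal sketch. The abstract does not specify the exact objective, so the code below assumes a simple group-mean baseline and a 0/1 correctness reward. The names `Rollout`, `hit_turn_limit`, and `masked_advantages` are hypothetical and introduced only for illustration. The point it shows is that responses cut off at the turn cap receive a zero learning signal instead of a negative one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled trajectory for a single question (hypothetical structure)."""
    reward: float         # assumed: 1.0 if the final answer is correct, else 0.0
    hit_turn_limit: bool  # True if the trajectory was cut off at the turn cap


def masked_advantages(group: List[Rollout]) -> List[float]:
    """Mean-baseline advantages with over-turn rollouts masked out.

    Assumption: the baseline is the mean reward of the rollouts that
    finished within the turn cap. Over-turn rollouts get advantage 0.0,
    so they are neither rewarded nor penalized.
    """
    completed = [r for r in group if not r.hit_turn_limit]
    if not completed:
        # Nothing finished in time: no learning signal for this group.
        return [0.0] * len(group)
    baseline = sum(r.reward for r in completed) / len(completed)
    return [
        0.0 if r.hit_turn_limit else r.reward - baseline
        for r in group
    ]


if __name__ == "__main__":
    group = [
        Rollout(reward=1.0, hit_turn_limit=False),  # correct answer
        Rollout(reward=0.0, hit_turn_limit=False),  # wrong answer
        Rollout(reward=0.0, hit_turn_limit=True),   # cut off at the cap
    ]
    print(masked_advantages(group))  # [0.5, -0.5, 0.0]
```

Without the mask, the cut-off rollout would be treated like a wrong answer and pushed down, which would discourage long trajectories. With the mask, a low training-time turn cap need not teach the model to stop exploring early, which is consistent with the test-time scaling described in the abstract.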
