Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
September 9, 2025
Authors: Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao
cs.AI
Abstract
Recent advances in large multimodal models have leveraged image-based tools
with reinforcement learning to tackle visual problems. However, existing
open-source approaches often exhibit monotonous reasoning patterns and allow
only a limited number of interaction turns, making them inadequate for
difficult tasks that require trial-and-error exploration. In this work, we
address this limitation by scaling up tool-based interactions and introduce
Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of
steps -- and achieves state-of-the-art performance on challenging visual search
tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key
components. First, we construct the Visual Probe Dataset, a collection of
thousands of challenging visual search problems designed for exploratory
reasoning. Second, we develop an iterative data collection pipeline to obtain
cold-start trajectories that exhibit diverse reasoning patterns, including
depth-first search, trial-and-error, and goal maintenance. Third, we propose an
over-turn masking strategy that prevents penalization of over-turn responses
(those that hit the maximum number of turns) during reinforcement learning,
thereby balancing training-time efficiency with test-time scalability. Despite
training with an upper bound of only six interaction turns, our model generates
trajectories that naturally scale to tens of turns at inference time, with
accuracy improving as the number of turns increases. Extensive experiments
demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking
paths, effectively solving challenging visual search problems.
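
The over-turn masking idea can be illustrated with a minimal sketch. The abstract does not specify the reinforcement learning objective, so the code below assumes a group-normalized advantage (mean and standard deviation computed per prompt) purely for illustration. The function name `masked_advantages`, its arguments, and the choice to compute statistics only over completed responses are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def masked_advantages(rewards, over_turn, eps=1e-6):
    """Group-normalized advantages with over-turn responses masked out.

    Assumptions (not from the paper): rewards are scalars for the responses
    sampled for one prompt, and over_turn[i] is True when response i hit the
    turn cap without producing an answer.

    Returns (advantages, mask). Masked responses get advantage 0.0 and
    mask 0, so they contribute neither a reward nor a penalty to the loss.
    """
    if len(rewards) != len(over_turn):
        raise ValueError("rewards and over_turn must have the same length")

    # Only completed responses define the baseline statistics.
    kept = [r for r, o in zip(rewards, over_turn) if not o]
    if not kept:
        # Every response hit the cap: nothing to learn from this group.
        return [0.0] * len(rewards), [0] * len(rewards)

    mu = mean(kept)
    sigma = pstdev(kept)

    advantages = [
        0.0 if o else (r - mu) / (sigma + eps)
        for r, o in zip(rewards, over_turn)
    ]
    mask = [0 if o else 1 for o in over_turn]
    return advantages, mask


# Example group: the third response ran out of turns.
adv, mask = masked_advantages([1.0, 0.0, 0.0, 1.0], [False, False, True, False])
```

In this sketch the over-turn response receives a zero advantage instead of the negative one it would get if its reward of 0 were treated as a wrong answer. That matches the stated goal of not discouraging long trajectories while still training under a small turn cap.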