Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
September 9, 2025
Authors: Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao
cs.AI
Abstract
Recent advances in large multimodal models have leveraged image-based tools
with reinforcement learning to tackle visual problems. However, existing
open-source approaches often exhibit monotonous reasoning patterns and allow
only a limited number of interaction turns, making them inadequate for
difficult tasks that require trial-and-error exploration. In this work, we
address this limitation by scaling up tool-based interactions and introduce
Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of
steps -- and achieves state-of-the-art performance on challenging visual search
tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key
components. First, we construct the Visual Probe Dataset, a collection of
thousands of challenging visual search problems designed for exploratory
reasoning. Second, we develop an iterative data collection pipeline to obtain
cold-start trajectories that exhibit diverse reasoning patterns, including
depth-first search, trial-and-error, and goal maintenance. Third, we propose an
over-turn masking strategy that prevents penalization of over-turn responses
(those that hit the maximum number of turns) during reinforcement learning,
thereby balancing training-time efficiency with test-time scalability. Despite
training with an upper bound of only six interaction turns, our model generates
trajectories that naturally scale to tens of turns at inference time, with
accuracy improving as the number of turns increases. Extensive experiments
demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking
paths, effectively solving challenging visual search problems.
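
The over-turn masking idea described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the `Rollout` fields, the `masked_advantages` helper, the group-normalized advantage (in the style of group-based policy optimization), and the choice to compute the baseline over non-masked rollouts only. Only the six-turn training budget comes from the text.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled trajectory for a prompt (hypothetical structure)."""
    reward: float    # e.g. 1.0 if the final answer is correct, else 0.0
    num_turns: int   # interaction turns used by this trajectory
    answered: bool   # False if the turn budget ran out before a final answer


def masked_advantages(group, max_turns=6, eps=1e-6):
    """Return one (advantage, loss_mask) pair per rollout in a group.

    Assumption: over-turn rollouts get loss_mask 0.0, so they contribute
    no gradient and are therefore not penalized for running out of turns.
    """
    # A rollout is "over-turn" if it hit the turn limit without answering.
    over_turn = [(not r.answered) and r.num_turns >= max_turns for r in group]
    valid = [r.reward for r, o in zip(group, over_turn) if not o]
    if not valid:
        # Every rollout was cut off: nothing to learn from this group.
        return [(0.0, 0.0) for _ in group]

    # Assumption: the baseline uses only the non-masked rollouts.
    mu, sigma = mean(valid), pstdev(valid)
    result = []
    for r, o in zip(group, over_turn):
        if o:
            result.append((0.0, 0.0))  # masked out of the policy loss
        else:
            result.append(((r.reward - mu) / (sigma + eps), 1.0))
    return result


group = [
    Rollout(reward=1.0, num_turns=3, answered=True),
    Rollout(reward=0.0, num_turns=4, answered=True),
    Rollout(reward=0.0, num_turns=6, answered=False),  # over-turn
]
advantages = masked_advantages(group)
```

In this sketch the third rollout receives a zero advantage and a zero loss mask instead of the negative advantage that its zero reward would otherwise produce. Under that assumption, long exploratory trajectories are not discouraged during training, which is consistent with the claim that the model scales to more turns at inference time.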