Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
May 27, 2025
Authors: Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen
cs.AI
Abstract
Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there has been little exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the zoom-in search strategy of the recently proposed GPT-o3 model can be regarded as a special case of active perception, yet it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO (Group Relative Policy Optimization), designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 demonstrates strong zero-shot reasoning abilities on the V* benchmark without relying on any explicit reasoning data. We hope that our work provides a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
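For context, GRPO dispenses with the learned value critic of PPO and instead scores each sampled response against the other responses in its sampling group. The sketch below is a minimal, hypothetical illustration of that group-relative advantage computation, not the authors' implementation; the function name and reward values are assumptions for illustration, and the abstract does not specify ACTIVE-O3's actual reward design.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage estimate: center and scale each sampled
    candidate's scalar reward by its group's mean and std, so no
    learned value critic is needed (assumes one scalar reward per
    sampled candidate)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for G = 4 sampled zoom-in region proposals
print(group_relative_advantages([0.1, 0.7, 0.4, 0.2]))
```

Under this scheme, candidates rewarded above their group's mean receive positive advantages and are reinforced, which is what lets a purely reward-driven framework train region-selection behavior without value-function supervision.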