Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
May 27, 2025
作者: Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen
cs.AI
Abstract
Active vision, also known as active perception, refers to the process of
actively selecting where and how to look in order to gather task-relevant
information. It is a critical component of efficient perception and
decision-making in humans and advanced embodied agents. Recently, the use of
Multimodal Large Language Models (MLLMs) as central planning and
decision-making modules in robotic systems has gained extensive attention.
However, despite the importance of active perception in embodied intelligence,
there is little to no exploration of how MLLMs can be equipped with or learn
active perception capabilities. In this paper, we first provide a systematic
definition of MLLM-based active perception tasks. We point out that the
recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a
special case of active perception; however, it still suffers from low search
efficiency and inaccurate region selection. To address these issues, we propose
ACTIVE-O3, a purely reinforcement-learning-based training framework built on
top of GRPO (Group Relative Policy Optimization), designed to equip MLLMs with
active perception capabilities. We
further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across
both general open-world tasks, such as small-object and dense-object grounding,
and domain-specific scenarios, including small object detection in remote
sensing and autonomous driving, as well as fine-grained interactive
segmentation. In addition, ACTIVE-O3 demonstrates strong zero-shot
reasoning abilities on the V* Benchmark, without relying on any explicit
reasoning data. We hope that our work can provide a simple codebase and
evaluation protocol to facilitate future research on active perception in
MLLMs.
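To make the GRPO recipe concrete: GRPO trains a policy without a learned value critic by sampling a group of candidate actions per input, scoring each with a reward, and normalizing advantages within the group. The sketch below illustrates only that core update on a toy zoom-region policy; `ZoomPolicy`, `toy_reward`, the group size `G`, and the omission of GRPO's clipped-ratio and KL-penalty terms are all simplifying assumptions of this sketch, not details taken from the paper (whose policy is an MLLM and whose rewards target grounding quality).

```python
# Minimal GRPO-style sketch for a zoom-in action policy (illustrative only).
import torch
import torch.nn as nn

G = 8  # group size: candidate zoom actions sampled per image

class ZoomPolicy(nn.Module):
    """Toy stand-in for the MLLM: scores a fixed set of candidate regions."""
    def __init__(self, feat_dim=32, num_regions=16):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_regions)

    def forward(self, feats):        # feats: (B, feat_dim)
        return self.head(feats)      # logits over candidate zoom regions

def toy_reward(region_idx, target_idx):
    # Assumed reward: 1 if the sampled zoom region hits the target region,
    # else 0. The paper's actual reward design differs.
    return (region_idx == target_idx).float()

policy = ZoomPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(100):
    feats = torch.randn(4, 32)                 # fake image features
    target = torch.randint(0, 16, (4,))        # fake ground-truth regions
    logits = policy(feats)                     # (4, 16)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample((G,))                # (G, 4): one group per image
    rewards = toy_reward(actions, target.unsqueeze(0))  # (G, 4)
    # GRPO core: advantages are rewards normalized within each group,
    # so no learned value baseline is needed.
    adv = (rewards - rewards.mean(0, keepdim=True)) / \
          (rewards.std(0, keepdim=True) + 1e-6)
    logp = dist.log_prob(actions)              # (G, 4)
    loss = -(adv.detach() * logp).mean()       # policy-gradient update
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The group-relative normalization is the design choice that lets GRPO dispense with a value network entirely, which keeps the training loop simple even when the policy being optimized is a large multimodal model.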