Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Abstract

Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there has been little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the zoom-in search strategy of the recently proposed GPT-o3 model can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement-learning-based training framework built on top of GRPO and designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense-object grounding, and domain-specific scenarios, including small-object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.
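The abstract names GRPO as the basis of the training framework but does not spell out its mechanics. As a rough illustration only, the sketch below shows the group-relative advantage normalization that GRPO is generally known for: several responses are sampled for the same prompt, and each response's reward is standardized against the mean and standard deviation of its own group. The function name, the `eps` stabilizer, the use of the population standard deviation, and the toy reward values are all assumptions made for this example; they are not taken from the ACTIVE-O3 codebase, and the actual reward design is not described in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar reward per response sampled for the same
    prompt. The name, `eps`, and the population standard deviation are
    illustrative assumptions, not details from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses better than the group average get a positive advantage,
    # worse ones a negative advantage. eps avoids division by zero when
    # every response in the group receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four region proposals sampled for one query.
advantages = group_relative_advantages([0.0, 1.0, 0.0, 1.0])
print(advantages)
```

In this toy group, the two proposals with reward 1.0 receive advantages close to +1 and the other two close to -1. A group in which every proposal earns the same reward yields all-zero advantages, so it contributes no learning signal.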
