Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Abstract

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
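The contrast between a scalarized reward and two decoupled channels can be illustrated with a toy calculation. The sketch below is not the HDPO implementation, which the abstract does not specify. It assumes group-relative advantage normalization (subtract the group mean, divide by the group standard deviation), binary correctness labels, and raw tool-call counts as the efficiency signal. The function names, the penalty value 0.01, and the four-rollout group are all hypothetical.

```python
from statistics import mean, pstdev

EPS = 1e-6  # guards against division by zero when a group has no variance


def normalize(values):
    """Group-relative advantage: subtract the group mean, divide by the group std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def scalarized_advantages(correct, tool_calls, penalty):
    """Coupled baseline (assumed form): one reward mixing accuracy and a tool penalty."""
    rewards = [c - penalty * t for c, t in zip(correct, tool_calls)]
    return normalize(rewards)


def decoupled_advantages(correct, tool_calls):
    """Two channels: accuracy over all rollouts, efficiency over correct ones only."""
    acc_adv = normalize(correct)
    eff_adv = [0.0] * len(correct)  # incorrect rollouts get no efficiency signal
    idx = [i for i, c in enumerate(correct) if c == 1]
    if len(idx) >= 2:  # a comparison needs at least two correct rollouts
        scores = normalize([-tool_calls[i] for i in idx])  # fewer calls is better
        for i, s in zip(idx, scores):
            eff_adv[i] = s
    return acc_adv, eff_adv


# Hypothetical group of four rollouts sampled for one prompt.
correct = [1, 1, 0, 0]
tool_calls = [0, 4, 0, 4]

coupled = scalarized_advantages(correct, tool_calls, penalty=0.01)
acc_adv, eff_adv = decoupled_advantages(correct, tool_calls)

print("coupled:   ", [round(a, 3) for a in coupled])
print("accuracy:  ", [round(a, 3) for a in acc_adv])
print("efficiency:", [round(a, 3) for a in eff_adv])
```

In this toy group, the two correct rollouts receive nearly identical coupled advantages despite differing by four tool calls, because the mild penalty is small relative to the spread of the accuracy reward. In the conditional channel the same two rollouts are clearly separated, while incorrect rollouts receive zero efficiency signal, so skipping a tool is never rewarded on its own.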
