OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

December 29, 2025
作者: Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang
cs.AI

Abstract

Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack fine-grained cross-modal understanding and struggle with precise multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve finer-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10%-20% in accuracy.
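To make the coarse-to-fine, audio-guided loop concrete, here is a minimal sketch of the idea. It is a hypothetical illustration, not the authors' implementation: `detect_audio_events`, `caption_frames`, and the keyword-based relevance filter are all assumed stand-ins for whatever specialized tools and planner OmniAgent actually orchestrates.

```python
# Hypothetical sketch of coarse-to-fine, audio-guided perception.
# All tool names are assumptions for illustration; a real agent would
# back each stub with a specialized model.
from dataclasses import dataclass


@dataclass
class AudioEvent:
    label: str    # e.g. "dog barking"
    start: float  # seconds
    end: float    # seconds


def detect_audio_events(audio_path: str) -> list[AudioEvent]:
    """Coarse pass: an assumed audio-tagging tool returning timestamped events."""
    return [AudioEvent("dog barking", 12.0, 15.5)]  # stub output


def caption_frames(video_path: str, start: float, end: float) -> str:
    """Fine pass: an assumed captioning tool run only inside the localized window."""
    return f"frames {start:.1f}-{end:.1f}s: a dog jumps at the front door"  # stub output


def answer(question: str, video_path: str, audio_path: str) -> str:
    # 1. Coarse: audio cues localize candidate temporal events cheaply.
    events = detect_audio_events(audio_path)
    # 2. Plan: keep events relevant to the question (keyword overlap stands in
    #    for an LLM's relevance judgment in this sketch).
    relevant = [e for e in events
                if any(w in question.lower() for w in e.label.split())]
    # 3. Fine: invoke visual tools on demand, only inside relevant windows.
    evidence = [caption_frames(video_path, e.start, e.end)
                for e in (relevant or events)]
    # 4. Reason over the gathered evidence (here, simply return it).
    return " | ".join(evidence)


if __name__ == "__main__":
    print(answer("Why is the dog barking?", "clip.mp4", "clip.wav"))
```

The point of the sketch is the ordering: a cheap audio pass narrows where the expensive visual tools look, which is what distinguishes this active, on-demand tool use from dense frame captioning over the whole video.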