Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
July 30, 2025
Authors: Kaining Ying, Henghui Ding, Guanquan Jie, Yu-Gang Jiang
cs.AI
Abstract
Referring audio-visual segmentation (RAVS) has recently seen significant
advancements, yet challenges remain in integrating multimodal information and
deeply understanding and reasoning about audiovisual content. To extend the
boundaries of RAVS and facilitate future research in this field, we propose
Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset
containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS
stands out with three key innovations: (1) 8 types of multimodal expressions
that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on
understanding audio content beyond just detecting its presence; and (3) the
inclusion of complex reasoning and world knowledge in expressions. Furthermore,
we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the
challenges of multimodal reasoning and fine-grained understanding of
audiovisual content in OmniAVS. OISA uses a multimodal large language model (MLLM) to comprehend complex cues and
perform reasoning-based segmentation. Extensive experiments show that OISA
outperforms existing methods on OmniAVS and achieves competitive results on
other related tasks.