Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
July 30, 2025
Authors: Kaining Ying, Henghui Ding, Guanquan Jie, Yu-Gang Jiang
cs.AI
Abstract
Referring audio-visual segmentation (RAVS) has recently seen significant
advancements, yet challenges remain in integrating multimodal information and
deeply understanding and reasoning about audiovisual content. To extend the
boundaries of RAVS and facilitate future research in this field, we propose
Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset
containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS
stands out with three key innovations: (1) 8 types of multimodal expressions
that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on
understanding audio content beyond just detecting its presence; and (3) the
inclusion of complex reasoning and world knowledge in expressions. Furthermore,
we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the
challenges of multimodal reasoning and fine-grained understanding of
audiovisual content in OmniAVS. OISA uses a multimodal large language model (MLLM) to comprehend complex cues and
perform reasoning-based segmentation. Extensive experiments show that OISA
outperforms existing methods on OmniAVS and achieves competitive results on
other related tasks.