
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

July 30, 2025
Authors: Kaining Ying, Henghui Ding, Guanquan Jie, Yu-Gang Jiang
cs.AI

Abstract

Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and in deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) eight types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content rather than merely detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses a multimodal large language model (MLLM) to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.
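
To make the abstract's notion of a multimodal referring expression concrete, the minimal sketch below models one OmniAVS-style sample as a plain Python record whose cues may mix text, speech, sound, and visual exemplars. This is an illustrative assumption only: the class name, field names, and file layout are hypothetical and do not reflect the released OmniAVS annotation format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class OmniAVSSample:
    """Hypothetical record for one multimodal referring expression (illustrative only)."""
    video_path: str                       # source video clip to segment
    text: Optional[str] = None            # textual expression, e.g. "the dog that barks twice"
    speech_path: Optional[str] = None     # spoken version of the expression
    sound_path: Optional[str] = None      # reference sound clip used as a cue
    image_path: Optional[str] = None      # visual cue, e.g. a cropped exemplar image
    mask_paths: list[str] = field(default_factory=list)  # per-frame target masks

    def modalities(self) -> list[str]:
        """Return which of the four cue modalities this expression combines."""
        present = {
            "text": self.text,
            "speech": self.speech_path,
            "sound": self.sound_path,
            "visual": self.image_path,
        }
        return [name for name, value in present.items() if value]
```

For example, `OmniAVSSample(video_path="clip.mp4", text="the instrument playing the melody", sound_path="cue.wav").modalities()` returns `["text", "sound"]`, one of the eight text/speech/sound/visual combinations the abstract describes.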