Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
July 30, 2025
Authors: Kaining Ying, Henghui Ding, Guanquan Jie, Yu-Gang Jiang
cs.AI
Abstract
Referring audio-visual segmentation (RAVS) has recently seen significant
advancements, yet challenges remain in integrating multimodal information and
deeply understanding and reasoning about audiovisual content. To extend the
boundaries of RAVS and facilitate future research in this field, we propose
Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset
containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS
stands out with three key innovations: (1) 8 types of multimodal expressions
that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on
understanding audio content beyond just detecting its presence; and (3) the
inclusion of complex reasoning and world knowledge in expressions. Furthermore,
we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the
challenges of multimodal reasoning and fine-grained understanding of
audiovisual content in OmniAVS. OISA uses a multimodal large language model (MLLM) to comprehend complex cues and
perform reasoning-based segmentation. Extensive experiments show that OISA
outperforms existing methods on OmniAVS and achieves competitive results on
other related tasks.
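As a rough illustration of how an OmniAVS-style sample with multimodal referring expressions might be organized, the minimal Python sketch below shows one possible data layout; the class names, field names, and file paths are assumptions for illustration only and do not reflect the dataset's actual schema or release format.

```python
# Hypothetical sketch of an OmniAVS-style sample record.
# All field names and paths are illustrative assumptions, not the dataset's real schema.
from dataclasses import dataclass, field
from typing import Optional, List


@dataclass
class ReferringExpression:
    """One multimodal referring expression; any subset of cues may be present."""
    text: Optional[str] = None        # natural-language description
    speech_wav: Optional[str] = None  # path to a spoken version of the query
    sound_wav: Optional[str] = None   # path to a sound cue (e.g. the referred source's sound)
    ref_image: Optional[str] = None   # path to a visual cue (e.g. a reference image crop)


@dataclass
class OmniAVSSample:
    video_id: str
    frame_paths: List[str]
    audio_path: str
    expressions: List[ReferringExpression] = field(default_factory=list)
    mask_paths: List[str] = field(default_factory=list)  # per-frame segmentation masks


# Example: an expression combining text with a sound cue, which a model such as
# OISA would need to ground jointly in the video's visual and audio streams.
sample = OmniAVSSample(
    video_id="demo_0001",
    frame_paths=["frames/0001/000.jpg", "frames/0001/001.jpg"],
    audio_path="audio/0001.wav",
    expressions=[
        ReferringExpression(
            text="the instrument making this sound",
            sound_wav="cues/piano.wav",
        )
    ],
    mask_paths=["masks/0001/000.png", "masks/0001/001.png"],
)
print(len(sample.expressions))
```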