
Taming Modality Entanglement in Continual Audio-Visual Segmentation

October 20, 2025
Authors: Yuyang Hong, Qi Yang, Tao Zhang, Zili Wang, Zhaojin Fu, Kun Ding, Bin Fan, Shiming Xiang
cs.AI

Abstract

Recently, significant progress has been made in multi-modal continual learning, which aims to learn new tasks sequentially in multi-modal settings while preserving performance on previously learned ones. However, existing methods focus mainly on coarse-grained tasks and fall short of addressing modality entanglement in fine-grained continual learning settings. To bridge this gap, we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, which aims to continually segment new classes guided by audio. Through comprehensive analysis, we identify two critical challenges: 1) multi-modal semantic drift, where a sounding object is labeled as background in sequential tasks; and 2) co-occurrence confusion, where frequently co-occurring classes tend to be confused. To address these challenges, we design a Collision-based Multi-modal Rehearsal (CMR) framework. Specifically, for multi-modal semantic drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select samples with high modal consistency for rehearsal. For co-occurrence confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed, which increases the rehearsal frequency of confusable classes during training. Moreover, we construct three audio-visual incremental scenarios to verify the effectiveness of our method. Comprehensive experiments demonstrate that our method significantly outperforms single-modal continual learning methods.
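
The abstract does not give implementation details, but the two rehearsal ideas it names can be illustrated with a minimal Python sketch. Here `modal_consistency` and `confusion_counts` are hypothetical stand-ins for the paper's MSS scoring and collision statistics, not the authors' actual code:

```python
import random

def select_rehearsal_samples(candidates, modal_consistency, budget):
    """MSS idea (sketch): keep the candidate samples whose audio and visual
    cues agree most strongly, up to the rehearsal-buffer budget."""
    ranked = sorted(candidates, key=modal_consistency, reverse=True)
    return ranked[:budget]

def sample_rehearsal_batch(buffer, confusion_counts, batch_size):
    """CSR idea (sketch): draw rehearsal samples with probability weighted by
    how often each stored class has been confused with newly learned classes."""
    weights = [1.0 + confusion_counts.get(item["class"], 0) for item in buffer]
    return random.choices(buffer, weights=weights, k=batch_size)
```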