Taming Modality Entanglement in Continual Audio-Visual Segmentation
October 20, 2025
Authors: Yuyang Hong, Qi Yang, Tao Zhang, Zili Wang, Zhaojin Fu, Kun Ding, Bin Fan, Shiming Xiang
cs.AI
Abstract
Recently, significant progress has been made in multi-modal continual
learning, aiming to learn new tasks sequentially in multi-modal settings while
preserving performance on previously learned ones. However, existing methods
mainly focus on coarse-grained tasks, with limitations in addressing modality
entanglement in fine-grained continual learning settings. To bridge this gap,
we introduce a novel Continual Audio-Visual Segmentation (CAVS) task, aiming to
continuously segment new classes guided by audio. Through comprehensive
analysis, two critical challenges are identified: 1) multi-modal semantic
drift, where sounding objects are labeled as background in sequential tasks;
2) co-occurrence confusion, where frequently co-occurring classes tend to be
confused. In this work, a Collision-based Multi-modal Rehearsal (CMR) framework
is designed to address these challenges. Specifically, for multi-modal semantic
drift, a Multi-modal Sample Selection (MSS) strategy is proposed to select
samples with high modal consistency for rehearsal. Meanwhile, for co-occurrence
confusion, a Collision-based Sample Rehearsal (CSR) mechanism is designed,
which increases the rehearsal frequency of samples from confusable classes
during training. Moreover, we construct three audio-visual incremental
scenarios to verify the effectiveness of our method. Comprehensive
experiments demonstrate that our method significantly outperforms single-modal
continual learning methods.
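
The two rehearsal ideas named above can be illustrated with a minimal sketch. The abstract does not say how modal consistency is scored or how a "collision" is counted, so everything below is an assumption made for illustration: each memory sample carries a precomputed, hypothetical `consistency` score, and `collisions` is a hypothetical per-class count of how often an old class was confused with another class. The function names are likewise invented here and are not the authors' implementation.

```python
import random


def select_consistent_samples(samples, budget):
    """MSS-style step (assumed form): keep the `budget` samples whose
    hypothetical audio-visual consistency score is highest."""
    ranked = sorted(samples, key=lambda s: s["consistency"], reverse=True)
    return ranked[:budget]


def collision_weights(collisions, classes, base=1.0):
    """CSR-style step (assumed form): turn per-class collision counts into
    sampling weights, so classes with more collisions are rehearsed more often."""
    total = sum(collisions.get(c, 0) for c in classes)
    if total == 0:
        # No recorded confusion: fall back to uniform rehearsal.
        return {c: base for c in classes}
    return {c: base + collisions.get(c, 0) / total for c in classes}


def draw_rehearsal_batch(memory, weights, batch_size, rng):
    """Sample memory items (with replacement) with probability proportional
    to the weight of each item's class."""
    item_weights = [weights[s["label"]] for s in memory]
    return rng.choices(memory, weights=item_weights, k=batch_size)


# Usage with made-up data: three candidate samples, memory budget of two.
candidates = [
    {"id": 0, "label": "guitar", "consistency": 0.9},
    {"id": 1, "label": "man", "consistency": 0.2},
    {"id": 2, "label": "man", "consistency": 0.7},
]
memory = select_consistent_samples(candidates, budget=2)
weights = collision_weights({"guitar": 3, "man": 1}, ["guitar", "man"])
batch = draw_rehearsal_batch(memory, weights, batch_size=4, rng=random.Random(0))
```

Keeping a nonzero `base` weight means that classes with no recorded collisions are still rehearsed, which is one simple way to avoid starving them; the paper may balance this differently.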