MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
August 7, 2025
Authors: Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip H. S. Torr, Song Bai
cs.AI
Abstract
Video object segmentation (VOS) aims to segment specified target objects
throughout a video. Although state-of-the-art methods have achieved impressive
performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and
YouTube-VOS, these datasets primarily contain salient, dominant, and isolated
objects, limiting their generalization to real-world scenarios. To advance VOS
toward more realistic environments, coMplex video Object SEgmentation (MOSEv1)
was introduced to facilitate VOS research in complex scenes. Building on the
strengths and limitations of MOSEv1, we present MOSEv2, a significantly more
challenging dataset designed to further advance VOS methods under real-world
conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks
for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2
introduces significantly greater scene complexity, including more frequent
object disappearance and reappearance, severe occlusions and crowding, smaller
objects, as well as a range of new challenges such as adverse weather (e.g.,
rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot
sequences, camouflaged objects, non-physical targets (e.g., shadows,
reflections), scenarios requiring external knowledge, etc. We benchmark 20
representative VOS methods under 5 different settings and observe consistent
performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9%
on MOSEv2. We further evaluate 9 video object tracking methods and find similar
declines, demonstrating that MOSEv2 presents challenges across tasks. These
results highlight that despite high accuracy on existing datasets, current VOS
methods still struggle under real-world complexities. MOSEv2 is publicly
available at https://MOSE.video.
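
The results above are reported in J&F, the standard VOS metric that averages region similarity J (the Jaccard index, i.e., mask IoU) and contour accuracy F (a boundary F-measure). The sketch below illustrates this metric for a single pair of binary masks; it is a simplified illustration, not the official evaluation code, and the real toolkits differ in details such as boundary extraction and the matching tolerance.

```python
# Minimal sketch of the J&F metric for one predicted/ground-truth binary mask pair.
# Assumes numpy boolean arrays of the same shape; the boundary tolerance `tol`
# is a simplification of the official evaluation protocol.
import numpy as np
from scipy import ndimage


def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union


def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Simplified contour accuracy F: F1 score between mask boundaries,
    counting a boundary pixel as matched if it lies within `tol` pixels
    of the other mask's boundary."""
    def boundary(mask: np.ndarray) -> np.ndarray:
        # Boundary = mask minus its erosion (one-pixel-wide contour).
        return np.logical_xor(mask, ndimage.binary_erosion(mask))

    pb, gb = boundary(pred), boundary(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    pb_dil = ndimage.binary_dilation(pb, structure=struct)
    gb_dil = ndimage.binary_dilation(gb, structure=struct)
    precision = np.logical_and(pb, gb_dil).sum() / max(pb.sum(), 1)
    recall = np.logical_and(gb, pb_dil).sum() / max(gb.sum(), 1)
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom


def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """J&F: the mean of region similarity and contour accuracy."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))
```

In practice these per-frame, per-object scores are averaged over all annotated frames and objects in the benchmark, which is how figures such as SAM2's 76.4% on MOSEv1 versus 50.9% on MOSEv2 are obtained.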