MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes

August 7, 2025
作者: Henghui Ding, Kaining Ying, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang, Philip H. S. Torr, Song Bai
cs.AI

Abstract

Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at https://MOSE.video.
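The J&F score cited above averages two per-mask quantities: region similarity J (the intersection-over-union of predicted and ground-truth masks) and boundary accuracy F (a precision/recall F-measure on boundary pixels). A minimal NumPy sketch of this metric follows; note that the official DAVIS-style evaluation matches boundaries with a small distance tolerance via bipartite matching, which is omitted here for brevity, so this is an illustrative simplification rather than the benchmark's exact implementation:

```python
import numpy as np

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: intersection-over-union (IoU) of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: treat as a perfect match
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def mask_boundary(mask: np.ndarray) -> np.ndarray:
    """Boundary pixels: mask pixels with at least one background 4-neighbour."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return m & ~interior

def boundary_f_measure(pred: np.ndarray, gt: np.ndarray) -> float:
    """F: harmonic mean of boundary precision/recall (pixel-exact, no tolerance)."""
    bp, bg = mask_boundary(pred), mask_boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    match = np.logical_and(bp, bg).sum()
    precision = match / bp.sum() if bp.sum() else 0.0
    recall = match / bg.sum() if bg.sum() else 0.0
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """J&F: mean of region similarity and boundary accuracy for one mask pair."""
    return 0.5 * (region_similarity(pred, gt) + boundary_f_measure(pred, gt))
```

In benchmark reporting, these per-frame scores are averaged over all annotated frames and objects in a sequence, then over sequences.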