SMITE: Segment Me In TimE

Abstract

Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation has arbitrary granularity, meaning that the number of segments can vary arbitrarily and that masks are defined from only one or a few sample images. In this paper, we address this problem by employing a pre-trained text-to-image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach can effectively handle a variety of segmentation scenarios and that it outperforms state-of-the-art alternatives.