LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

October 14, 2024
Authors: Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, Xihui Liu
cs.AI

Abstract

The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality, including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large number of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.
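The abstract names scene cuts and dynamic degree as two of the filtering metrics but does not specify implementations. Below is a minimal sketch, assuming PySceneDetect for cut detection and Farneback optical flow as a stand-in dynamic-degree score; the function name `is_long_take` and all thresholds are hypothetical, not taken from the paper, and the semantic-level quality check is omitted.

```python
# Hypothetical sketch of a long-take filtering stage, not the paper's method.
import cv2
from scenedetect import detect, ContentDetector

MIN_DURATION_S = 10.0  # feature (1): keep only videos covering >= 10 s
MIN_MEAN_FLOW = 1.0    # hypothetical dynamic-degree threshold

def is_long_take(path: str) -> bool:
    """Keep a video only if it is long enough, cut-free, and dynamic."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    n_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    if fps <= 0 or n_frames / fps < MIN_DURATION_S:
        cap.release()
        return False

    # Feature (2): reject any video with a detected scene cut
    # (more than one detected scene implies at least one cut).
    if len(detect(path, ContentDetector())) > 1:
        cap.release()
        return False

    # Feature (3): approximate "dynamic degree" with the mean
    # optical-flow magnitude over a small sample of frame pairs.
    ok, prev = cap.read()
    if not ok:
        cap.release()
        return False
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flow_mags = []
    for _ in range(30):  # sample a handful of pairs for speed
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        flow_mags.append(float(mag.mean()))
        prev_gray = gray
    cap.release()
    return bool(flow_mags) and sum(flow_mags) / len(flow_mags) >= MIN_MEAN_FLOW
```

The hierarchical captioning pipeline is likewise only named in the abstract. One common realization, sketched below under assumptions, is to caption short windows independently and then merge the per-window captions into a single temporally ordered description; `caption_clip` and `merge_captions` are hypothetical placeholders for a video captioner (e.g. a VLM) and an LLM summarizer, and the 5-second window is an arbitrary choice.

```python
# Hypothetical sketch of hierarchical, temporally dense captioning.
from typing import Callable, List

def dense_caption(path: str,
                  duration_s: float,
                  caption_clip: Callable[[str, float, float], str],
                  merge_captions: Callable[[List[str]], str],
                  window_s: float = 5.0) -> str:
    """Caption fixed windows, then merge them into one dense caption."""
    clip_captions = []
    t = 0.0
    while t < duration_s:
        end = min(t + window_s, duration_s)
        # Lower level: a timestamped caption for each short window.
        clip_captions.append(f"[{t:.0f}-{end:.0f}s] " + caption_clip(path, t, end))
        t = end
    # Higher level: rewrite the per-window captions into a single
    # temporally ordered description of the whole long take.
    return merge_captions(clip_captions)
```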