
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

March 3, 2025
Authors: Wenhao Wang, Yi Yang
cs.AI

Abstract

Text-to-video generative models convert textual prompts into dynamic visual content, offering wide-ranging applications in film production, gaming, and education. However, their real-world performance often falls short of user expectations. One key reason is that these models have not been trained on videos related to some of the topics users want to create. In this paper, we propose VideoUFO, the first Video dataset specifically curated to align with Users' FOcus in real-world scenarios. Beyond this, our VideoUFO also features: (1) minimal (0.29%) overlap with existing video datasets, and (2) videos searched exclusively via YouTube's official API under the Creative Commons license. These two attributes provide future researchers with greater freedom to broaden their training sources. VideoUFO comprises over 1.09 million video clips, each paired with both a brief and a detailed caption (description). Specifically, through clustering, we first identify 1,291 user-focused topics from the million-scale real text-to-video prompt dataset, VidProM. Then, we use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both brief and detailed captions for each clip. After verifying the clips against their specified topics, we are left with about 1.09 million video clips. Our experiments reveal that (1) 16 current text-to-video models do not achieve consistent performance across all user-focused topics, and (2) a simple model trained on VideoUFO outperforms other models on the worst-performing topics. The dataset is publicly available at https://huggingface.co/datasets/WenhaoWang/VideoUFO under the CC BY 4.0 License.
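
The topic-identification step described in the abstract (clustering a million-scale prompt set from VidProM into 1,291 user-focused topics) can be illustrated with a short sketch. The abstract does not specify the embedding model or clustering algorithm, so the choices below (a sentence-transformer encoder, mini-batch k-means, the `WenhaoWang/VidProM` repo id, and the `prompt` column name) are assumptions made for illustration only.

```python
# Illustrative sketch only: embed text-to-video prompts and cluster them into
# candidate topics. The encoder, clustering algorithm, dataset repo id, and
# column name are assumptions, not the paper's confirmed pipeline.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import MiniBatchKMeans

# Load a small slice of prompts; the full pipeline would use the million-scale set.
prompts = load_dataset("WenhaoWang/VidProM", split="train[:10000]")["prompt"]

# Encode prompts into dense vectors with an off-the-shelf sentence encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(prompts, batch_size=256, show_progress_bar=True)

# Cluster the embeddings; the paper arrives at 1,291 topics on the full prompt
# set, but a smaller k keeps this sketch fast.
kmeans = MiniBatchKMeans(n_clusters=100, random_state=0).fit(embeddings)

# Each cluster is a candidate user-focused topic; inspect a few members.
for idx, label in list(zip(range(len(prompts)), kmeans.labels_))[:5]:
    print(label, prompts[idx][:80])
```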
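
Since the dataset is released on Hugging Face, a minimal loading sketch may be useful. The split name and caption column names below are assumptions; consult the dataset card for the actual schema.

```python
# Minimal sketch of loading VideoUFO with the Hugging Face `datasets` library.
# The split and column names are assumptions; see the dataset card at
# https://huggingface.co/datasets/WenhaoWang/VideoUFO for the real schema.
from datasets import load_dataset

ds = load_dataset("WenhaoWang/VideoUFO", split="train")  # assumed split name

# Inspect the actual schema before relying on specific fields.
print(ds.column_names)

# Assuming each clip record pairs a brief and a detailed caption:
example = ds[0]
for key in ("brief_caption", "detailed_caption"):  # hypothetical field names
    print(key, "->", example.get(key))
```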
