Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation
March 31, 2025
Authors: Shengqiong Wu, Weicai Ye, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Shuicheng Yan, Hao Fei, Tat-Seng Chua
cs.AI
Abstract
To address the bottleneck of accurate user intent interpretation within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple the various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as regions, motion, and camera poses) into dense, structured captions that provide backbone video generators with better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate that our system significantly improves the controllability and video quality of existing video generation models across various aspects. Project Page: https://sqwu.top/Any2Cap/
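
The abstract describes a two-stage, decoupled pipeline: an MLLM first interprets a heterogeneous mix of conditions into a single dense, structured caption, which is then consumed unchanged by an off-the-shelf video generation backbone. The sketch below illustrates only that input/output contract; all names (Conditions, AnyConditionCaptioner, VideoGenerator, controllable_generation) are hypothetical placeholders, not the paper's actual API.

```python
# Minimal sketch of the decoupled any-condition-to-caption pipeline,
# assuming hypothetical placeholder classes rather than the paper's code.
from dataclasses import dataclass, field
from typing import Any, Optional, List


@dataclass
class Conditions:
    """Heterogeneous user inputs: short text plus optional specialized cues."""
    text: str
    images: List[Any] = field(default_factory=list)
    videos: List[Any] = field(default_factory=list)
    regions: List[Any] = field(default_factory=list)       # e.g. bounding boxes
    motion: Optional[Any] = None                            # e.g. trajectories
    camera_poses: List[Any] = field(default_factory=list)   # e.g. extrinsics


class AnyConditionCaptioner:
    """Stage 1 (hypothetical): an MLLM that turns any mix of conditions
    into a dense, structured caption."""

    def interpret(self, cond: Conditions) -> str:
        # In the paper this is an instruction-tuned MLLM; here we only
        # illustrate the contract: conditions in, structured caption out.
        parts = [f"Scene: {cond.text}"]
        if cond.regions:
            parts.append(f"Regions: {len(cond.regions)} annotated areas")
        if cond.motion is not None:
            parts.append(f"Motion: {cond.motion}")
        if cond.camera_poses:
            parts.append(f"Camera: {len(cond.camera_poses)} key poses")
        return " | ".join(parts)


class VideoGenerator:
    """Stage 2 (hypothetical): any text-to-video backbone that consumes
    the dense caption without condition-specific modifications."""

    def generate(self, dense_caption: str) -> Any:
        raise NotImplementedError("plug in an off-the-shelf T2V model")


def controllable_generation(cond: Conditions) -> Any:
    # Decoupling: interpretation finishes entirely before synthesis starts,
    # so the backbone generator never sees the raw conditions.
    caption = AnyConditionCaptioner().interpret(cond)
    return VideoGenerator().generate(caption)
```

Because the two stages communicate only through the caption string, the captioner can be retrained (e.g., on Any2CapIns) or the backbone swapped without touching the other stage.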