

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

June 6, 2024
作者: Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang
cs.AI

Abstract

We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V-annotated dense captions of videos with various lengths and sources, developed through a carefully designed data filtering and annotation strategy. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that reaches SOTA performance on three advancing video benchmarks. To achieve this, setting aside non-scalable, costly human annotators, we find that captioning videos with GPT4V using a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporally confused results. We argue that the challenge of designing a high-quality video captioning strategy lies in three aspects: 1) precise understanding of inter-frame temporal changes; 2) detailed description of intra-frame content; 3) frame-number scalability for arbitrary-length videos. To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient for generating captions for videos with arbitrary resolutions, aspect ratios, and lengths. Based on it, we construct ShareGPT4Video, which contains 40K high-quality videos spanning a wide range of categories, and the resulting captions encompass rich world knowledge, object attributes, camera movements, and, crucially, detailed and precise temporal descriptions of events. Based on ShareGPT4Video, we further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos...
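
The abstract describes the differential captioning strategy only at a high level: caption each frame relative to what came before, then merge the per-frame differences into one dense, temporally ordered caption. Below is a minimal sketch of how such a chained, difference-based loop could be driven through the OpenAI chat API; the model name, prompts, and the helper functions (`encode_frame`, `caption_frame`, `caption_video`) are illustrative assumptions, not the paper's released pipeline.

```python
# Minimal sketch of a differential video-captioning loop.
# Assumptions: model choice, prompts, and frame sampling are illustrative only.
import base64
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder for any GPT-4V-class vision model


def encode_frame(path: str) -> str:
    """Read a JPEG frame from disk and return it as a base64 data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:image/jpeg;base64,{b64}"


def caption_frame(frame_url: str, prev_caption: str | None) -> str:
    """Caption one frame; for later frames, describe changes vs. the previous caption."""
    if prev_caption is None:
        prompt = ("Describe this video frame in detail: objects, their attributes, "
                  "the scene, and the camera viewpoint.")
    else:
        prompt = ("The previous frame was described as:\n"
                  f"{prev_caption}\n"
                  "Describe this new frame, focusing on what changed: motion, "
                  "objects appearing or disappearing, and camera movement.")
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": frame_url}},
            ],
        }],
    )
    return resp.choices[0].message.content


def caption_video(frame_paths: list[str]) -> str:
    """Chain per-frame differential captions, then merge them into one dense caption."""
    captions, prev = [], None
    for path in frame_paths:
        prev = caption_frame(encode_frame(path), prev)
        captions.append(prev)
    merge_prompt = ("Merge these per-frame descriptions into one dense video caption "
                    "with precise temporal ordering of events:\n" + "\n".join(captions))
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": merge_prompt}],
    )
    return resp.choices[0].message.content
```

Because each call sees only the current frame plus the previous description, the per-call input stays roughly constant regardless of video length, which is what makes a differential approach scalable to arbitrary frame counts in the way the abstract claims.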

