ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

June 6, 2024
Authors: Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang
cs.AI

Abstract

We present the ShareGPT4Video series, which aims to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V-annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotation strategies. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that reached SOTA performance on three advanced video benchmarks. To achieve this, setting aside non-scalable and costly human annotators, we find that using GPT4V to caption videos with a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporally confused results. We argue that the challenge of designing a high-quality video captioning strategy lies in three aspects: 1) precise understanding of inter-frame temporal changes; 2) detailed description of intra-frame content; 3) frame-number scalability for arbitrary-length videos. To this end, we meticulously design a differential video captioning strategy that is stable, scalable, and efficient for generating captions for videos of arbitrary resolution, aspect ratio, and length. Based on it, we construct ShareGPT4Video, which contains 40K high-quality videos spanning a wide range of categories; the resulting captions encompass rich world knowledge, object attributes, camera movements, and, crucially, detailed and precise temporal descriptions of events. Based on ShareGPT4Video, we further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos...
