

Vript: A Video Is Worth Thousands of Words

June 10, 2024
Authors: Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, Hai Zhao
cs.AI

Abstract

Advancements in multimodal learning, particularly in video understanding and generation, require high-quality video-text datasets for improved model performance. Vript addresses this issue with a meticulously annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips. Each clip has a caption of ~145 words, over 10x longer than in most video-text datasets. Unlike captions in previous datasets, which only document static content, we enhance video captioning to video scripting by documenting not just the content but also the camera operations, including shot types (medium shot, close-up, etc.) and camera movements (panning, tilting, etc.). By utilizing Vript, we explore three training paradigms that align more text with the video modality than clip-caption pairs alone. This results in Vriptor, a top-performing video captioning model among open-source models, comparable to GPT-4V in performance. Vriptor is also a powerful model capable of end-to-end generation of dense and detailed captions for long videos. Moreover, we introduce Vript-Hard, a benchmark consisting of three video understanding tasks that are more challenging than existing benchmarks: Vript-HAL is the first benchmark evaluating action and object hallucinations in video LLMs, Vript-RR combines reasoning with retrieval to resolve question ambiguity in long-video QA, and Vript-ERO is a new task that evaluates the temporal understanding of events in long videos rather than of actions in short videos as in previous works. All code, models, and datasets are available at https://github.com/mutonix/Vript.
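
To make the per-clip annotation described above concrete, here is a minimal sketch in Python of what a single Vript clip record might look like, based only on the fields the abstract mentions (a dense script-like caption plus shot type and camera movement). The field names and example values are illustrative assumptions, not the released schema; see https://github.com/mutonix/Vript for the actual data format.

```python
from dataclasses import dataclass

@dataclass
class VriptClipAnnotation:
    """Hypothetical per-clip record reflecting the fields described in the abstract."""
    video_id: str         # source video the clip was cut from
    clip_id: str          # identifier of the clip within that video
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "panning", "tilting"
    caption: str          # dense, script-like description (~145 words on average)

# Illustrative example; the values are made up, not taken from the dataset.
example = VriptClipAnnotation(
    video_id="video_0001",
    clip_id="video_0001_clip_03",
    shot_type="medium shot",
    camera_movement="panning",
    caption=(
        "The camera pans slowly across a sunlit kitchen while a chef in a white "
        "apron plates a dish and explains each step to the viewer."
    ),
)
print(example.shot_type, "|", example.camera_movement)
```
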
