Vript: A Video Is Worth Thousands of Words
June 10, 2024
Authors: Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, Hai Zhao
cs.AI
Abstract
Advancements in multimodal learning, particularly in video understanding and
generation, require high-quality video-text datasets for improved model
performance. Vript addresses this issue with a meticulously annotated corpus of
12K high-resolution videos, offering detailed, dense, and script-like captions
for over 420K clips. Each clip has a caption of ~145 words, over 10x longer
than the captions in most video-text datasets. Unlike the captions in previous
datasets, which document only static content, we enhance video captioning to
video scripting by documenting not just the content but also the camera
operations, including the shot types (medium shot, close-up, etc.) and camera
movements (panning, tilting, etc.). Utilizing Vript, we explore three training
paradigms that align more text with the video modality than clip-caption pairs
alone. This
results in Vriptor, a video captioning model that outperforms other
open-source models and performs comparably to GPT-4V. Vriptor is also a powerful model
capable of end-to-end generation of dense and detailed captions for long
videos. Moreover, we introduce Vript-Hard, a benchmark consisting of three
video understanding tasks that are more challenging than existing benchmarks:
Vript-HAL is the first benchmark evaluating action and object hallucinations in
video LLMs, Vript-RR combines reasoning with retrieval to resolve question
ambiguity in long-video QA, and Vript-ERO is a new task that evaluates the
temporal understanding of events in long videos, rather than the actions in
short videos addressed by previous works. All code, models, and datasets are
available at
https://github.com/mutonix/Vript.
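
To make the annotation format concrete, the following is a minimal, hypothetical
Python sketch of what a single Vript clip record could look like, based only on
the fields named in the abstract (a dense ~145-word, script-like caption plus
the shot type and camera movement). The field names are illustrative
assumptions, not the dataset's actual schema; the real format is defined in the
repository at https://github.com/mutonix/Vript.

# A minimal, hypothetical sketch of one Vript clip annotation.
# Field names are assumptions for illustration, not the released schema.
from dataclasses import dataclass

@dataclass
class VriptClipAnnotation:
    video_id: str         # one of the ~12K high-resolution source videos
    clip_id: str          # one of the 420K+ annotated clips
    caption: str          # dense, script-like description of ~145 words
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "panning", "tilting"

example = VriptClipAnnotation(
    video_id="video_0001",
    clip_id="video_0001_clip_03",
    caption="A chef in a white apron dices an onion on a wooden board ...",
    shot_type="close-up",
    camera_movement="panning",
)
print(example.shot_type, example.camera_movement)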