Vript: A Video Is Worth Thousands of Words
June 10, 2024
Authors: Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, Hai Zhao
cs.AI
Abstract
Advancements in multimodal learning, particularly in video understanding and
generation, require high-quality video-text datasets for improved model
performance. Vript addresses this issue with a meticulously annotated corpus of
12K high-resolution videos, offering detailed, dense, and script-like captions
for over 420K clips. Each clip has a caption of ~145 words, over 10x longer
than the captions in most video-text datasets. Unlike the captions in previous
datasets, which document only static content, we enhance video captioning to
video scripting by documenting not just the content but also the camera
operations, including the shot types (medium shot, close-up, etc.) and camera
movements (panning, tilting, etc.). Utilizing Vript, we explore three training
paradigms that align more text with the video modality than clip-caption pairs
alone. This
results in Vriptor, a video captioning model that outperforms other
open-source models and performs comparably to GPT-4V. Vriptor is also a powerful model
capable of end-to-end generation of dense and detailed captions for long
videos. Moreover, we introduce Vript-Hard, a benchmark consisting of three
video understanding tasks that are more challenging than existing benchmarks:
Vript-HAL is the first benchmark evaluating action and object hallucinations in
video LLMs, Vript-RR combines reasoning with retrieval to resolve question
ambiguity in long-video QA, and Vript-ERO is a new task that evaluates the
temporal understanding of events in long videos, rather than the actions in
short videos addressed by previous works. All code, models, and datasets are
available at
https://github.com/mutonix/Vript.
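
To make the annotation format concrete, the following is a minimal, hypothetical
Python sketch of what a single Vript clip record could look like, based only on
the fields named in the abstract (a dense ~145-word, script-like caption plus
the shot type and camera movement). The field names are illustrative
assumptions, not the dataset's actual schema; the real format is defined in the
repository at https://github.com/mutonix/Vript.

# A minimal, hypothetical sketch of one Vript clip annotation.
# Field names are assumptions for illustration, not the released schema.
from dataclasses import dataclass

@dataclass
class VriptClipAnnotation:
    video_id: str         # one of the ~12K high-resolution source videos
    clip_id: str          # one of the 420K+ annotated clips
    caption: str          # dense, script-like description of ~145 words
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "panning", "tilting"

example = VriptClipAnnotation(
    video_id="video_0001",
    clip_id="video_0001_clip_03",
    caption="A chef in a white apron dices an onion on a wooden board ...",
    shot_type="close-up",
    camera_movement="panning",
)
print(example.shot_type, example.camera_movement)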