Large-scale Pre-training for Grounded Video Caption Generation
March 13, 2025
Authors: Evangelos Kazakos, Cordelia Schmid, Josef Sivic
cs.AI
Abstract
We propose a novel approach for captioning and object grounding in video,
where the objects in the caption are grounded in the video via temporally dense
bounding boxes. We introduce the following contributions. First, we present a
large-scale automatic annotation method that aggregates captions grounded with
bounding boxes across individual frames into temporally dense and consistent
bounding box annotations. We apply this approach on the HowTo100M dataset to
construct a large-scale pre-training dataset, named HowToGround1M. We also
introduce a Grounded Video Caption Generation model, dubbed GROVE, and
pre-train the model on HowToGround1M. Second, we introduce a new dataset,
called iGround, of 3500 videos with manually annotated captions and dense
spatio-temporally grounded bounding boxes. This allows us to measure progress
on this challenging problem, as well as to fine-tune our model on this
small-scale but high-quality data. Third, we demonstrate that our approach
achieves state-of-the-art results on the proposed iGround dataset compared to a
number of baselines, as well as on the VidSTG and ActivityNet-Entities
datasets. We perform extensive ablations that demonstrate the importance of
pre-training using our automatically annotated HowToGround1M dataset followed
by fine-tuning on the manually annotated iGround dataset and validate the key
technical contributions of our model.