Large-scale Pre-training for Grounded Video Caption Generation
March 13, 2025
Authors: Evangelos Kazakos, Cordelia Schmid, Josef Sivic
cs.AI
Abstract
We propose a novel approach for captioning and object grounding in video,
where the objects in the caption are grounded in the video via temporally dense
bounding boxes. We introduce the following contributions. First, we present a
large-scale automatic annotation method that aggregates captions grounded with
bounding boxes across individual frames into temporally dense and consistent
bounding box annotations. We apply this approach on the HowTo100M dataset to
construct a large-scale pre-training dataset, named HowToGround1M. We also
introduce a Grounded Video Caption Generation model, dubbed GROVE, and
pre-train the model on HowToGround1M. Second, we introduce a new dataset,
called iGround, of 3500 videos with manually annotated captions and dense
spatio-temporally grounded bounding boxes. This allows us to measure progress
on this challenging problem, as well as to fine-tune our model on this
small-scale but high-quality data. Third, we demonstrate that our approach
achieves state-of-the-art results on the proposed iGround dataset compared to a
number of baselines, as well as on the VidSTG and ActivityNet-Entities
datasets. We perform extensive ablations that demonstrate the importance of
pre-training using our automatically annotated HowToGround1M dataset followed
by fine-tuning on the manually annotated iGround dataset and validate the key
technical contributions of our model.