SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation

Abstract

Human-centric video frame interpolation has great potential for improving people's entertainment experiences and for commercial applications in the sports analysis industry, e.g., synthesizing slow-motion videos. Although multiple benchmark datasets are available in the community, none of them is dedicated to human-centric scenarios. To bridge this gap, we introduce SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution (≥720p) slow-motion sports videos crawled from YouTube. We re-train several state-of-the-art methods on our benchmark, and the results show a decrease in their accuracy compared to other datasets. This highlights the difficulty of our benchmark and suggests that it poses significant challenges even for the best-performing methods, as human bodies are highly deformable and occlusions are frequent in sports videos. To improve accuracy, we introduce two loss terms that take human-aware priors into account, where we add auxiliary supervision to panoptic segmentation and human keypoint detection, respectively. The loss terms are model-agnostic and can be easily plugged into any video frame interpolation approach. Experimental results validate the effectiveness of the proposed loss terms, which yield consistent performance improvements over 5 existing models and establish strong baseline models on our benchmark. The dataset and code can be found at: https://neu-vi.github.io/SportsSlomo/.
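The abstract describes the two loss terms only at a high level. As a rough illustration of what "model-agnostic auxiliary supervision" could look like, the sketch below adds two weighted consistency terms to a plain reconstruction loss. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`mean_abs_diff`, `human_aware_loss`), the weights `w_seg` and `w_kpt`, the L1-style distances, and the toy `seg_fn` and `kpt_fn` callables that stand in for real panoptic segmentation and keypoint detection networks.

```python
from typing import Callable, List, Sequence

# A frame is a flattened list of pixel intensities, purely for illustration.
Frame = Sequence[float]


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def human_aware_loss(
    pred: Frame,
    target: Frame,
    seg_fn: Callable[[Frame], List[float]],
    kpt_fn: Callable[[Frame], List[float]],
    w_seg: float = 0.1,  # hypothetical weight, not from the paper
    w_kpt: float = 0.1,  # hypothetical weight, not from the paper
) -> float:
    """Reconstruction loss plus two auxiliary, human-aware terms.

    seg_fn and kpt_fn are applied to both the interpolated frame and the
    ground-truth frame; the loss penalizes disagreement between the two.
    The interpolation model itself is never referenced, so the same loss
    could be attached to any method that outputs a frame.
    """
    rec = mean_abs_diff(pred, target)
    seg = mean_abs_diff(seg_fn(pred), seg_fn(target))
    kpt = mean_abs_diff(kpt_fn(pred), kpt_fn(target))
    return rec + w_seg * seg + w_kpt * kpt


# Toy stand-ins for the auxiliary networks (assumptions, not real models).
def toy_seg(frame: Frame) -> List[float]:
    """Threshold each pixel into a binary 'person' mask."""
    return [1.0 if p > 0.5 else 0.0 for p in frame]


def toy_kpt(frame: Frame) -> List[float]:
    """Return the index of the brightest pixel as a single 'keypoint'."""
    values = list(frame)
    return [float(values.index(max(values)))]


if __name__ == "__main__":
    interpolated = [0.0, 0.5, 1.0, 0.5]
    ground_truth = [0.0, 1.0, 1.0, 0.0]
    print(human_aware_loss(interpolated, ground_truth, toy_seg, toy_kpt))
```

Because the loss only sees the predicted and ground-truth frames, swapping in a different interpolation model requires no change to it, which is the sense of "model-agnostic" used here. In practice the auxiliary functions would be pretrained networks and the distances would likely be task-specific.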

