

Video-to-Audio Generation with Hidden Alignment

July 10, 2024
Authors: Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu
cs.AI

Abstract

Generating semantically and temporally aligned audio content in accordance with video input has become a focal point for researchers, particularly following the remarkable breakthrough in text-to-video generation. In this work, we aim to offer insights into the video-to-audio generation paradigm, focusing on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techniques. Beginning with a foundational model VTA-LDM built on a simple yet surprisingly effective intuition, we explore various vision encoders and auxiliary embeddings through ablation studies. Employing a comprehensive evaluation pipeline that emphasizes generation quality and video-audio synchronization alignment, we demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities. Furthermore, we provide critical insights into the impact of different data augmentation methods on enhancing the generation framework's overall capacity. We showcase possibilities to advance the challenge of generating synchronized audio from semantic and temporal perspectives. We hope these insights will serve as a stepping stone toward developing more realistic and accurate audio-visual generation models.
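The paradigm the abstract describes — a vision encoder extracting per-frame features, combined with auxiliary embeddings to condition an audio generator — can be sketched conceptually. Everything below is an illustrative assumption (the function names, feature dimensions, mean-free random projection, and concatenation scheme are hypothetical), not the actual VTA-LDM implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames: np.ndarray, dim: int = 512) -> np.ndarray:
    """Stand-in for a pretrained vision encoder: maps each video frame
    to a feature vector via a (hypothetical) random linear projection."""
    n_frames, h, w, c = frames.shape
    proj = rng.standard_normal((h * w * c, dim)) / np.sqrt(h * w * c)
    return frames.reshape(n_frames, -1) @ proj  # (n_frames, dim)

def build_conditioning(frame_feats: np.ndarray,
                       aux_embed: np.ndarray) -> np.ndarray:
    """Broadcast an auxiliary embedding (e.g. a temporal-position or
    text embedding) across frames and concatenate it channel-wise,
    yielding the conditioning sequence a diffusion decoder could attend to."""
    n_frames = frame_feats.shape[0]
    aux = np.broadcast_to(aux_embed, (n_frames, aux_embed.shape[-1]))
    return np.concatenate([frame_feats, aux], axis=-1)

frames = rng.random((8, 16, 16, 3))  # 8 toy RGB frames
cond = build_conditioning(encode_video(frames), rng.standard_normal(64))
print(cond.shape)  # (8, 576): 512 visual dims + 64 auxiliary dims per frame
```

In an actual latent-diffusion setup, `cond` would feed the denoiser's cross-attention layers; the ablations in the paper vary which encoder produces the visual features and which auxiliary embeddings are concatenated.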
November 28, 2024