Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Abstract

Scaling multimodal alignment between video and audio is challenging, particularly because data are limited and text descriptions do not match frame-level video information. In this work, we address the scaling challenge in multimodal-to-audio generation and examine whether models trained on short instances can generalize to longer ones at test time. To this end, we present MMHNet, a multimodal hierarchical network that extends state-of-the-art video-to-audio models. Our approach combines a hierarchical method with non-causal Mamba to support long-form audio generation, and it substantially improves the generation of audio longer than 5 minutes. We also show that "train short, test long" is possible in video-to-audio generation without training on longer durations. In our experiments, the proposed method achieves remarkable results on long-video-to-audio benchmarks and outperforms prior video-to-audio work. Moreover, we demonstrate that our model can generate more than 5 minutes of audio, whereas prior video-to-audio methods fall short at long durations.
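To make the term "non-causal" concrete, the following is a minimal illustrative sketch and not the MMHNet implementation. It assumes a toy diagonal linear recurrence with fixed parameters (real Mamba layers use input-dependent, selective parameters), and the names `causal_scan`, `non_causal_scan`, `a`, and `b` are hypothetical. It shows that a forward scan plus a backward scan lets every time step see both past and future inputs, and that the same parameters apply to sequences of any length, which is the property that "train short, test long" relies on.

```python
import numpy as np


def causal_scan(x, a, b):
    """Run h[t] = a * h[t-1] + b * x[t] left to right over a (T, D) array."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out


def non_causal_scan(x, a, b):
    """Sum a forward scan and a backward scan so each step sees past and future."""
    forward = causal_scan(x, a, b)
    backward = causal_scan(x[::-1], a, b)[::-1]
    return forward + backward


rng = np.random.default_rng(0)
D = 4
a = np.full(D, 0.9)  # hypothetical per-channel decay, independent of sequence length
b = np.ones(D)       # hypothetical per-channel input gain

short = rng.normal(size=(8, D))      # stands in for a short training-length sequence
long_seq = rng.normal(size=(800, D))  # stands in for a much longer test-length sequence

# The same parameters process both lengths without any change.
y_short = non_causal_scan(short, a, b)
y_long = non_causal_scan(long_seq, a, b)
```

Because the parameters have no dependence on the sequence length `T`, nothing in this toy layer needs to be retrained or resized for longer inputs. Whether quality holds at longer lengths is an empirical question that the paper addresses with its benchmarks.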
