^RFLAV: Rolling Flow matching for infinite Audio Video generation

Abstract

Joint audio-video (AV) generation remains a significant challenge in generative AI, primarily because of three critical requirements: high quality of the generated samples; seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa; and unlimited video duration. In this paper, we present ^RFLAV, a novel transformer-based architecture that addresses all of these key challenges of AV generation. We explore three distinct cross-modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning the audio and visual modalities. Our experimental results demonstrate that ^RFLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at https://github.com/ErgastiAlex/R-FLAV.
