Autoregressive Diffusion Transformer for Streaming Synchronized Spatial Audio Generation

Abstract

Real-time, accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis techniques are often constrained by a trade-off between generation quality and inference latency, and by the difficulty of capturing precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere makes the following main contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming generation of high-quality spatial audio. 2) We design a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further apply a multi-objective online direct preference optimization (ODPO) scheme, which together yield strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we develop an automated annotation pipeline that generates detailed spatial captions. Experimental results demonstrate that SwanSphere achieves superior performance on both video-to-spatial-audio and text-to-spatial-audio generation tasks. Demos are available at https://swanaigc.github.io.