Klear: Unified Multi-Task Audio-Video Joint Generation
January 7, 2026
Authors: Jun Wang, Chunyu Qiang, Yuxin Guo, Yiran Wang, Xijuan Zeng, Chen Zhang, Pengfei Wan
cs.AI
Abstract
Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, stemming from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Klear and explore three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we adopt a progressive multitask regime: random modality masking enables joint optimization across tasks, and a multistage curriculum yields robust representations, strengthens audio-visually aligned world knowledge, and prevents unimodal collapse. For data, we present the first large-scale audio-video dataset with dense captions and introduce a novel automated data-construction pipeline that annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings, while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.
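The random-modality-masking idea behind the multitask regime can be sketched as follows. This is a minimal illustration under our own assumptions, not Klear's actual implementation: we assume each modality is a sequence of embedding tokens, and that masking a modality replaces its tokens with a learned mask embedding so the same model batch covers video-conditioned audio generation, audio-conditioned video generation, and joint generation. All names, shapes, and probabilities here are hypothetical.

```python
import numpy as np

MASK_AUDIO = 0  # row index of the (assumed) learned audio-mask embedding
MASK_VIDEO = 1  # row index of the (assumed) learned video-mask embedding

def random_modality_mask(audio, video, mask_emb, p=(0.25, 0.25, 0.5), rng=None):
    """Per-sample task sampling via modality masking (illustrative sketch).

    audio:    (B, Ta, D) audio tokens
    video:    (B, Tv, D) video tokens
    mask_emb: (2, D) learned mask embeddings, one per modality
    p:        probabilities of (mask audio, mask video, keep both)

    Returns possibly-masked copies of both token streams plus a per-sample
    task id: 0 -> video-to-audio, 1 -> audio-to-video, 2 -> joint generation.
    """
    rng = rng or np.random.default_rng()
    B = audio.shape[0]
    choice = rng.choice(3, size=B, p=p)
    audio, video = audio.copy(), video.copy()
    for b in range(B):
        if choice[b] == 0:
            # Hide audio: the model must synthesize it from video (V2A task).
            audio[b] = mask_emb[MASK_AUDIO]
        elif choice[b] == 1:
            # Hide video: the model must synthesize it from audio (A2V task).
            video[b] = mask_emb[MASK_VIDEO]
        # choice[b] == 2: both modalities kept -> joint generation task.
    return audio, video, choice
```

Because every batch mixes all three task types, the shared single-tower backbone is optimized jointly, which is one plausible way the abstract's "preventing unimodal collapse" could be realized in training.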