Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Abstract

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, which limits the achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that supports strong reasoning-driven generation while significantly improving visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block, so that it learns to accept reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB), which progressively hands generation off to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in benchmarks for reasoning-driven video generation, we introduce VR-Bench, which assesses a model's ability to translate inferred intent into coherent, semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.
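The abstract does not say how UPFB is implemented, so the following is only an illustrative sketch of the general idea of a frequency-based handoff between two generators that share a latent space. It assumes that both generators produce latents of identical shape, that "frequency" means spatial frequency of the latent, and that the handoff follows a linear schedule. The function names (`lowpass_mask`, `blend_by_frequency`, `handoff_cutoff`), the radial mask, and the schedule are all hypothetical and are not taken from the paper.

```python
import numpy as np


def lowpass_mask(shape, cutoff):
    """Boolean (H, W) mask that is True where the spatial frequency
    radius (in cycles per sample) is strictly below `cutoff`."""
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies in [-0.5, 0.5)
    return np.sqrt(fx**2 + fy**2) < cutoff


def blend_by_frequency(coarse, fine, cutoff):
    """Combine two latents of the same shape (..., H, W): frequencies below
    `cutoff` come from `coarse`, all remaining frequencies from `fine`."""
    if coarse.shape != fine.shape:
        raise ValueError("latents must share one shape (shared latent space)")
    mask = lowpass_mask(coarse.shape[-2:], cutoff)
    # fft2 transforms the last two axes; the mask broadcasts over the rest.
    spectrum = np.where(mask, np.fft.fft2(coarse), np.fft.fft2(fine))
    # The mask is symmetric in frequency, so the result is real up to
    # floating-point error; the tiny imaginary part is discarded.
    return np.fft.ifft2(spectrum).real


def handoff_cutoff(step, num_steps, start=0.75):
    """Hypothetical linear schedule: `start` at step 0, 0.0 at the last step.
    A cutoff above sqrt(0.5) keeps every frequency from `coarse`."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    return start * (1.0 - step / (num_steps - 1))
```

Under these assumptions, calling `blend_by_frequency(light_latent, large_latent, handoff_cutoff(t, T))` at each sampling step `t` would let the lightweight generator determine the whole latent at first and only progressively lower frequencies afterwards, until the high-capacity generator determines everything at the final step. Here `light_latent` and `large_latent` are placeholder names for the two generators' outputs.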