SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Abstract

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) A Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) A Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only ~213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at 36× higher throughput, enabling scalable world modeling.
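
To make the memory argument behind design (1) concrete, the following is a minimal sketch of the generic token-wise gated delta rule that Gated DeltaNet is built on. It is not SANA-WM's frame-wise variant, whose details the abstract does not give. The function names, shapes, and the scalar gates `alpha` (decay) and `beta` (write strength) are illustrative assumptions. The point it shows is that the recurrent state has a fixed size of d_v × d_k regardless of sequence length, unlike a softmax attention cache that grows with the number of tokens.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a gated delta rule (illustrative sketch, not the paper's code).

    S: (d_v, d_k) recurrent state; q, k: (d_k,); v: (d_v,)
    alpha, beta: scalars in [0, 1] (assumed per-token gates).
    Update: S <- alpha * S (I - beta k k^T) + beta v k^T, output o = S q.
    """
    k = k / (np.linalg.norm(k) + 1e-6)           # unit-norm key
    S = alpha * (S - beta * np.outer(S @ k, k))  # decay, then erase what was stored at k
    S = S + beta * np.outer(v, k)                # write the new value at k
    return S, S @ q

def gated_delta_scan(Q, K, V, alpha, beta):
    """Run the recurrence over T tokens; the state size never depends on T."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.empty((T, d_v))
    for t in range(T):
        S, out[t] = gated_delta_step(S, Q[t], K[t], V[t], alpha[t], beta[t])
    return out, S

# Usage: 1000 tokens, yet the state stays (d_v, d_k) = (8, 16).
rng = np.random.default_rng(0)
T, d_k, d_v = 1000, 16, 8
Q, K, V = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)), rng.normal(size=(T, d_v))
out, S = gated_delta_scan(Q, K, V, alpha=np.full(T, 0.99), beta=np.full(T, 0.5))
print(out.shape, S.shape)
```

With `alpha = 1` and `beta = 1`, writing a value at a key and then querying with the same key returns that value, and a second write at the same key overwrites the first. A hybrid design in the sense of the abstract would interleave layers of this kind with ordinary softmax attention layers; how SANA-WM arranges them is not stated here.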