Stream-T1: Test-Time Scaling for Streaming Video Generation
May 6, 2026
Authors: Yijing Tu, Shaojin Wu, Mengqi Huang, Wenchuan Wang, Yuxin Wang, Chunxiao Liu, Zhendong Mao
cs.AI
Abstract
While Test-Time Scaling (TTS) offers a promising direction for enhancing video generation without the surging costs of training, current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and a lack of temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We identify that its chunk-level synthesis and few denoising steps are intrinsically suited to TTS, significantly lowering computational overhead while enabling fine-grained temporal control. Driven by this insight, we introduce Stream-T1, a pioneering comprehensive TTS framework tailored exclusively for streaming video generation. Specifically, Stream-T1 is composed of three units: (1) Stream-Scaled Noise Propagation, which actively refines the initial latent noise of the generating chunk using historically proven, high-quality previous-chunk noise, effectively establishing temporal dependencies and utilizing the historical Gaussian prior to guide the current generation; (2) Stream-Scaled Reward Pruning, which comprehensively evaluates generated candidates to strike an optimal balance between local spatial aesthetics and global temporal coherence by integrating immediate short-term assessments with sliding-window-based long-term evaluations; (3) Stream-Scaled Memory Sinking, which dynamically routes the context evicted from the KV cache into distinct updating pathways guided by reward feedback, ensuring that previously generated visual information effectively anchors and guides the subsequent video stream. Evaluated on both 5s and 30s comprehensive video benchmarks, Stream-T1 demonstrates clear superiority, significantly improving temporal consistency, motion smoothness, and frame-level visual quality.
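To make the chunk-level TTS loop concrete, the following is a minimal, hypothetical sketch of how noise propagation and reward pruning could interact in a streaming generator. Everything here is an assumption for illustration only: the `denoise`, `short_term_reward`, and `long_term_reward` functions are toy stand-ins (not the paper's models or reward heads), and the mixing coefficients, candidate count, and window size are made-up constants, not values from Stream-T1.

```python
import numpy as np

rng = np.random.default_rng(0)
CHUNK_SHAPE = (4, 8)   # (frames per chunk, latent dim) -- toy sizes, not from the paper
NUM_CANDIDATES = 4     # candidate noises explored per chunk (hypothetical)
WINDOW = 3             # sliding window for the long-term reward (hypothetical)

def denoise(noise, prev_chunk):
    # Toy stand-in for a few-step streaming denoiser: blend the noise with
    # the previous chunk so outputs inherit some temporal context.
    if prev_chunk is None:
        return np.tanh(noise)
    return np.tanh(0.5 * noise + 0.5 * prev_chunk)

def short_term_reward(chunk):
    # Toy proxy for local spatial quality: penalize in-chunk variance.
    return -float(np.var(chunk))

def long_term_reward(history, chunk):
    # Toy proxy for temporal coherence over a sliding window of past chunks:
    # reward similarity to recent history (negative mean absolute distance).
    recent = history[-WINDOW:]
    if not recent:
        return 0.0
    return -float(np.mean([np.abs(chunk - h).mean() for h in recent]))

def generate_stream(num_chunks):
    history, best_noise = [], None
    for _ in range(num_chunks):
        candidates = []
        for _ in range(NUM_CANDIDATES):
            noise = rng.standard_normal(CHUNK_SHAPE)
            if best_noise is not None:
                # Noise propagation: warm-start from the previous chunk's
                # winning noise, mixed with fresh Gaussian noise.
                noise = 0.7 * best_noise + 0.3 * noise
            chunk = denoise(noise, history[-1] if history else None)
            reward = short_term_reward(chunk) + long_term_reward(history, chunk)
            candidates.append((reward, noise, chunk))
        # Reward pruning: keep only the best-scoring candidate for this chunk;
        # its noise seeds the search for the next chunk.
        reward, best_noise, best_chunk = max(candidates, key=lambda c: c[0])
        history.append(best_chunk)
    return history

chunks = generate_stream(5)
print(len(chunks), chunks[0].shape)  # 5 chunks of shape (4, 8)
```

The key point the sketch captures is why streaming synthesis suits TTS: candidates are explored per chunk rather than per full video, so the exploration cost stays bounded, and the winning noise and recent history give the next chunk explicit temporal guidance.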