Stream-T1: Test-Time Scaling for Streaming Video Generation

Abstract

Test-Time Scaling (TTS) offers a promising way to improve video generation without the steep cost of additional training, but current test-time video generation methods based on diffusion models suffer from exorbitant candidate exploration costs and lack temporal guidance. To address these structural bottlenecks, we propose shifting the focus to streaming video generation. We observe that its chunk-level synthesis and small number of denoising steps are intrinsically suited to TTS: they lower computational overhead substantially while enabling fine-grained temporal control. Driven by this insight, we introduce Stream-T1, a comprehensive TTS framework designed specifically for streaming video generation. Stream-T1 is composed of three units: (1) Stream-Scaled Noise Propagation, which refines the initial latent noise of the chunk being generated using the noise of previous chunks that has already proven to yield high quality, establishing temporal dependency and using the historical Gaussian prior to guide the current generation; (2) Stream-Scaled Reward Pruning, which evaluates generated candidates by combining immediate short-term assessments with sliding-window-based long-term evaluations, balancing local spatial aesthetics against global temporal coherence; (3) Stream-Scaled Memory Sinking, which routes the context evicted from the KV cache into distinct update pathways according to the reward feedback, so that previously generated visual information anchors and guides the subsequent video stream. Evaluated on both 5s and 30s video benchmarks, Stream-T1 shows clear gains, significantly improving temporal consistency, motion smoothness, and frame-level visual quality.
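To make the reward pruning idea concrete, the following is a minimal sketch of per-chunk candidate selection that blends a short-term score with a sliding-window long-term score. It is an illustration under stated assumptions, not the authors' implementation: the names `prune_stream`, `short_fn`, `long_fn`, `window`, and `long_weight` are hypothetical, the linear blend is an assumed combination rule, and each chunk is stood in for by a single float rather than decoded video.

```python
from collections import deque


def prune_stream(candidates_per_chunk, short_fn, long_fn, window=3, long_weight=0.5):
    """Keep one candidate per chunk by blending short- and long-term rewards.

    All names and the linear blend are assumptions made for illustration.
    short_fn scores a single candidate chunk; long_fn scores the list of
    recently kept chunks with the candidate appended.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    # The window holds at most `window` chunks, candidate included.
    history = deque(maxlen=window - 1)
    kept = []
    for candidates in candidates_per_chunk:
        best, best_reward = None, float("-inf")
        for cand in candidates:
            short = short_fn(cand)
            long = long_fn(list(history) + [cand])
            reward = (1.0 - long_weight) * short + long_weight * long
            if reward > best_reward:
                best, best_reward = cand, reward
        kept.append(best)
        history.append(best)  # the kept chunk becomes context for later chunks
    return kept


# Toy stand-ins: a chunk is one float, "aesthetics" is the value itself,
# and "coherence" penalizes the spread of values inside the window.
def toy_short(chunk):
    return chunk


def toy_long(chunks):
    return -(max(chunks) - min(chunks))


stream = [[0.5, 0.4], [0.6, 0.9]]
print(prune_stream(stream, toy_short, toy_long, window=2, long_weight=0.7))
print(prune_stream(stream, toy_short, toy_long, window=2, long_weight=0.0))
```

In this toy setup, a high `long_weight` makes the second chunk prefer the candidate closer to the kept history, while `long_weight=0.0` reduces to greedy per-chunk selection on the short-term score alone.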