
Block Cascading: Training Free Acceleration of Block-Causal Video Models

November 25, 2025
Authors: Hmrishav Bandyopadhyay, Nikhil Pinnaparaju, Rahim Entezari, Jim Scott, Yi-Zhe Song, Varun Jampani
cs.AI

Abstract

Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades in which multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, and 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates the KV-recaching overhead (~200 ms) incurred at context switches during interactive generation. Extensive evaluations across multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
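The cascade described in the abstract can be pictured as a wavefront schedule over (block, denoising-step) pairs. Below is a minimal, illustrative Python sketch of that schedule, not the authors' implementation: the `denoise_step` callable, its signature, and the single-latent context are assumptions for illustration, and the loop here runs sequentially where the paper distributes each wave across GPUs.

```python
import torch

def block_cascade(noisy_blocks, denoise_step, num_steps):
    """Staggered ("cascaded") denoising of video blocks.

    Instead of fully denoising block k before block k+1 starts,
    block k+1 begins as soon as block k has taken one denoising
    step, conditioning on the partially denoised latent. The
    wavefront loop makes the dependency structure explicit; in the
    paper's setup the steps within one wave run concurrently on
    separate GPUs, which is the source of the ~2x speedup.
    """
    num_blocks = len(noisy_blocks)
    latents = [b.clone() for b in noisy_blocks]

    # Wave w advances every in-flight block by one step. Block k is
    # in flight during waves k .. k + num_steps - 1, so its step
    # index on wave w is s = w - k.
    for wave in range(num_blocks + num_steps - 1):
        # Iterate newest-to-oldest so each block reads its
        # predecessor's latent from the *previous* wave: the
        # predecessor is always exactly one denoising step ahead
        # (partially denoised, not finished).
        for k in reversed(range(num_blocks)):
            s = wave - k
            if s < 0 or s >= num_steps:
                continue  # block k is not in flight on this wave
            context = latents[k - 1] if k > 0 else None
            latents[k] = denoise_step(latents[k], context, step=s)
    return latents

if __name__ == "__main__":
    # Toy demo: 3 latent blocks, 4 denoising steps, a stand-in
    # (hypothetical) model call in place of the real denoiser.
    blocks = [torch.randn(4, 16, 16) for _ in range(3)]
    dummy = lambda x, ctx, step: x * 0.9
    out = block_cascade(blocks, dummy, num_steps=4)
```

Under this schedule, total latency scales with num_blocks + num_steps - 1 waves rather than num_blocks * num_steps sequential steps; with enough GPUs to cover one wave (five in the paper's experiments), each wave costs roughly one model forward pass.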