Video-T1: Test-Time Scaling for Video Generation
March 24, 2025
作者: Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan
cs.AI
Abstract
With increasing scale in training data, model size, and computational cost,
video generation has achieved impressive results in digital creation, enabling
users to express creativity across various domains. Recently, researchers in
Large Language Models (LLMs) have extended scaling to test time, significantly
improving LLM performance through additional inference-time computation.
Instead of scaling up video foundation models through expensive training, we
explore the power of Test-Time Scaling (TTS) in video generation, aiming to
answer the question: if a video generation model is allowed to use a
non-trivial amount of inference-time compute, how much can it improve
generation quality for a challenging text prompt? In this
work, we reinterpret test-time scaling of video generation as a search
problem: sampling better trajectories from the Gaussian noise space toward the
target video distribution. Specifically, we build the search space with
test-time verifiers to provide feedback and heuristic algorithms to guide the
search process. Given a text prompt, we first explore an intuitive linear
search strategy that increases the number of noise candidates at inference
time. Because full-step denoising of all frames simultaneously incurs heavy
test-time computation costs,
we further design a more efficient TTS method for video generation called
Tree-of-Frames (ToF) that adaptively expands and prunes video branches in an
autoregressive manner. Extensive experiments on text-conditioned video
generation benchmarks demonstrate that increasing test-time compute
consistently leads to significant improvements in the quality of videos.
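The linear search strategy described above amounts to best-of-N sampling: draw several noise candidates, denoise each fully, and keep the video a test-time verifier scores highest. The sketch below illustrates this under stated assumptions; `generate` and `verify` are hypothetical stand-ins for the video diffusion model and the test-time verifier, and a scalar stands in for a noise tensor.

```python
import random

def linear_search_tts(prompt, num_candidates, generate, verify):
    """Best-of-N test-time scaling (sketch): sample several Gaussian
    noise candidates, fully denoise each one, and return the video
    the verifier scores highest."""
    best_video, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise = random.gauss(0.0, 1.0)   # stand-in for a noise tensor
        video = generate(prompt, noise)  # full-step denoising
        score = verify(prompt, video)    # verifier feedback
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score
```

Increasing `num_candidates` is the knob that trades inference-time compute for quality, which is the scaling behavior the experiments measure.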
Project page: https://liuff19.github.io/Video-T1
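Tree-of-Frames, as described in the abstract, grows videos autoregressively while expanding and pruning branches. One natural reading is a beam-style search over frame prefixes: each surviving prefix is expanded into several candidate continuations, scored by the verifier, and pruned to the best few. The sketch below follows that reading; `next_frame`, `verify`, and the `branch`/`beam` parameters are hypothetical illustrations, not the paper's exact algorithm.

```python
def tree_of_frames(prompt, num_frames, branch, beam, next_frame, verify):
    """Tree-of-Frames-style search (sketch): grow videos frame by
    frame, expanding each surviving prefix into `branch` candidates
    and pruning to the `beam` best by verifier score."""
    prefixes = [[]]  # each prefix is a partial frame sequence
    for _ in range(num_frames):
        candidates = []
        for prefix in prefixes:
            for _ in range(branch):
                frame = next_frame(prompt, prefix)   # autoregressive step
                candidates.append(prefix + [frame])
        # score every extended prefix and prune weak branches
        candidates.sort(key=lambda seq: verify(prompt, seq), reverse=True)
        prefixes = candidates[:beam]
    return prefixes[0]
```

Compared with the linear strategy, pruning means verifier compute is spent only on promising prefixes instead of on fully denoising every candidate, which is the efficiency gain the abstract motivates.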