

Video-T1: Test-Time Scaling for Video Generation

March 24, 2025
Authors: Fangfu Liu, Hanyang Wang, Yimo Cai, Kaiyan Zhang, Xiaohang Zhan, Yueqi Duan
cs.AI

Abstract

With the scaling of training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have extended scaling to test time, significantly improving LLM performance with more inference-time computation. Instead of scaling up video foundation models through expensive training, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality for a challenging text prompt? In this work, we reinterpret test-time scaling for video generation as a search problem: sampling better trajectories from Gaussian noise space toward the target video distribution. Specifically, we build the search space with test-time verifiers that provide feedback and heuristic algorithms that guide the search process. Given a text prompt, we first explore an intuitive linear search strategy that increases the number of noise candidates at inference time. Because full-step denoising of all frames simultaneously incurs heavy test-time computation costs, we further design a more efficient TTS method for video generation, called Tree-of-Frames (ToF), which adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in video quality. Project page: https://liuff19.github.io/Video-T1
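The abstract describes two test-time search strategies: a linear (best-of-N) search over noise candidates scored by test-time verifiers, and Tree-of-Frames (ToF), which autoregressively expands and prunes partial video branches. The sketch below illustrates both ideas in Python under stated assumptions; `sample_noise`, `generate_video`, `generate_next_frames`, and the verifier's `score` method are hypothetical placeholders standing in for a video diffusion model and a reward/verifier model, not the paper's actual implementation.

```python
import heapq

# Hypothetical interfaces (placeholders, not the paper's actual API):
#   model.sample_noise()                      -> a Gaussian noise candidate
#   model.generate_video(prompt, noise)       -> fully denoised video for one noise sample
#   model.generate_next_frames(prompt, prefix)-> extends a partial video by one frame chunk
#   verifier.score(prompt, video)             -> scalar feedback from a test-time verifier

def linear_search_tts(prompt, model, verifier, num_candidates=8):
    """Best-of-N linear search: fully denoise several noise candidates,
    then keep the video the verifier scores highest."""
    best_video, best_score = None, float("-inf")
    for _ in range(num_candidates):
        noise = model.sample_noise()                  # draw a candidate from Gaussian noise space
        video = model.generate_video(prompt, noise)   # full-step denoising of all frames
        score = verifier.score(prompt, video)         # test-time verifier feedback
        if score > best_score:
            best_video, best_score = video, score
    return best_video

def tree_of_frames_tts(prompt, model, verifier,
                       branch_factor=4, beam_width=2, num_segments=6):
    """Tree-of-Frames sketch: grow videos autoregressively, expanding each
    surviving prefix into several continuations and pruning to the
    verifier's top `beam_width` branches after every segment."""
    branches = [([], 0.0)]                            # (frame prefix, running score)
    for _ in range(num_segments):
        candidates = []
        for prefix, _ in branches:
            for _ in range(branch_factor):            # expand: propose several continuations
                extended = model.generate_next_frames(prompt, prefix)
                candidates.append((extended, verifier.score(prompt, extended)))
        # prune: keep only the highest-scoring partial videos
        branches = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return max(branches, key=lambda b: b[1])[0]
```

In this reading, the verifier could be any vision-language or reward model that rates alignment with the prompt; the autoregressive ToF variant avoids spending full-step denoising compute on branches the verifier has already ruled out, which is what makes it cheaper than the linear search at the same candidate budget.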

