Video-T1: Test-Time Scaling for Video Generation

Abstract

As training data, model size, and computational budgets have scaled up, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers working on Large Language Models (LLMs) have extended scaling to test time, showing that additional inference-time computation can significantly improve LLM performance. Instead of scaling up video foundation models through expensive training, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a search problem: sampling better trajectories from Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers that provide feedback and heuristic algorithms that guide the search process. Given a text prompt, we first explore an intuitive linear search strategy that increases the number of noise candidates at inference time. Because running full-step denoising on all frames simultaneously incurs heavy test-time computation costs, we further design a more efficient TTS method for video generation, Tree-of-Frames (ToF), which adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in video quality. Project page: https://liuff19.github.io/Video-T1
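The two search strategies named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "video" here is just a list of floats (one per frame), and `toy_generate`, `toy_verifier`, `linear_search`, and `tree_of_frames_search` are hypothetical names introduced for illustration. The tree variant is a plain beam search over frame prefixes, used as a simplified stand-in, since the abstract does not specify how ToF expands or prunes branches.

```python
import random
from typing import Callable, List, Sequence

Video = List[float]  # toy stand-in: one float per "frame"


def toy_generate(seed: int, num_frames: int = 8) -> Video:
    """Hypothetical generator: maps a noise seed to a toy 'video'."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_frames)]


def toy_verifier(video: Sequence[float]) -> float:
    """Hypothetical verifier: higher is better (small, smooth frames)."""
    closeness = -sum(abs(f) for f in video)
    smoothness = -sum(abs(a - b) for a, b in zip(video, video[1:]))
    return closeness + smoothness


def linear_search(
    n_candidates: int,
    generate: Callable[[int], Video] = toy_generate,
    verify: Callable[[Sequence[float]], float] = toy_verifier,
    base_seed: int = 0,
) -> Video:
    """Best-of-N: fully generate N candidates, keep the top-scoring one."""
    candidates = [generate(base_seed + i) for i in range(n_candidates)]
    return max(candidates, key=verify)


def tree_of_frames_search(
    num_frames: int,
    branch: int,
    keep: int,
    verify: Callable[[Sequence[float]], float] = toy_verifier,
    seed: int = 0,
) -> Video:
    """Simplified beam search: grow frame by frame, prune weak prefixes."""
    rng = random.Random(seed)
    beams: List[Video] = [[]]
    for _ in range(num_frames):
        # Expand: each surviving prefix proposes `branch` next frames.
        expanded = [
            prefix + [rng.gauss(0.0, 1.0)]
            for prefix in beams
            for _ in range(branch)
        ]
        # Prune: score the partial videos and keep only the best `keep`.
        expanded.sort(key=verify, reverse=True)
        beams = expanded[:keep]
    return beams[0]


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, round(toy_verifier(linear_search(n)), 3))
    print("tree:", round(toy_verifier(tree_of_frames_search(8, 4, 2)), 3))
```

In this toy setting, the best-of-N score cannot decrease as N grows, because the candidate sets are nested. The tree variant scores partial videos and discards weak prefixes early, so compute is not spent completing them; whether that trade-off carries over to real video models is what the paper's experiments evaluate.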

