VISTA: A Test-Time Self-Improving Video Generation Agent
October 17, 2025
Authors: Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, Sercan Ö. Arık
cs.AI
Abstract
Despite rapid advances in text-to-video synthesis, generated video quality
remains critically dependent on precise user prompts. Existing test-time
optimization methods, successful in other domains, struggle with the
multi-faceted nature of video. In this work, we introduce VISTA (Video
Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously
improves video generation through refining prompts in an iterative loop. VISTA
first decomposes a user idea into a structured temporal plan. After generation,
the best video is identified through a robust pairwise tournament. This winning
video is then critiqued by a trio of specialized agents focusing on visual,
audio, and contextual fidelity. Finally, a reasoning agent synthesizes this
feedback to introspectively rewrite and enhance the prompt for the next
generation cycle. Experiments on single- and multi-scene video generation
scenarios show that while prior methods yield inconsistent gains, VISTA
consistently improves video quality and alignment with user intent, achieving
up to 60% pairwise win rate against state-of-the-art baselines. Human
evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
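
For readers who want a concrete picture of the loop described above, here is a minimal sketch of the plan → generate → select → critique → rewrite cycle. It is not the authors' implementation: all helper callables (plan_scenes, generate_video, judge_pair, critique, rewrite_prompt) are hypothetical placeholders standing in for VISTA's planning, generation, tournament-judging, critic, and prompt-rewriting agents.

```python
"""Illustrative sketch of an iterative prompt-refinement loop (not VISTA's API)."""
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List


@dataclass
class Critique:
    visual: str    # feedback on visual fidelity
    audio: str     # feedback on audio fidelity
    context: str   # feedback on contextual fidelity / alignment with user intent


def pairwise_tournament(videos: List[str],
                        judge_pair: Callable[[str, str], str]) -> str:
    """Select the best candidate by counting round-robin pairwise wins."""
    wins = {v: 0 for v in videos}
    for a, b in combinations(videos, 2):
        wins[judge_pair(a, b)] += 1  # judge returns whichever of a, b it prefers
    return max(videos, key=lambda v: wins[v])


def self_improvement_loop(
    user_idea: str,
    plan_scenes: Callable[[str], str],           # idea -> structured temporal plan / prompt
    generate_video: Callable[[str], List[str]],  # prompt -> candidate videos
    judge_pair: Callable[[str, str], str],       # (video_a, video_b) -> preferred video
    critique: Callable[[str], Critique],         # winning video -> tri-aspect critique
    rewrite_prompt: Callable[[str, Critique], str],  # (prompt, critique) -> refined prompt
    iterations: int = 3,
) -> str:
    """Run the iterative cycle: plan, generate, pick a winner, critique it, rewrite the prompt."""
    prompt = plan_scenes(user_idea)
    best = ""
    for _ in range(iterations):
        candidates = generate_video(prompt)
        best = pairwise_tournament(candidates, judge_pair)
        feedback = critique(best)
        prompt = rewrite_prompt(prompt, feedback)
    return best
```

The tournament step is shown as a simple round-robin vote for illustration; the paper describes a robust pairwise tournament whose exact scoring may differ.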