VISTA: A Test-Time Self-Improving Video Generation Agent

October 17, 2025
Authors: Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, Sercan Ö. Arık
cs.AI

Abstract

Despite rapid advances in text-to-video synthesis, the quality of generated video remains critically dependent on precise user prompts. Existing test-time optimization methods, though successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation by refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to a 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
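
The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`pairwise_tournament`, `self_improve`, and the `plan`, `generate`, `judge`, `critics`, and `rewrite` callables) is hypothetical, the single-elimination bracket is only one possible reading of "pairwise tournament", and carrying the previous winner into the next round is an assumption the abstract does not state.

```python
from typing import Any, Callable, Dict, List, Tuple

Video = Any  # placeholder type; a real system would use a video handle


def pairwise_tournament(candidates: List[Video],
                        judge: Callable[[Video, Video], Video]) -> Video:
    """Single-elimination bracket; judge(a, b) returns the preferred one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = [judge(pool[i], pool[i + 1])
                      for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd one out gets a bye
        pool = next_round
    return pool[0]


def self_improve(idea: str,
                 plan: Callable[[str], str],
                 generate: Callable[[str], Video],
                 judge: Callable[[Video, Video], Video],
                 critics: Dict[str, Callable[[Video, str], str]],
                 rewrite: Callable[[str, Dict[str, str]], str],
                 iterations: int = 3,
                 samples: int = 4) -> Tuple[Video, str]:
    """Plan -> generate -> select -> critique -> rewrite, repeated."""
    prompt = plan(idea)  # structured temporal plan, used as the first prompt
    best = None
    for _ in range(iterations):
        videos = [generate(prompt) for _ in range(samples)]
        # Assumption: the previous winner competes again in the next round.
        contenders = videos if best is None else videos + [best]
        best = pairwise_tournament(contenders, judge)
        # One critique per specialized critic (e.g. visual, audio, context).
        feedback = {name: critic(best, prompt)
                    for name, critic in critics.items()}
        # A reasoning step merges the critiques into a revised prompt.
        prompt = rewrite(prompt, feedback)
    return best, prompt
```

In practice each callable would wrap a model call (a video generator, a multimodal judge, and so on); passing them in as parameters keeps the control flow separate from any particular model or API.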