VQQA: An Agentic Approach for Video Evaluation and Quality Improvement
March 12, 2026
Authors: Yiwen Song, Tomas Pfister, Yale Song
cs.AI
Abstract
Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or dependent on white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified multi-agent framework that generalizes across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.
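For intuition, below is a minimal sketch of the closed-loop process the abstract describes: generate a video, probe it with VLM-generated visual questions, and use the critiques from failed questions as a textual "semantic gradient" to rewrite the prompt. All names here (QAResult, generate_video, gen_questions, answer, revise) and the pass-rate stopping rule are illustrative assumptions, not the paper's actual interfaces or agent design.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class QAResult:
    passed: bool   # did the video satisfy this visual question?
    critique: str  # natural-language explanation of the failure

def vqqa_refine(
    prompt: str,
    generate_video: Callable[[str], object],     # black-box T2V/I2V generator
    gen_questions: Callable[[str], List[str]],   # VLM: prompt -> visual questions
    answer: Callable[[object, str], QAResult],   # VLM: (video, question) -> verdict
    revise: Callable[[str, List[str]], str],     # LLM: (prompt, critiques) -> new prompt
    max_steps: int = 3,
    pass_threshold: float = 0.9,
) -> Tuple[object, str]:
    """Closed-loop prompt refinement driven by VLM question answering (sketch)."""
    video = None
    for _ in range(max_steps):
        video = generate_video(prompt)
        questions = gen_questions(prompt)
        results = [answer(video, q) for q in questions]
        # Stop once enough questions pass; otherwise refine the prompt.
        if sum(r.passed for r in results) / max(len(results), 1) >= pass_threshold:
            break
        # The failed critiques act as a "semantic gradient": actionable text
        # that steers the next prompt without any access to model internals.
        critiques = [r.critique for r in results if not r.passed]
        prompt = revise(prompt, critiques)
    return video, prompt
```

Because the loop touches the generator only through its text prompt and the VLM only through question answering, the whole procedure stays black-box, consistent with the interface the abstract describes.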