VQQA: An Agentic Approach for Video Evaluation and Quality Improvement
March 12, 2026
Authors: Yiwen Song, Tomas Pfister, Yale Song
cs.AI
Abstract
Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or dependent on white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified multi-agent framework that generalizes across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.
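For intuition, below is a minimal sketch of the closed-loop process the abstract describes: generate a video, probe it with VLM-generated visual questions, and use the critiques from failed questions as a textual "semantic gradient" to rewrite the prompt. All names here (QAResult, generate_video, gen_questions, answer, revise) and the pass-rate stopping rule are illustrative assumptions, not the paper's actual interfaces or agent design.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class QAResult:
    passed: bool   # did the video satisfy this visual question?
    critique: str  # natural-language explanation of the failure

def vqqa_refine(
    prompt: str,
    generate_video: Callable[[str], object],     # black-box T2V/I2V generator
    gen_questions: Callable[[str], List[str]],   # VLM: prompt -> visual questions
    answer: Callable[[object, str], QAResult],   # VLM: (video, question) -> verdict
    revise: Callable[[str, List[str]], str],     # LLM: (prompt, critiques) -> new prompt
    max_steps: int = 3,
    pass_threshold: float = 0.9,
) -> Tuple[object, str]:
    """Closed-loop prompt refinement driven by VLM question answering (sketch)."""
    video = None
    for _ in range(max_steps):
        video = generate_video(prompt)
        questions = gen_questions(prompt)
        results = [answer(video, q) for q in questions]
        # Stop once enough questions pass; otherwise refine the prompt.
        if sum(r.passed for r in results) / max(len(results), 1) >= pass_threshold:
            break
        # The failed critiques act as a "semantic gradient": actionable text
        # that steers the next prompt without any access to model internals.
        critiques = [r.critique for r in results if not r.passed]
        prompt = revise(prompt, critiques)
    return video, prompt
```

Because the loop touches the generator only through its text prompt and the VLM only through question answering, the whole procedure stays black-box, consistent with the interface the abstract describes.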