How Confident are Video Models? Empowering Video Models to Express their Uncertainty

Abstract

Generative video models demonstrate impressive text-to-video capabilities, spurring widespread adoption in many real-world applications. However, like large language models (LLMs), video generation models tend to hallucinate, producing plausible videos even when they are factually wrong. Although uncertainty quantification (UQ) of LLMs has been extensively studied in prior work, no UQ method for video models exists, raising critical safety concerns. To our knowledge, this paper represents the first work towards quantifying the uncertainty of video models. We present a framework for uncertainty quantification of generative video models, consisting of: (i) a metric for evaluating the calibration of video models based on robust rank correlation estimation with no stringent modeling assumptions; (ii) a black-box UQ method for video models (termed S-QUBED), which leverages latent modeling to rigorously decompose predictive uncertainty into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate benchmarking calibration in video models. By conditioning the generation task in the latent space, we disentangle uncertainty arising due to vague task specifications from that arising from lack of knowledge. Through extensive experiments on benchmark video datasets, we demonstrate that S-QUBED computes calibrated total uncertainty estimates that are negatively correlated with the task accuracy and effectively computes the aleatoric and epistemic constituents.
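The calibration criterion described in the abstract, total uncertainty that is negatively correlated with task accuracy, can be illustrated with a rank correlation check. The sketch below is only an illustration under stated assumptions: the abstract does not say which rank correlation estimator is used, so Kendall's tau-a is assumed here, and the function name `kendall_tau` and the sample values are hypothetical rather than taken from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs.

    Assumption: this estimator stands in for the "robust rank correlation
    estimation" mentioned in the abstract, which does not name a statistic.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        prod = (x1 - x2) * (y1 - y2)
        if prod > 0:
            score += 1  # pair ordered the same way in both sequences
        elif prod < 0:
            score -= 1  # pair ordered in opposite ways
        # ties contribute nothing
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return score / n_pairs


# Hypothetical per-prompt values, made up for illustration only.
uncertainty = [0.1, 0.4, 0.35, 0.8, 0.9]
accuracy = [0.95, 0.7, 0.75, 0.4, 0.2]

tau = kendall_tau(uncertainty, accuracy)
print(f"Kendall tau between uncertainty and accuracy: {tau:.2f}")
```

A value of tau close to -1 would indicate that higher uncertainty consistently accompanies lower accuracy, which is the behavior the abstract associates with calibrated estimates.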
