How Confident Are Video Models? Empowering Video Models to Express Their Uncertainty

Abstract

Generative video models demonstrate impressive text-to-video capabilities, spurring widespread adoption in many real-world applications. However, like large language models (LLMs), video generation models tend to hallucinate, producing plausible videos even when they are factually wrong. Although uncertainty quantification (UQ) of LLMs has been extensively studied in prior work, no UQ method for video models exists, raising critical safety concerns. To our knowledge, this paper is the first work towards quantifying the uncertainty of video models. We present a framework for uncertainty quantification of generative video models, consisting of: (i) a metric for evaluating the calibration of video models, based on robust rank correlation estimation with no stringent modeling assumptions; (ii) a black-box UQ method for video models (termed S-QUBED), which leverages latent modeling to rigorously decompose predictive uncertainty into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate benchmarking calibration in video models. By conditioning the generation task in the latent space, we disentangle uncertainty arising from vague task specifications from uncertainty arising from a lack of knowledge. Through extensive experiments on benchmark video datasets, we demonstrate that S-QUBED computes calibrated total uncertainty estimates that are negatively correlated with task accuracy, and that it effectively computes the aleatoric and epistemic constituents.
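
As a rough illustration of the rank-correlation idea behind the calibration metric, the sketch below computes Kendall's tau between per-task uncertainty estimates and task accuracy. The abstract does not say which rank correlation estimator is used, so Kendall's tau is an assumption made here, and the function name `kendall_tau` and all numbers are hypothetical values chosen for illustration, not results from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumption: the paper's exact estimator is not given in the abstract;
    tau-a is used here only as a simple rank correlation example.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count toward neither total.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-task values, invented for this example.
uncertainty = [0.10, 0.35, 0.50, 0.80, 0.90]
accuracy = [0.95, 0.80, 0.85, 0.40, 0.30]

tau = kendall_tau(uncertainty, accuracy)
print(f"Kendall's tau = {tau:.2f}")
```

A tau close to -1 would mean that higher uncertainty consistently accompanies lower accuracy, which is the behavior the abstract describes for a calibrated model. Because the statistic depends only on ranks, it makes no assumption about the functional form linking uncertainty to accuracy.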