How Confident are Video Models? Empowering Video Models to Express their Uncertainty

Abstract

Generative video models demonstrate impressive text-to-video capabilities, spurring widespread adoption in many real-world applications. However, like large language models (LLMs), video generation models tend to hallucinate, producing plausible-looking videos even when they are factually wrong. Although uncertainty quantification (UQ) for LLMs has been studied extensively in prior work, no UQ method exists for video models, which raises critical safety concerns. To our knowledge, this paper is the first work toward quantifying the uncertainty of video models. We present a framework for UQ of generative video models, consisting of: (i) a metric for evaluating the calibration of video models, based on robust rank correlation estimation with no stringent modeling assumptions; (ii) a black-box UQ method for video models, termed S-QUBED, which leverages latent modeling to rigorously decompose predictive uncertainty into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate benchmarking calibration in video models. By conditioning the generation task in the latent space, we disentangle uncertainty arising from vague task specifications from uncertainty arising from a lack of knowledge. Through extensive experiments on benchmark video datasets, we demonstrate that S-QUBED computes calibrated total uncertainty estimates that are negatively correlated with task accuracy, and that it effectively computes the aleatoric and epistemic constituents.
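The calibration criterion in item (i) can be illustrated with a small sketch. The abstract does not say which rank correlation estimator the metric uses, so the example below assumes Kendall's tau (the tau-a variant, with ties counted as neither concordant nor discordant) purely for illustration. The function name `kendall_tau` and the per-prompt values are hypothetical and are not taken from the paper. Under this assumption, a well-calibrated model would show a clearly negative correlation between its uncertainty estimates and its task accuracy.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumption: this estimator stands in for the paper's calibration metric,
    which the abstract describes only as "robust rank correlation estimation".
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # pair ordered the same way in both sequences
            score += 1
        elif product < 0:    # pair ordered in opposite ways
            score -= 1
        # product == 0 is a tie and contributes nothing
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return score / n_pairs


# Hypothetical per-prompt values, invented for illustration only.
uncertainty = [0.10, 0.35, 0.50, 0.80, 0.95]  # total uncertainty estimates
accuracy = [0.92, 0.85, 0.60, 0.70, 0.20]     # task accuracy scores

tau = kendall_tau(uncertainty, accuracy)
print(f"Kendall tau between uncertainty and accuracy: {tau:.2f}")
```

Because a rank correlation depends only on the ordering of the values, it does not require uncertainty and accuracy to share a scale or follow a particular distribution, which is consistent with the "no stringent modeling assumptions" framing in the abstract.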
