Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model

Abstract

The choice of initial noise significantly affects the quality and prompt alignment of video diffusion models, where different noise seeds for the same prompt can lead to drastically different generations. While recent methods rely on externally designed priors such as frequency filters or inter-frame smoothing, they often overlook internal model signals that indicate which noise seeds are inherently preferable. To address this, we propose ANSE (Active Noise Selection for Generation), a model-aware framework that selects high-quality noise seeds by quantifying attention-based uncertainty. At its core is BANSA (Bayesian Active Noise Selection via Attention), an acquisition function that measures entropy disagreement across multiple stochastic attention samples to estimate model confidence and consistency. For efficient inference-time deployment, we introduce a Bernoulli-masked approximation of BANSA that enables score estimation using a single diffusion step and a subset of attention layers. Experiments on CogVideoX-2B and 5B demonstrate that ANSE improves video quality and temporal coherence with only an 8% and 13% increase in inference time, respectively, providing a principled and generalizable approach to noise selection in video diffusion. See our project page: https://anse-project.github.io/anse-project/
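
The idea of scoring a noise seed by entropy disagreement across stochastic attention samples can be illustrated with a small sketch. The abstract does not state the exact formula, so the following is only an assumption: it uses a BALD-style quantity (entropy of the averaged attention minus the average entropy of each sample), draws the stochastic samples by applying a Bernoulli mask to the attention logits, and picks the candidate with the lowest score. The function names, `keep_prob`, `num_samples`, and the toy logits are all hypothetical; in practice the logits would come from a subset of the model's attention layers at a single diffusion step.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of one attention row."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def masked_attention(logits, keep_prob, rng):
    """One stochastic attention sample: Bernoulli-mask the logits, then softmax."""
    sample = []
    for row in logits:
        kept = [rng.random() < keep_prob for _ in row]
        if not any(kept):
            # Keep at least one key so the softmax stays well defined.
            kept[rng.randrange(len(row))] = True
        masked = [x if k else float("-inf") for x, k in zip(row, kept)]
        sample.append(softmax(masked))
    return sample


def disagreement_score(logits, num_samples=8, keep_prob=0.8, seed=0):
    """Assumed BALD-style score: H(mean attention) - mean H(attention sample)."""
    rng = random.Random(seed)
    samples = [masked_attention(logits, keep_prob, rng) for _ in range(num_samples)]
    n_rows, n_cols = len(logits), len(logits[0])
    score = 0.0
    for i in range(n_rows):
        mean_row = [sum(s[i][j] for s in samples) / num_samples for j in range(n_cols)]
        mean_entropy = sum(entropy(s[i]) for s in samples) / num_samples
        score += entropy(mean_row) - mean_entropy
    return score / n_rows


def select_noise(candidate_logits):
    """Return the index of the candidate with the lowest score, plus all scores."""
    scores = [disagreement_score(logits) for logits in candidate_logits]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Toy stand-in: random 4x6 attention logits for three candidate noise seeds.
    rng = random.Random(42)
    candidates = [
        [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
        for _ in range(3)
    ]
    best, scores = select_noise(candidates)
    print("scores:", scores)
    print("selected candidate:", best)
```

Because entropy is concave, this score is non-negative and reaches zero when every stochastic sample agrees, so a low value is read here as high model confidence. Selecting the minimum is an interpretation of the abstract, not a confirmed detail of ANSE.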