When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Abstract

Text-to-video diffusion models have enabled open-ended video synthesis, but they often struggle to generate the exact number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the newly introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while temporal consistency is maintained. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
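To make the idea of a "countable" layout concrete, the following is a minimal sketch of how an instance count might be read off a single 2D attention map and compared with the numeral in the prompt. It is an illustration under stated assumptions, not NUMINA's actual procedure: the fixed threshold, the 4-connected flood fill, and the names `count_instances`, `count_mismatch`, and `toy_map` are all hypothetical and do not come from the paper or its repository.

```python
from collections import deque


def count_instances(attn_map, threshold=0.5):
    """Count 4-connected regions whose values exceed `threshold`.

    `attn_map` is a 2D list of floats standing in for one frame's
    attention map (a hypothetical input, not NUMINA's real layout).
    """
    if not attn_map or not attn_map[0]:
        return 0
    rows, cols = len(attn_map), len(attn_map[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or attn_map[r][c] <= threshold:
                continue
            # New above-threshold cell: flood-fill its whole region.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and attn_map[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def count_mismatch(attn_map, target_count, threshold=0.5):
    """Return detected count minus the count requested in the prompt."""
    return count_instances(attn_map, threshold) - target_count


# Toy map with three separate high-attention blobs.
toy_map = [
    [0.9, 0.8, 0.0, 0.0, 0.7],
    [0.9, 0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.6, 0.0, 0.0],
]
detected = count_instances(toy_map)    # 3 blobs
mismatch = count_mismatch(toy_map, 4)  # -1: one instance short of "four"
```

A negative mismatch would mean the layout holds fewer instances than the prompt asks for, and a positive one would mean it holds more. How NUMINA itself selects heads, builds the layout, and refines it is described in the paper and the linked repository, not in this sketch.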
