When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
April 9, 2026
Authors: Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai
cs.AI
Abstract
Text-to-video diffusion models have enabled open-ended video synthesis, but they often struggle to generate the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench benchmark, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. Furthermore, CLIP alignment improves while temporal consistency is maintained. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
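The "identify" step described in the abstract can be sketched as follows. This is a minimal illustration only: the head-selection scores, top-k aggregation, fixed threshold, and connected-component counting are assumptions for the sketch, not the paper's actual procedure (see the repository for the real implementation).

```python
import numpy as np

def _count_components(mask):
    """Count 4-connected components in a boolean 2-D mask (simple flood fill)."""
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                count += 1
                stack = [(i, j)]
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and mask[y, x] and not seen[y, x]:
                        seen[y, x] = True
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return count

def identify_mismatch(attn_maps, head_scores, expected, top_k=2, thresh=0.5):
    """Illustrative identify step of an identify-then-guide loop.

    attn_maps:   (num_heads, H, W) cross-attention maps for the counted noun token.
    head_scores: per-head discriminativeness scores (assumed precomputed).
    Returns (layout_mask, detected_count, mismatch_flag).
    """
    top = np.argsort(head_scores)[-top_k:]              # keep the most discriminative heads
    layout = attn_maps[top].mean(axis=0)                # aggregate into one countable layout
    layout = (layout - layout.min()) / (np.ptp(layout) + 1e-8)
    mask = layout > thresh                              # binarize into candidate instances
    detected = _count_components(mask)
    return mask, detected, detected != expected
```

When the returned mismatch flag is set, the framework would then refine the layout and modulate cross-attention during regeneration; that guide step is not shown here.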