When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
April 9, 2026
Authors: Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai
cs.AI
Abstract
Text-to-video diffusion models have enabled open-ended video synthesis, but they often struggle to generate the number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the newly introduced CountBench benchmark, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. It also improves CLIP alignment while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video generation. The code is available at https://github.com/H-EmbodVis/NUMINA.