D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation
October 22, 2025
Authors: Nobline Yoo, Olga Russakovsky, Ye Zhu
cs.AI
Abstract
Text-to-image (T2I) diffusion models have achieved strong performance in
semantic alignment, yet they still struggle with generating the correct number
of objects specified in prompts. Existing approaches typically incorporate
auxiliary counting networks as external critics to enhance numeracy. However,
since these critics must provide gradient guidance during generation, they are
restricted to regression-based models that are inherently differentiable, thus
excluding detector-based models with superior counting ability, whose
count-via-enumeration nature is non-differentiable. To overcome this
limitation, we propose Detector-to-Differentiable (D2D), a novel framework that
transforms non-differentiable detection models into differentiable critics,
thereby leveraging their superior counting ability to guide numeracy-aware
generation. Specifically, we design custom activation functions to convert
detector logits into soft binary indicators, which are then used to optimize
the noise prior at inference time with pre-trained T2I models. Our extensive
experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of
varying complexity (low-density, high-density, and multi-object scenarios)
demonstrate consistent and substantial improvements in object counting accuracy
(e.g., a boost of up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark),
with minimal degradation in overall image quality and little added
computational overhead.
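
To make the core idea concrete, the following is a minimal sketch of how a count obtained by enumeration can be relaxed into a differentiable quantity. It is not the authors' implementation: the steep sigmoid used as the activation, the names `soft_indicator`, `soft_count`, and `count_loss_and_grad`, the `threshold` and `sharpness` parameters, and the toy logits are all assumptions made for illustration. For simplicity the sketch applies gradient descent directly to the logits, whereas the abstract describes optimizing the noise prior of a pre-trained T2I model.

```python
import math


def soft_indicator(logit, threshold=0.0, sharpness=4.0):
    """Assumed activation: a steep sigmoid, near 1 above threshold and near 0 below."""
    return 1.0 / (1.0 + math.exp(-sharpness * (logit - threshold)))


def hard_count(logits, threshold=0.0):
    """Count by enumeration. Piecewise constant in the logits, so it gives no gradient."""
    return sum(1 for z in logits if z > threshold)


def soft_count(logits, threshold=0.0, sharpness=4.0):
    """Differentiable surrogate: the sum of soft binary indicators."""
    return sum(soft_indicator(z, threshold, sharpness) for z in logits)


def count_loss_and_grad(logits, target, threshold=0.0, sharpness=4.0):
    """Squared count error and its analytic gradient with respect to each logit."""
    s = [soft_indicator(z, threshold, sharpness) for z in logits]
    diff = sum(s) - target
    loss = diff ** 2
    # Chain rule: d(loss)/d(z_i) = 2 * diff * sharpness * s_i * (1 - s_i)
    grad = [2.0 * diff * sharpness * si * (1.0 - si) for si in s]
    return loss, grad


# Toy example (hypothetical logits): four candidate detections score above the
# threshold, but the prompt asks for three objects.
logits = [2.0, 1.5, 0.4, 0.3, -2.0]
target = 3
initial_loss, _ = count_loss_and_grad(logits, target)

step_size = 0.1
for _ in range(200):
    _, grad = count_loss_and_grad(logits, target)
    logits = [z - step_size * g for z, g in zip(logits, grad)]

final_loss, _ = count_loss_and_grad(logits, target)
print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
print(f"soft count after optimization: {soft_count(logits):.3f}")
```

The point of the sketch is only that the relaxed count yields a usable gradient where the enumerated count does not. A plain sigmoid can settle on ambiguous indicator values near 0.5, which is one reason a purpose-built activation may be preferable in practice.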