

D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

October 22, 2025
Authors: Nobline Yoo, Olga Russakovsky, Ye Zhu
cs.AI

Abstract

Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide numeracy generation. Specifically, we design custom activation functions to convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., boosting accuracy by up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark), with only minor impact on overall image quality and computational overhead.
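The abstract does not specify the exact form of the custom activation functions, so the following is a minimal PyTorch sketch of the general idea rather than the authors' implementation: a steep sigmoid converts per-detection confidence logits into soft binary indicators, their sum serves as a differentiable object count, and the squared deviation from the prompted count is backpropagated into the initial noise at inference time. All names here (soft_count_loss, optimize_noise_prior, generator, detector, temperature) are hypothetical placeholders.

# Minimal sketch (assumptions noted above), not the authors' code.
import torch

def soft_count_loss(confidence_logits: torch.Tensor,
                    target_count: int,
                    temperature: float = 10.0) -> torch.Tensor:
    """Squash detector logits into soft 0/1 indicators and penalize the
    deviation of their sum (a differentiable 'soft count') from the
    number of objects requested in the prompt."""
    # A steep sigmoid approximates the hard "is this a detection?" threshold
    # while keeping gradients available.
    soft_indicators = torch.sigmoid(temperature * confidence_logits)
    soft_count = soft_indicators.sum()
    return (soft_count - target_count) ** 2

def optimize_noise_prior(noise: torch.Tensor,
                         generator,          # frozen, pre-trained T2I model (e.g., SDXL-Turbo), assumed differentiable
                         detector,           # frozen detector returning per-candidate confidence logits
                         target_count: int,
                         steps: int = 50,
                         lr: float = 0.01) -> torch.Tensor:
    """Inference-time refinement of the initial noise so that the generated
    image contains the prompted number of objects."""
    noise = noise.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        image = generator(noise)                     # differentiable generation pass
        logits = detector(image)                     # confidence logits for detected candidates
        loss = soft_count_loss(logits, target_count)
        opt.zero_grad()
        loss.backward()                              # gradients flow back into the noise prior
        opt.step()
    return noise.detach()

The design choice this illustrates is the one stated in the abstract: the detector itself stays frozen and non-differentiable in its decision rule, but replacing the hard enumeration step with a soft, sigmoid-based count exposes a gradient signal that can steer the noise prior of an off-the-shelf T2I model.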