Where, What, Why, and How Important: Structured Defect Grounding for Text-to-Image Feedback

Abstract

Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.
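To make the tuple representation concrete, the following is a minimal Python sketch of a defect set and of one possible way to turn it into a box-derived, importance-weighted spatial signal. It is illustrative only: the abstract does not specify the reward used by BoxFlow-GRPO, so the names `Defect`, `penalty_map`, and `spatial_reward`, the half-open pixel box convention, the [0, 1] importance range, the max rule for overlapping boxes, and the "1 minus mean penalty" summary are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Defect:
    """One SDG-style tuple: (location, type, reason, importance)."""
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; x1, y1 exclusive (assumed)
    defect_type: str                # e.g. "anatomy" (hypothetical label)
    reason: str                     # free-text explanation of the failure
    importance: float               # assumed to lie in [0, 1]


def penalty_map(defects: List[Defect], width: int, height: int) -> List[List[float]]:
    """Rasterize a variable-size defect set into a per-pixel penalty grid.

    Each pixel takes the largest importance among the boxes covering it;
    pixels outside every box stay at 0.0. The max rule is an assumption.
    """
    grid = [[0.0] * width for _ in range(height)]
    for d in defects:
        x0, y0, x1, y1 = d.box
        # Clip the box to the image so out-of-range boxes cannot index past the grid.
        for y in range(max(0, y0), min(height, y1)):
            row = grid[y]
            for x in range(max(0, x0), min(width, x1)):
                row[x] = max(row[x], d.importance)
    return grid


def spatial_reward(defects: List[Defect], width: int, height: int) -> float:
    """Scalar summary of the penalty grid: 1 minus the mean penalty (assumed form)."""
    grid = penalty_map(defects, width, height)
    total = sum(sum(row) for row in grid)
    return 1.0 - total / (width * height)


# Two overlapping defects on a toy 4x4 image.
example_defects = [
    Defect((0, 0, 2, 2), "anatomy", "hand has six fingers", 1.0),
    Defect((1, 1, 3, 3), "texture", "smeared fabric pattern", 0.5),
]
print(spatial_reward(example_defects, 4, 4))  # 0.65625
```

Because the defect set is a plain list, an image with no defects and an image with many are handled by the same code path, and each penalized region stays traceable to the tuple (and reason) that produced it. A heatmap alone would not retain that link.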