Where, What, Why, and How Much It Matters: Structured Defect Grounding for Text-to-Image Feedback

Abstract

Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and how much it matters to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize a variable number of defects and to bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.
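
To make the (location, type, reason, importance) tuple and the idea of a box-derived, importance-weighted spatial reward more concrete, the following is a minimal illustrative sketch, not the paper's implementation. The `Defect` class, the `penalty_map` and `spatial_reward` helpers, the example defect types, the assumption that importance lies in [0, 1], and the choice to resolve overlapping boxes by taking the maximum importance are all hypothetical; the abstract does not specify how BoxFlow-GRPO actually computes its rewards.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Defect:
    """One predicted defect, mirroring the (location, type, reason, importance) tuple."""
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; x1 and y1 are exclusive
    type: str                       # hypothetical category label
    reason: str                     # free-text explanation bound to this defect
    importance: float               # assumed to lie in [0, 1]


def penalty_map(defects: List[Defect], width: int, height: int) -> List[List[float]]:
    """Rasterize a variable-size defect set into a per-pixel penalty grid.

    Assumption: where boxes overlap, the most important defect wins (max).
    """
    grid = [[0.0] * width for _ in range(height)]
    for d in defects:
        x0, y0, x1, y1 = d.box
        # Clip the box to the image bounds.
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        for y in range(y0, y1):
            row = grid[y]
            for x in range(x0, x1):
                row[x] = max(row[x], d.importance)
    return grid


def spatial_reward(defects: List[Defect], width: int, height: int) -> List[List[float]]:
    """Turn penalties into rewards: 1.0 for clean pixels, lower inside defect boxes."""
    return [[1.0 - p for p in row] for row in penalty_map(defects, width, height)]


# Usage on a toy 5x5 image with two overlapping, made-up defects.
defects = [
    Defect((1, 1, 3, 3), "anatomy", "hand has six fingers", 0.8),
    Defect((2, 2, 4, 4), "text", "sign lettering is garbled", 0.5),
]
reward = spatial_reward(defects, width=5, height=5)
```

Because the defect set is a plain list, an image with zero, one, or many defects is handled uniformly, which is the practical difference from regressing a fixed-size heatmap. Each box also keeps its own reason string, so the explanation stays attached to the region it describes.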