See and Fix the Flaws: Enabling Vision-Language Models and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Abstract

Despite recent advances in diffusion models, AI-generated images still often contain visual artifacts that compromise realism. More thorough pre-training and larger models may reduce artifacts, but there is no assurance that they eliminate them completely, which makes artifact mitigation a crucial area of study. Previous artifact-aware methods depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated way to reliably acquire artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities in real images; a synthesis agent that introduces artifacts using artifact injection tools built on a novel patch-wise embedding manipulation within a diffusion transformer; and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using ArtiAgent, we synthesize 100K images with rich artifact annotations and demonstrate both efficacy and versatility across diverse applications. Code is available at link.
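As a reading aid, the three-agent data flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ArtiAgent's actual implementation: every name here (`Entity`, `ArtifactSample`, `run_pipeline`, and the `toy_*` stubs) is hypothetical, images are represented by plain string identifiers, and the patch-wise embedding manipulation inside the diffusion transformer is not modeled at all.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical convention


@dataclass
class Entity:
    """A grounded entity, optionally with grounded subentities."""
    name: str
    box: Box
    subentities: List["Entity"] = field(default_factory=list)


@dataclass
class ArtifactSample:
    """One real / artifact-injected pair with its annotations."""
    real_image: str
    artifact_image: str
    region: Box
    local_explanation: str
    global_explanation: str


def run_pipeline(
    image: str,
    perceive: Callable[[str], List[Entity]],
    inject: Callable[[str, Entity], str],
    curate: Callable[[str, str, Entity], Optional[ArtifactSample]],
) -> List[ArtifactSample]:
    """Chain perception -> synthesis -> curation for a single real image."""
    samples: List[ArtifactSample] = []
    for entity in perceive(image):
        # Each entity and each of its subentities is a candidate target.
        for target in [entity, *entity.subentities]:
            corrupted = inject(image, target)
            sample = curate(image, corrupted, target)
            if sample is not None:  # curation may reject a synthesis
                samples.append(sample)
    return samples


# --- Toy stand-ins for the three agents (assumptions, for illustration) ---

def toy_perceive(image: str) -> List[Entity]:
    hand = Entity("hand", (40, 90, 60, 110))
    eye = Entity("eye", (45, 10, 50, 13))
    return [Entity("person", (0, 0, 100, 200), [hand, eye])]


def toy_inject(image: str, target: Entity) -> str:
    # A real synthesis agent would edit pixels; here we only tag the identifier.
    return f"{image}#artifact:{target.name}"


def toy_curate(real: str, corrupted: str, target: Entity) -> Optional[ArtifactSample]:
    x0, y0, x1, y1 = target.box
    if (x1 - x0) * (y1 - y0) < 100:  # arbitrary filter: drop tiny regions
        return None
    return ArtifactSample(
        real_image=real,
        artifact_image=corrupted,
        region=target.box,
        local_explanation=f"The {target.name} region looks distorted.",
        global_explanation=f"The image contains an artifact on the {target.name}.",
    )
```

Calling `run_pipeline("img.png", toy_perceive, toy_inject, toy_curate)` with these stubs yields samples for the person and the hand, while the small eye region is rejected by the toy curation filter. Passing the agents in as callables is only a convenience of this sketch, so that each stage can be swapped out independently.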