See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Abstract

Despite recent advances in diffusion models, AI-generated images still often contain visual artifacts that compromise realism. More thorough pre-training and larger models may reduce artifacts, but there is no assurance that they can eliminate them completely, which makes artifact mitigation a crucial area of study. Previous artifact-aware methods depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach that reliably produces artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities in real images; a synthesis agent that introduces artifacts using artifact injection tools built on a novel patch-wise embedding manipulation within a diffusion transformer; and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using ArtiAgent, we synthesize 100K images with rich artifact annotations and demonstrate both efficacy and versatility across diverse applications. Code is available at link.
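To make the synthesis and curation steps more concrete, here is a toy sketch of what a patch-wise embedding manipulation followed by a filter could look like. It is an illustration under stated assumptions, not ArtiAgent's implementation: the abstract does not specify the injection tools, so the copy-based injection, the filter rule, and every name below (`ArtifactSample`, `inject_by_patch_copy`, `keep_sample`) are hypothetical. The image is assumed to be already encoded as a grid of patch embeddings, and a perception step is assumed to have supplied the source box.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (row0, col0, row1, col1) in patch units, end-exclusive
Grid = List[List[List[float]]]    # H x W patches, each holding a small embedding vector


@dataclass
class ArtifactSample:
    grid: Grid              # patch embeddings after manipulation
    mask: List[List[int]]   # 1 where a patch was overwritten (local annotation)
    explanation: str        # placeholder for a textual explanation


def inject_by_patch_copy(grid: Grid, src: Box, dst_origin: Tuple[int, int]) -> ArtifactSample:
    """Hypothetical injection: duplicate the patches under `src` at `dst_origin`."""
    h, w = len(grid), len(grid[0])
    out = [[list(vec) for vec in row] for row in grid]  # deep copy, input stays intact
    mask = [[0] * w for _ in range(h)]
    r0, c0, r1, c1 = src
    dr, dc = dst_origin
    for r in range(r0, r1):
        for c in range(c0, c1):
            tr, tc = dr + (r - r0), dc + (c - c0)
            if 0 <= tr < h and 0 <= tc < w:  # skip targets that fall outside the grid
                out[tr][tc] = list(grid[r][c])
                mask[tr][tc] = 1
    return ArtifactSample(out, mask, f"patches {src} duplicated at {dst_origin}")


def keep_sample(sample: ArtifactSample, original: Grid, min_changed: int = 1) -> bool:
    """Hypothetical curation rule: keep samples whose masked patches actually changed."""
    changed = sum(
        1
        for r, row in enumerate(sample.grid)
        for c, vec in enumerate(row)
        if sample.mask[r][c] and vec != original[r][c]
    )
    return changed >= min_changed
```

In this sketch the mask plays the role of a local annotation and the explanation string stands in for the generated descriptions; a real curation agent would presumably judge the decoded image rather than compare embeddings.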
