LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations

Abstract

Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text. However, existing T2I models show degraded performance in compositional image generation involving multiple objects and intricate relationships. We attribute this problem to limitations in existing datasets of image-text pairs, which provide prompts only and lack precise inter-object relationship annotations. To address this problem, we construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs (SG), which precisely describe the attributes and relationships of multiple objects and effectively represent the semantic structure of complex scenes. Based on LAION-SG, we train a new foundation model, SDXL-SG, to incorporate structural annotation information into the generation process. Extensive experiments show that advanced models trained on LAION-SG achieve significant performance improvements in complex scene generation over models trained on existing datasets. We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation, establishing a new standard for this domain.
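To make the notion of a scene graph annotation concrete, the following is a minimal sketch of how objects, attributes, and inter-object relationships could be represented. The class names (`SceneObject`, `Relation`, `SceneGraph`), the field layout, and the example scene are hypothetical illustrations, not the actual LAION-SG annotation schema.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """A node in the scene graph: one object and its attributes (hypothetical layout)."""
    name: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class Relation:
    """A directed edge: subject --predicate--> target, referring to objects by index."""
    subject: int
    predicate: str
    target: int


@dataclass
class SceneGraph:
    objects: list[SceneObject]
    relations: list[Relation]

    def triples(self) -> list[tuple[str, str, str]]:
        """Return (subject, predicate, target) triples with attributes prepended to names."""
        def label(i: int) -> str:
            obj = self.objects[i]
            return " ".join(obj.attributes + [obj.name])

        return [(label(r.subject), r.predicate, label(r.target)) for r in self.relations]


# Invented example scene, not taken from the dataset.
graph = SceneGraph(
    objects=[
        SceneObject("cat", ["black"]),
        SceneObject("sofa", ["red"]),
        SceneObject("lamp"),
    ],
    relations=[
        Relation(0, "sitting on", 1),
        Relation(2, "next to", 1),
    ],
)

for triple in graph.triples():
    print(triple)
```

Unlike a flat prompt, this structure states explicitly which attribute belongs to which object and which object each relation connects. That explicit binding is the kind of information the abstract says prompt-only datasets lack.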
