Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment

Abstract

Many real-world user queries (e.g. "How do I make egg fried rice?") could benefit from systems that respond with both textual steps and accompanying images, similar to a cookbook. Models designed to generate interleaved text and images face challenges in ensuring consistency within and across these modalities. To address these challenges, we present ISG, a comprehensive evaluation framework for interleaved text-and-image generation. ISG leverages a scene graph structure to capture relationships between text and image blocks, evaluating responses at four levels of granularity: holistic, structural, block-level, and image-specific. This multi-tiered evaluation allows for a nuanced assessment of consistency, coherence, and accuracy, and provides interpretable question-answer feedback. In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories. This benchmark dataset includes complex language-vision dependencies and golden answers to evaluate models effectively on vision-centric tasks such as style transfer, a challenging area for current models. Using ISG-Bench, we demonstrate that recent unified vision-language models perform poorly at generating interleaved content. While compositional approaches that combine separate language and image models show a 111% improvement over unified models at the holistic level, their performance remains suboptimal at both the block and image levels. To facilitate future work, we develop ISG-Agent, a baseline agent employing a "plan-execute-refine" pipeline to invoke tools, achieving a 122% performance improvement.
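To make the scene graph idea concrete, the following is a minimal sketch of how an interleaved response and its text-image relations might be represented, with a structural check and a list of block-level questions. It is not the authors' implementation: the names `Block`, `Relation`, `InterleavedSceneGraph`, `structure_matches`, and `block_questions`, as well as the example file names, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Literal

BlockKind = Literal["text", "image"]


@dataclass
class Block:
    kind: BlockKind
    content: str  # the text itself, or a path/ID for an image (hypothetical)


@dataclass
class Relation:
    source: int    # index of the first block in the response
    target: int    # index of the second block in the response
    question: str  # question to ask about this pair of blocks


@dataclass
class InterleavedSceneGraph:
    # Assumed representation: the expected order of block kinds plus
    # question-bearing edges between blocks.
    expected_structure: list[BlockKind]
    relations: list[Relation] = field(default_factory=list)

    def structure_matches(self, response: list[Block]) -> bool:
        """Structural check: does the response follow the expected kind order?"""
        return [b.kind for b in response] == self.expected_structure

    def block_questions(self, response: list[Block]) -> list[tuple[str, str, str]]:
        """Block-level check inputs: (source content, target content, question).

        Relations that point past the end of the response are skipped.
        """
        n = len(response)
        return [
            (response[r.source].content, response[r.target].content, r.question)
            for r in self.relations
            if 0 <= r.source < n and 0 <= r.target < n
        ]


graph = InterleavedSceneGraph(
    expected_structure=["text", "image", "text", "image"],
    relations=[Relation(0, 1, "Does the image show the step described in the text?")],
)
response = [
    Block("text", "Beat two eggs."),
    Block("image", "step1.png"),
    Block("text", "Fry the rice, then add the eggs."),
    Block("image", "step2.png"),
]
structure_ok = graph.structure_matches(response)
questions = graph.block_questions(response)
```

In this sketch the questions are only collected, not answered; answering them would require a separate judge (for example a vision-language model), which is outside what the abstract specifies.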
