Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

Abstract

We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To build this benchmark, we use GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a structured "Chain-of-Edit" pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. We also introduce a suite of metrics that assess different aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessment. Our benchmark yields several notable insights: 1) Open-source models significantly underperform proprietary, closed-source models, and the performance gap widens as instruction complexity increases; 2) Increased instruction complexity primarily impairs the models' ability to retain key elements from the input images and to preserve overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps and executing them one at a time substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a "curse of synthetic data": when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises, a phenomenon that intriguingly also appears in the latest GPT-4o outputs.
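As a rough illustration of the Best-of-N selection mentioned in finding 4, the sketch below picks the highest-scoring item from N candidate edits. It is a minimal sketch, not the authors' implementation: the function name `best_of_n`, the candidate labels, and the toy scores are all hypothetical, and in practice the scoring function would presumably be some automatic evaluator of edit quality.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the highest-scoring candidate; ties go to the earliest one."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Hypothetical example: three candidate edits with made-up quality scores.
toy_scores = {"edit_a": 6.5, "edit_b": 8.0, "edit_c": 7.25}
best = best_of_n(list(toy_scores), toy_scores.__getitem__)
print(best)  # edit_b
```

The same helper applies unchanged whether the candidates come from direct editing or from one step of a sequential pipeline, since it only depends on a list of outputs and a scoring function.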
