ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions

Abstract

Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, which limits their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark, ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains.
