
ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions

June 3, 2025
作者: Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang
cs.AI

Abstract
Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark, ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains.
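To make the data format concrete: an instruction-guided editing dataset like the one described pairs each source image with a natural-language edit instruction and the resulting target image. The sketch below illustrates that triplet structure. All names (`EditingPair`, the file paths, and the sample instructions) are hypothetical illustrations, not the actual ByteMorph-6M schema or API.

```python
from dataclasses import dataclass

@dataclass
class EditingPair:
    """One hypothetical training example: source image, edit instruction, target image."""
    source_path: str   # original high-resolution image
    instruction: str   # natural-language edit, e.g. describing a non-rigid motion
    target_path: str   # image after the instructed edit is applied

# Illustrative samples covering the kinds of non-rigid edits the abstract
# mentions (human articulation, camera viewpoint shift). Paths are made up.
samples = [
    EditingPair("img/000001_src.png",
                "have the dancer bend forward",
                "img/000001_tgt.png"),
    EditingPair("img/000002_src.png",
                "shift the camera viewpoint slightly to the left",
                "img/000002_tgt.png"),
]

for pair in samples:
    print(f"{pair.instruction!r}: {pair.source_path} -> {pair.target_path}")
```

A model trained on such triplets learns to map (source image, instruction) to the edited image; the benchmark side of the framework then compares model outputs against the held-out target images.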